MOVA: Towards Scalable and Synchronized Video-Audio Generation
Key Summary
- MOVA is an open-source AI that makes videos and sounds at the same time so mouths, actions, and noises match perfectly.
- It uses two expert towers (one for video, one for audio) that talk to each other while creating, so the sound and picture stay in sync.
- A special Bridge with bidirectional cross-attention lets video guide audio and audio guide video during every step of generation.
- Aligned RoPE keeps audio and video on the same timeline so the same moment in time lines up across both streams.
- Training uses flow matching and independent noise schedules (dual sigma shift) so each modality learns at the pace it needs.
- A large, carefully filtered dataset with rich audio-video captions teaches MOVA to handle speech, sound effects, and music.
- Dual Classifier-Free Guidance at inference lets you dial between extra-strong sync and super-clear speech with simple sliders.
- MOVA achieves strong lip-sync, better audio-visual alignment, and high human preference compared to prior open systems.
- It supports both Text+Image to Video+Audio and Text-only to Video+Audio, and includes tools for efficient inference and fine-tuning.
- By releasing weights and code, MOVA enables researchers and creators to build synchronized stories, lessons, and games.
Why This Research Matters
MOVA helps videos feel real by making what you see and what you hear grow together, not patched afterward. That means smoother learning videos where lips match words in any language, more believable movie drafts with footsteps and music right on cue, and better accessibility when dubbing voices to match mouth shapes. Creators can control a simple trade-off between crystal-clear speech and ultra-tight sync, depending on their goals. Because MOVA is open-source with released weights and code, researchers and developers can extend it, improve it, and adapt it to new uses. This transparency accelerates progress for education, entertainment, and communication alike.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how watching a cartoon without sound feels empty, and watching a song with off-beat lyrics feels wrong? Our brains expect what we see and what we hear to match in time.
The Concept: Cascaded pipelines.
- What it is: A cascaded pipeline makes video first, then separately makes audio (or the other way around), and stitches them together.
- How it works:
- Step 1: Generate a silent video (or an audio track).
- Step 2: Use that result as a cue to generate the other modality.
- Step 3: Align them after the fact and hope they fit.
- Why it matters: Without real-time back-and-forth between audio and video, tiny mismatches grow into lip drift, late sound effects, and strange mood changes. Anchor: Imagine filming a drum solo and adding the drum sounds later by guessing the hits. Even if you try hard, some hits won't line up with the sticks.
The World Before: For years, most open models could make beautiful videos but ignored sound. To add audio, people chained models into those cascaded pipelines. Meanwhile, some closed systems (like Veo 3 and Sora 2) showed that making picture and sound together makes magic: precise lip sync, footsteps at the right time, and music that matches the scene. But they didn't share code or weights. Researchers and creators wanting synchronized, controllable, and scalable video+audio were stuck.
The Concept: Joint (end-to-end) generation.
- What it is: A single model that makes video and audio at the same time, letting both influence each other.
- How it works:
- Read the user's text (and an optional first image).
- Grow video frames and audio samples step-by-step together.
- At every step, let video guide audio and audio guide video.
- Why it matters: Real-time conversation between modalities prevents drift; lips match words, bangs match crashes, and music fits the mood. Anchor: Like two dancers doing a duet, each adjusts to the other on every beat.
The Problem: Joint modeling is hard because video and audio are different "densities" of information. Video frames are big but sparse in time; audio is tiny per slice but very dense over time. If you fuse them naively, you risk time-mismatch (the same moment gets different positions), efficiency problems (too many tokens), and lost quality.
Failed Attempts: Early open models either kept things small (so quality plateaued), glued pretrained parts without deep interaction (so sync lagged), or synced to only simple sounds (like basic Foley) but not speech with perfect lips.
The Concept: Aligned time positions.
- What it is: A way to make sure audio tokens and video tokens that happen at the same real-world moment get matched in the model.
- How it works:
- Put both modalities on one shared time ruler.
- Adjust positional encodings so 1.5 seconds in video equals 1.5 seconds in audio.
- Keep this alignment inside cross-attention.
- Why it matters: Without it, the model compares Tuesday's audio to Wednesday's video; hello, drift. Anchor: It's like setting the same clock on two watches before a race so both runners hear the same "Go!" at the same time.
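The shared time ruler can be sketched in a few lines. The latent rates below (24 video frames per second, 50 audio tokens per second) are assumptions for illustration, not MOVA's published values:

```python
VIDEO_LATENT_FPS = 24.0   # assumed video latent frame rate
AUDIO_LATENT_FPS = 50.0   # assumed audio latent token rate

def video_time(frame_idx: int) -> float:
    """Wall-clock time of a video latent frame on the shared ruler."""
    return frame_idx / VIDEO_LATENT_FPS

def audio_time(token_idx: int) -> float:
    """Wall-clock time of an audio latent token on the shared ruler."""
    return token_idx / AUDIO_LATENT_FPS

def video_index_on_audio_grid(frame_idx: int) -> float:
    """Rescale a video index onto the denser audio grid so positional
    encodings for the same instant coincide across modalities."""
    return frame_idx * (AUDIO_LATENT_FPS / VIDEO_LATENT_FPS)
```

With these rates, video frame 36 and audio token 75 both sit at 1.5 seconds, and rescaling the video index lands it exactly at position 75 on the audio grid, so both watches read the same time.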
The Gap: The community needed a scalable, open-source system that (1) fuses audio and video tightly, (2) trains on a large, well-annotated dataset with high-quality audio-visual captions, and (3) proves you can keep improving as the model and data grow.
Real Stakes: This matters to everyone. Students can learn with lip-synced bilingual explainers. Filmmakers can prototype scenes with fitting music and crisp sound effects. Streamers can create highlight reels where cheers hit right on the shot. Accessibility tools can dub voices while matching mouth shapes. And researchers can finally build on a shared, strong, open foundation.
02 Core Idea
Hook: Imagine building a LEGO movie where the picture bricks and the sound bricks snap together at every step. If one shifts, the other adjusts instantly.
The Concept: Asymmetric dual-tower with a talking Bridge.
- What it is: MOVA uses two expert towers, one for video (big) and one for audio (smaller), connected by a bidirectional cross-attention Bridge that lets them exchange information every layer.
- How it works:
- Start from strong pretrained towers (video and audio) working in their own compressed spaces (latents).
- Add a Bridge that shares what each tower knows about the current moment.
- Align their internal clocks so the same time index matches across both.
- Train them together with a technique (flow matching) that teaches how to move noisy latents into clean ones.
- Why it matters: Without the Bridge, towers guess alone. Without alignment, they talk about different times. Without joint training, they never truly sync. Anchor: Two walkie-talkies (video and audio) share live directions while hiking; a trail map (aligned time) keeps them on the same path.
The "Aha!" in one sentence: Let video and audio be great at their own jobs, but make them listen to each other at every step on the same timeline.
Multiple Analogies:
- Orchestra: The video tower is the conductor's hands, the audio tower is the music. The Bridge is the eye-ear connection so gestures and sound move together.
- Sports replay: The video tower shows the swing; the audio tower times the bat's crack. The Bridge makes sure the crack happens exactly at contact.
- Cartoon dubbing: Video shapes the mouth; audio shapes the phonemes. The Bridge ensures vowels and lip-rounding match perfectly.
Before vs After:
- Before: Separate pipelines guessed how to fit together later, often causing lip drift, off-beat SFX, and mood mismatches.
- After: MOVA grows both streams together; lip-sync improves, impacts align with frames, and music follows visuals naturally.
The Concept: Flow matching (training the motion from noise to signal).
- What it is: A recipe that teaches the model the velocity (the direction and speed) to move noisy latents toward clean video and audio latents.
- How it works:
- Mix clean latent with noise at a random time.
- Ask the model to predict the velocity back to the clean target.
- Repeat many times so it learns the whole path from noise to content.
- Why it matters: Without a good path, the model wanders; with it, the growth is smooth and synchronized. Anchor: Like teaching a paper boat which way to swim upstream at different points in a river.
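The three steps above can be sketched as one training sample. This is a generic rectified-flow construction (linear path, velocity target noise minus clean); MOVA's exact parameterization may differ:

```python
import random

def make_flow_sample(x_clean, t, rng):
    """Build one flow-matching training pair at noise level t in [0, 1].
    Linear path: x_t = (1 - t) * clean + t * noise; the supervision target
    is the constant velocity along that path, v = noise - clean."""
    noise = [rng.gauss(0.0, 1.0) for _ in x_clean]
    x_t = [(1 - t) * c + t * n for c, n in zip(x_clean, noise)]
    target_v = [n - c for c, n in zip(x_clean, noise)]
    return x_t, target_v

def velocity_loss(pred_v, target_v):
    """Mean squared error between predicted and target velocity."""
    return sum((p - g) ** 2 for p, g in zip(pred_v, target_v)) / len(target_v)
```

Sanity check on the geometry: walking x_t backward by t along target_v recovers the clean latent exactly, which is why a model that learns the velocity can grow noise into content step by step.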
Why It Works (intuition):
- Specialization: Each tower stays an expert in its domain (video textures/motion; audio timbre/phonemes).
- Communication: The Bridge fuses clues from both sides; a moving mouth boosts speech clarity; a loud bang sharpens motion timing.
- Timing: Aligned positional encoding means the same millisecond lines up for both streams.
- Pacing: Independent noise schedules let audio and video learn at their own rhythms yet stay coordinated.
Building Blocks:
- Pretrained video tower (Mixture-of-Experts) for high-fidelity frames.
- Pretrained audio tower for speech, sounds, and music.
- Bridge with bidirectional cross-attention.
- Aligned RoPE to share the same time grid.
- Flow matching objective for stable, synchronized learning.
- Dual sigma shift to pace modalities.
- Dual Classifier-Free Guidance to tune alignment vs speech naturalness at inference.
03 Methodology
At a high level: Text (and optional first frame) → Encode into latents → Dual towers grow video+audio together → Bridge fuses info each layer → Decode to pixels and waveform.
The Concept: Latent spaces with VAEs.
- What it is: VAEs compress heavy video frames and dense audio waves into smaller, learnable codes.
- How it works:
- Video VAE turns frames into a compact spatiotemporal grid.
- Audio VAE turns 48 kHz audio into a compact temporal code.
- The model operates in these latents, then VAEs decode back to real video/audio.
- Why it matters: Without latents, training and sampling would be too slow and memory-hungry. Anchor: Shrinking a giant poster and a long cassette into neat index cards you can shuffle quickly, then re-expand later.
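A back-of-envelope count shows why latents matter. The compression factors below (4x in time, 8x per spatial axis, 16 latent channels) are illustrative assumptions, not MOVA's published numbers:

```python
def raw_video_values(frames, height, width, channels=3):
    """Number of raw pixel values in a clip."""
    return frames * height * width * channels

def latent_video_values(frames, height, width, ct=4, cs=8, latent_channels=16):
    """Values left after a VAE compresses time by ct and each spatial axis
    by cs (factors and channel count are assumed for illustration)."""
    return (frames // ct) * (height // cs) * (width // cs) * latent_channels

raw = raw_video_values(192, 720, 1280)
lat = latent_video_values(192, 720, 1280)
compression = raw // lat  # 48x fewer values with these assumed factors
```

Attention cost grows roughly with the square of token count, so even a modest compression like this makes joint video+audio training tractable.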
Step-by-step recipe:
- Inputs and conditioning
- Take a text prompt (e.g., "Two kids laugh as a dog splashes in a pool; upbeat ukulele plays"). Optionally take a first video frame as visual style anchor.
- Extract structured visual details with a vision-language model and refine the prompt with an LLM so it matches training style.
- Example: From a phone photo of a sunny backyard, the system notes: warm lighting, medium shot, child on left, dog near center. The LLM adds temporal narrative: child points, dog jumps, splash, laughter, ukulele strum.
- Encode to latents
- Run the first frame through the video VAE encoder to get video latent tokens.
- Run a placeholder (or silence) through the audio VAE encoder to set up audio tokens.
- Example: 8.05 seconds at 24 fps becomes 193 latent frames; audio becomes a matching-length temporal code.
- Asymmetric dual-tower DiTs with a Bridge. Hook: Think of two classmates solving a project: one draws (video), one composes (audio). They pass notes every few minutes so the music and pictures fit.
The Concept: Asymmetric dual-tower architecture.
- What it is: A larger video Transformer and a smaller audio Transformer run in parallel.
- How it works:
- Each layer updates video and audio tokens.
- A Bridge module adds bidirectional cross-attention so each side can look at the otherās hidden states.
- Towers keep their strengths while still aligning timing and events.
- Why it matters: Without two specialized towers, quality drops; without a Bridge, sync suffers. Anchor: The art kid and the music kid coordinate to make a perfect school play.
The Concept: Bidirectional cross-attention (the Bridge).
- What it is: A two-way listen-and-tell block (video listens to audio; audio listens to video) inside each interaction layer.
- How it works:
- Compute attention from video queries to audio keys/values (audio→video info).
- Compute attention from audio queries to video keys/values (video→audio info).
- Merge both into the towersā states before the next layer.
- Why it matters: Without two-way flow, only one modality leads; with it, they co-lead. Anchor: Two friends whispering clues into each other's ears while solving a mystery.
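The three Bridge steps can be sketched as a minimal, projection-free block. Real Bridges add learned query/key/value projections, multiple heads, and normalization; this strips everything down to the two attention directions and the residual merge:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(queries, keys, values):
    """Each query vector attends over keys and returns a weighted mix of
    values (scaled dot-product attention, single head, no projections)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def bridge(video_h, audio_h):
    """Bidirectional Bridge: each stream reads the other's hidden states
    and adds the result residually before the next layer."""
    audio_to_video = cross_attend(video_h, audio_h, audio_h)  # video listens
    video_to_audio = cross_attend(audio_h, video_h, video_h)  # audio listens
    video_out = [[h + u for h, u in zip(row, upd)]
                 for row, upd in zip(video_h, audio_to_video)]
    audio_out = [[h + u for h, u in zip(row, upd)]
                 for row, upd in zip(audio_h, video_to_audio)]
    return video_out, audio_out
```

The residual form matters: because each tower only receives an additive update, the pretrained towers stay mostly intact while the Bridge layers learn the cross-modal conversation.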
- Keep time aligned with Aligned RoPE. Hook: You know how two people clapping stay in sync if they count the same beats?
The Concept: Aligned RoPE (rotary position embeddings alignment).
- What it is: A way to convert positions so audio and video share the same time scale.
- How it works:
- Measure videoās and audioās latent frame rates.
- Scale video time indices to audioās denser grid.
- Apply the same positional rotation to both so "second 3.2" lines up in attention.
- Why it matters: Without it, attention compares the wrong moments and drifts. Anchor: Setting both metronomes to the same BPM so nobody gets ahead.
- Learn the path with Flow Matching
- Mix clean latents with noise at a sampled time t for audio and video separately.
- The model predicts the velocity to move back toward clean; loss encourages accurate direction for both modalities.
- Example: At t=0.6, the model sees a half-noisy smiling mouth and a fuzzy "ah" sound; it learns to sharpen both together.
- Pace each modality with Dual Sigma Shift. Hook: When learning piano and dance, you might practice fingers slowly and steps a bit faster so both improve.
The Concept: Dual sigma shift (independent noise schedules).
- What it is: Audio and video pick their own noise levels per step during training and sampling.
- How it works:
- Sample t_v for video and t_a for audio independently.
- Use schedules that start gentle for audio (to learn timbre) and stronger for video (to learn denoising), then align later.
- Optionally tweak at inference without retraining.
- Why it matters: A single pace fits neither perfectly; dual paces balance stability and detail. Anchor: Training wheels come off the bike later for one kid, sooner for the other, so both ride well.
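One way to realize independent schedules is a per-modality timestep warp. The formula below is a common schedule-shift trick from flow-matching practice, and the shift factors are hypothetical; the text does not give MOVA's actual schedules:

```python
def shift(t, s):
    """Warp a timestep t in [0, 1] by shift factor s using
    t' = s*t / (1 + (s - 1)*t); s > 1 keeps the trajectory at higher
    noise for longer. Endpoints 0 and 1 are preserved."""
    return s * t / (1 + (s - 1) * t)

def dual_sigma(t, s_video=3.0, s_audio=1.5):
    """Each modality warps the shared sampler step with its own factor,
    so both advance together but at different effective noise levels
    (the values 3.0 and 1.5 are made up for illustration)."""
    return shift(t, s_video), shift(t, s_audio)
```

Because the warp only reparameterizes time, the two streams still start and finish together; they just spend different fractions of the trajectory on coarse structure versus fine detail.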
- Decode to outputs
- After iterative refinement, decode video latents back to frames and audio latents back to waveforms.
- Example: You get a 720p clip where a person says "Hello!" and their lips match the syllables exactly.
Secret Sauce:
- The Bridge is light but powerful, letting towers stay mostly intact while adding rich fusion.
- Aligned RoPE prevents time slips.
- Dual sigma shift stabilizes learning across modalities.
- A fine-grained data curation pipeline with audio-visual captioning teaches not just "what" but "when."
Inference controls with Dual CFG. Hook: Like two volume knobs, one for following the text idea, one for tightening the A/V sync.
The Concept: Dual Classifier-Free Guidance (dual CFG).
- What it is: A way to separately boost text guidance and cross-modal alignment guidance during sampling.
- How it works:
- Run branches with/without text and with/without Bridge.
- Combine predictions with two sliders: sT (text) and sB (Bridge alignment).
- Turn sB up for tighter sync; turn sT up for stronger instruction following.
- Why it matters: One size doesn't fit all; some prompts need tighter lips, others need clearer narration. Anchor: Adjusting bass and treble until the song fits the scene perfectly.
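One plausible way to combine the branch predictions with the two sliders is the standard multi-condition CFG sum of guided directions; the paper's exact formulation may differ from this sketch:

```python
def dual_cfg(v_plain, v_text, v_bridge, s_text, s_bridge):
    """Guided velocity: start from the unconditioned prediction and push
    separately along the text direction and the Bridge-alignment direction:
    v = v_plain + s_text * (v_text - v_plain) + s_bridge * (v_bridge - v_plain)
    All three inputs are velocity predictions from different branches."""
    return [p + s_text * (t - p) + s_bridge * (b - p)
            for p, t, b in zip(v_plain, v_text, v_bridge)]
```

With s_text = 1 and s_bridge = 0 this reduces to the plain text-conditioned prediction; raising s_bridge pushes samples toward tighter A/V sync, matching the trade-off the sliders describe.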
Data engineering with audio-visual captioning. Hook: You know how a good recipe card tells you ingredients and timing? Good training needs both the what and the when.
The Concept: Audio-video captioning pipeline.
- What it is: A three-stage system to clean videos, rate quality, and write rich captions that cover visuals, speech, and sounds.
- How it works:
- Preprocess: normalize aspect, fps, clip length; detect speech and scene cuts.
- Filter: keep only clips with good audio, good video, and strong A/V sync.
- Caption: use models to describe visuals, transcribe speech, name sounds, then merge into one coherent prompt.
- Why it matters: Without accurate, time-aware labels, the model can't learn precise synchronization. Anchor: A sports play-by-play plus color commentary that together tell the whole story at the right times.
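The filter and caption stages might look like the gate and merge below; every field name, threshold, and format here is hypothetical, shown only to make the pipeline's shape concrete:

```python
def keep_clip(clip: dict) -> bool:
    """Stage-2 gate: keep only clips that pass video quality, audio
    quality, and audio-visual sync checks (thresholds are made up)."""
    return (clip["video_quality"] >= 0.7
            and clip["audio_quality"] >= 0.7
            and clip["av_sync"] >= 0.8)

def merge_caption(visual: str, transcript: str, sounds: str) -> str:
    """Stage-3 merge: fold the visual description, speech transcript, and
    sound tags into one coherent training prompt (format is illustrative)."""
    return f'{visual} Speech: "{transcript}" Sounds: {sounds}'
```

Gating on all three scores at once is the key design choice: a clip with gorgeous video but drifting audio teaches the model the wrong timing, so it is cheaper to drop it than to learn from it.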
04 Experiments & Results
The Test: MOVA was measured on audio quality (clarity and variety), sync (timing and meaning), lip-sync accuracy, and instruction following in multi-speaker scenes. Think of it as grading singing pitch, dancing on-beat, mouth movements matching lyrics, and who is speaking when.
The Competition: It was compared to LTX-2 and OVI (both joint generators) and to a cascaded setup (WAN2.1 video + MMAudio audio). All ran at 720p and followed their best practices.
Scoreboard with context:
- Audio fidelity/diversity (IS) and speech quality (DNSMOS): MOVA-360p scored IS ≈ 4.27 and DNSMOS ≈ 3.80, like getting an A when others are at B or B+. The cascaded system had decent variety but struggled with clear, intelligible speech across settings.
- A/V alignment: With dual CFG, MOVA's DeSync dropped substantially and IB-Score rose, like consistently clapping on beat while also playing the right melody for the scene. Compared to LTX-2 and OVI, MOVA showed a big jump in semantic alignment, meaning sounds better matched what was happening visually.
- Lip sync (LSE-D↓, LSE-C↑): MOVA with dual CFG hit top performance; think of mouths landing syllables right on time, frame after frame.
- Multi-speaker cpCER: MOVA-720p achieved a lower error (≈0.149), which is like correctly attributing lines to the right character more often than others. Dual CFG sometimes raises cpCER slightly because pushing sync very hard can soften strict instruction-following like speaker tags, but overall accuracy remained strong.
Surprising findings:
- Scaling resolution: MOVA held up when moving from 360p to 720p. Even with more visual detail to manage, it kept sync and improved multi-speaker attribution, a sign the foundation scales well.
- Emergent T2VA: Even without a first frame, MOVA (Text-only to Video+Audio) stayed well synchronized and sometimes improved IS, showing it learned a robust joint prior rather than relying on a starting picture to force structure.
- Dual CFG trade-off: Turning up the Bridge guidance (sB) systematically improved sync metrics but could slightly degrade speech naturalness and instruction following. This is a useful, predictable knob depending on your goal: perfect lips for dubbing vs. maximum narration clarity for educational videos.
Numbers as grades:
- Where baselines got something like B- to B in sync and semantic match, MOVA frequently reached B+ to A-, and with dual CFG, often A-level lip-sync.
- Human preference arenas put MOVA on top across pairwise comparisons, confirming that objective wins felt better to viewers too.
Takeaway: By combining a strong architecture with time alignment, pacing controls, and rich data, MOVA beats prior open approaches in the categories people notice most: mouths matching words, sounds landing on action, and overall believability.
05 Discussion & Limitations
Limitations:
- Singing and complex music: The 1.3B audio tower can struggle with rich harmonics and long-range musical structure, so choir pieces or intricate solos may not be perfect.
- Crowded talk scenes: Very fast turn-taking or overlaps can still confuse who's speaking, leading to occasional lip assignment errors.
- Physics of sound: Subtle causal effects (like thunder arriving after lightning) aren't explicitly enforced and may be missed in tricky scenes.
- Compute and length: High resolution and 8-second clips are expensive; longer stories need more clever compression or hierarchical generation.
Required Resources:
- Strong GPUs/NPUs and distributed training/inference setups for high-res outputs.
- The released weights and codebase support efficient inference, LoRA fine-tuning, and prompt enhancement, but creators still benefit from VRAM-rich hardware for best speed.
When NOT to Use:
- If you only need top-tier standalone music generation with long-form structure.
- If you need guaranteed diarization-perfect multi-speaker dubbing without post-checks.
- If ultra-low-latency generation on tiny devices is mandatory.
Open Questions:
- Can we scale the audio tower and data further to master singing and complex music while keeping sync?
- What's the best way to inject physical causality (like propagation delays) without hurting creativity?
- Can we design lighter, hierarchical token schemes to tell longer stories at lower cost?
- How can we build better multi-speaker labels and active-speaker detection to reduce confusion?
Overall, MOVA is a big step for open, synchronized A/V generation: strong today and structured to keep improving with scale.
06 Conclusion & Future Work
Three-sentence summary: MOVA is an open, scalable model that generates video and audio together so they stay synchronized, using two expert towers linked by a smart Bridge and aligned timelines. It's trained with a robust flow-matching recipe, dual noise schedules, and a carefully curated, richly captioned dataset, then guided at inference by dual CFG to balance sync and naturalness. The result is strong lip-sync, tight audio-visual alignment, and broad capability across speech, sound effects, and music, with code and weights released for the community.
Main achievement: Proving that a dual-tower + Bridge architecture, aligned in time and paced per modality, scales to high-fidelity, synchronized videoāaudio generation in the open.
Future directions:
- Scale the audio tower and dataset for richer music and singing.
- Add physics-aware objectives and better multi-speaker supervision.
- Explore hierarchical or blockwise generation for longer, cheaper stories.
- Keep improving prompt enhancement and fine-tuning tools for creators.
Why remember this: MOVA shows how to make pictures and sound grow together, not patched after, so stories feel real. It's a practical, open blueprint others can build on, from classrooms to film sets to research labs.
Practical Applications
- Create bilingual, lip-synced educational explainers where the teacher's mouth matches English and Chinese speech.
- Prototype film scenes with matched Foley (doors, footsteps, rain) and music that follows the camera and mood.
- Auto-dub talking-head content while preserving tight mouth-speech alignment for global audiences.
- Generate game highlight reels with cheers, commentary, and hit sounds that land exactly on key actions.
- Produce social media shorts (9:16) with aligned captions, mouth movement, and background music.
- Design virtual presenters or avatars that speak naturally with precise lip motion and expressive timing.
- Add scene-aware soundtracks to silent archival footage or animation, synchronized to visual events.
- Assist accessibility tools by aligning signers' or speakers' visual cues with clear, timed audio narration.
- Build multimodal storyboards where early drafts already have synced dialogue and environment sounds.
- Fine-tune MOVA with LoRA on niche domains (e.g., cooking shows, nature documentaries) for specialized sync and sound styles.