JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
Key Summary
- JavisDiT++ is a new AI that makes short videos and matching sounds from a text prompt, keeping sight and sound in sync.
- It uses one shared attention brain for both audio and video, plus two expert branches (one for audio, one for video) to polish each modality.
- A timing trick called TA-RoPE lines up audio and video at every moment, like matching drum hits to each frame of a drummer’s stick.
- The model is trained to match human taste using AV-DPO, which picks better samples based on quality, consistency, and synchrony.
- Despite using only about 1 million public training examples, it beats other open-source systems on quality and sync, and narrows the gap with Veo-3.
- On a key benchmark, JavisDiT++ gets lower FVD (better video), lower FAD (better audio), higher JavisScore (better sync), and lower DeSync (less lag).
- TA-RoPE brings stronger timing alignment without slowing inference, unlike older cross-attention or prior-based methods.
- Preference alignment (AV-DPO) makes people like the results more: over a 70% win rate versus two strong baselines, and DPO adds an extra 25% preferred videos.
- The design is simple and efficient: shared attention for cross-talk, modality-specific FFNs for craft, and clever positional IDs to avoid confusion.
- Code, model, and datasets are released to help others build better joint audio-video generation systems.
Why This Research Matters
When audio and video don’t line up, your brain notices right away, and the experience feels fake. JavisDiT++ makes sight and sound agree at each moment, which is crucial for convincing short videos, educational clips, and game scenes. It teaches a simple recipe that others can reuse: shared attention for cross-talk, separate experts for quality, explicit timing alignment, and human preference tuning. Because it’s efficient and trained on public data, more people can build on it without huge budgets. This can speed up creative workflows, help independent teams prototype films and games, and make learning materials more engaging. In short, it brings high-quality, synchronized audiovisual generation closer to everyday creators.
Detailed Explanation
01 Background & Problem Definition
You know how a good movie feels right when the footsteps, the door slams, and the music all match what you see? That magic is audio and video working together in sync. Now imagine asking a computer to create both the pictures and the sounds from just a sentence like, “A brown bear walks toward the camera, growling in a green forest.” That’s exciting—and hard.
Before this paper, AI got pretty good at making images and then videos from text. It also learned to make sounds like rain or an engine. But making both together—so that the sound and picture line up in time and meaning—was still messy. Videos looked okay, but the sounds sometimes came too early, too late, or didn’t match the action. Open-source systems tended to either glue two separate models together or mash audio and video into one space that blurred important details.
The problem: joint audio-video generation (JAVG) needs three things at once. First, each part must be high quality (pretty video, clean audio). Second, the two parts must talk to each other so the sound and visuals fit the same story. Third, the timing must be tight—when a car zooms by in frame 37, the engine roar should spike at that same moment.
People tried several approaches. Some forced both audio and video into a single shared latent space. That made the model simpler but squashed modality-specific details: you lose video crispness or audio texture. Others built two big pipelines (one for audio, one for video) and tried to make them talk with extra cross-attention layers or special synchronization priors. This helped alignment a bit but made models big, slow, and tricky to train. Another path chained two separate generators in a row (like first text-to-video, then video-to-audio). That often produced off-sync pairs, because the second model guesses timing after the fact.
The missing piece was a simple, unified recipe that lets audio and video share what they need (so they agree on meaning and timing) while still letting each be its best self (so neither loses quality). And we also needed a way to teach the model what people actually prefer—because numbers don’t always capture human taste.
Why should anyone care? Think about short videos, games, VR, education, or film pre-visualization. A creator might type, “Waves crash on rocks under a stormy sky; gulls cry overhead,” and get a short clip where the gulls, splashes, and thunder feel perfectly placed. When sound and vision don’t match, your brain notices instantly—and the whole scene feels fake. Make them match, and your story sings.
So this paper brings three fresh ideas together. One, a unified transformer that shares attention across modalities but keeps separate expert layers (one for video, one for audio) so each can shine. Two, a timing trick called TA-RoPE that puts audio and video on the same timeline at a frame-by-frame level, without extra compute. Three, a teaching method called AV-DPO that nudges the model toward what people like, using paired comparisons across clear dimensions like quality, consistency, and synchrony.
With just about one million public examples and a compact backbone, JavisDiT++ scores better than prior open-source systems on quality, text match, and sync, and even closes in on a leading proprietary model. In short: simpler, smarter, and more in tune with how humans watch and listen.
02 Core Idea
Aha! Moment in one sentence: Share the right parts, separate the rest, and line up time so audio and video grow together in lockstep—and then fine-tune it all toward human taste.
Multiple analogies:
- Orchestra analogy: The shared attention is the conductor who keeps all sections together; the audio and video expert layers are the specialized groups (strings vs. brass) practicing their parts; TA-RoPE is the shared metronome; AV-DPO is the audience feedback after rehearsal.
- Kitchen analogy: One shared prep table (attention) where chefs plan the meal jointly; two dedicated cooking stations (audio/video FFNs) to perfect each dish; a synced timer (TA-RoPE) for when to take each dish off the heat; diners’ ratings (AV-DPO) to tweak the recipe.
- Comics with sound effects: Storyboards (attention) define the whole scene; separate artists ink visuals and letter “BANG!” sounds (expert FFNs); page numbers ensure each “BANG!” appears in the right panel (TA-RoPE); reader surveys (AV-DPO) pick the most exciting version.
Before vs. After:
- Before: Models either mixed modalities too much (blurring details) or kept them too separate (needing heavy cross-links), and timing alignment was indirect or costly. Preference alignment for JAVG was basically missing.
- After: JavisDiT++ uses one attention space for strong cross-talk, plus modality-specific FFNs to preserve each modality’s quality. TA-RoPE brings explicit, frame-level time matching at no extra inference cost. AV-DPO makes outputs look and sound more like what people actually prefer.
Why it works (the intuition):
- Shared attention lets audio and video tokens “see” each other’s context, so the roar matches the race car and the splash matches the wave.
- Modality-specific FFNs give each side its own craftsman’s bench, so details like textures, motion patterns, pitches, and timbres don’t get muddled.
- TA-RoPE sets a clear, shared timeline for both streams and avoids positional ID collisions, so the model knows “this moment belongs together” rather than mixing up who goes where.
- AV-DPO turns fuzzy human taste into clean training nudges by comparing pairs and pushing the model toward consistent winners across audio, video, and their alignment.
Building blocks (explained with the Sandwich pattern):
Sandwich: Joint Audio-Video Generation (JAVG)
- Top Bread (Hook): Imagine making a mini-movie and its soundtrack at the same time from just one sentence.
- Filling: JAVG is creating video frames and the matching sounds together from text.
How it works:
- Read the text prompt to understand the scene and actions.
- Plan pictures (video tokens) and sounds (audio tokens) that fit the plan.
- Generate both so they match in meaning and timing.
- Why it matters: Without JAVG, you get pretty pictures with wrong sounds—or good sounds with mismatched visuals.
- Bottom Bread (Anchor): Type “A sports car races past; the engine roars,” and get a clip where the car speeds by as the roar peaks at just the right moment.
Sandwich: Self-Attention
- Top: You know how in a group conversation you listen harder to the person saying what matters most?
- Filling: Self-attention lets the model focus on the most relevant tokens when deciding what to generate next.
How:
- Compare each token to every other.
- Score who matters most.
- Mix information weighted by scores.
- Why it matters: Without attention, the model treats every detail equally and misses key cues.
- Bottom: To match a drum hit, the model focuses on frames with the drummer’s stick striking.
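The compare/score/mix steps above can be sketched in a few lines of NumPy. This is a minimal single-head version with random stand-in weights, not the model's actual layers:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Minimal single-head self-attention over a sequence of token vectors."""
    d = x.shape[-1]
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                    # compare every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: who matters most
    return weights @ v                               # mix info weighted by scores

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))                     # 5 tokens, dimension 8
wq, wk, wv = (rng.normal(size=(8, 8)) / np.sqrt(8) for _ in range(3))
out = self_attention(tokens, wq, wk, wv)
print(out.shape)                                     # (5, 8)
```

Each output row is a weighted blend of all token values, so every token can draw on the full context.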
Sandwich: Feed-Forward Network (FFN)
- Top: Think of a simple machine that takes parts in one side and pushes out polished pieces the other.
- Filling: FFNs transform token features after attention to sharpen and combine details.
How:
- Expand features.
- Apply a nonlinearity.
- Compress back to size.
- Why it matters: Without FFNs, features stay fuzzy and unrefined.
- Bottom: After attention spots a wave crash, the FFN sharpens the foam texture.
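The expand/nonlinearity/compress recipe is tiny in code. Here is a sketch with random stand-in weights and ReLU; real models typically use GELU and learned parameters:

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward: expand, nonlinearity, compress back."""
    h = np.maximum(x @ w1 + b1, 0.0)   # expand features and apply ReLU
    return h @ w2 + b2                 # compress back to the model dimension

rng = np.random.default_rng(0)
d, hidden = 8, 32                      # a typical expansion ratio is 4x
x = rng.normal(size=(5, d))
w1, b1 = rng.normal(size=(d, hidden)), np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, d)), np.zeros(d)
y = ffn(x, w1, b1, w2, b2)
print(y.shape)                         # (5, 8)
```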
Sandwich: Modality-Specific Mixture-of-Experts (MS-MoE)
- Top: You’d ask a guitar tech about strings and a cameraman about lenses, not one person for both.
- Filling: MS-MoE uses shared attention for cross-talk, then separate FFNs for audio and video tokens.
How:
- Mix audio and video tokens in shared attention to agree on story beats.
- Split tokens by modality.
- Send audio tokens to the audio FFN and video tokens to the video FFN.
- Why it matters: Without separate experts, audio and video details smear each other and both suffer.
- Bottom: The engine’s timbre refines in the audio branch; tire smoke sharpens in the video branch.
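The routing idea can be sketched as shared attention over the stacked tokens, then a boolean mask that sends each token to its own expert. The attention and FFN functions here are toy stand-ins, not the paper's layers:

```python
import numpy as np

def ms_moe_block(tokens, is_audio, shared_attn, audio_ffn, video_ffn):
    """Shared attention over concatenated tokens, then per-modality FFNs.

    tokens:   (N, d) audio and video tokens stacked together
    is_audio: (N,) boolean mask marking which rows are audio tokens
    """
    h = shared_attn(tokens)                   # cross-modal mixing: every token
                                              # can attend to every other token
    out = np.empty_like(h)
    out[is_audio] = audio_ffn(h[is_audio])    # audio expert refines timbre
    out[~is_audio] = video_ffn(h[~is_audio])  # video expert refines texture
    return out

tokens = np.arange(12, dtype=float).reshape(6, 2)
mask = np.array([True, True, False, False, False, False])
out = ms_moe_block(tokens, mask, lambda x: x, lambda a: a * 2, lambda v: v * 3)
```

Because the split happens after attention, the modalities still agree on the story; only the final polishing is separate.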
Sandwich: Rotary Position Embedding (RoPE)
- Top: Like page numbers help you follow a book, positions help the model follow order.
- Filling: RoPE encodes where a token is in time and space so attention can use relative positions.
How:
- Assign positions to tokens (time, height, width).
- Rotate queries/keys by these positions.
- Let attention compare positions consistently.
- Why it matters: Without positions, the model loses the sense of “when” and “where.”
- Bottom: It knows frame 12 comes after frame 11 and near the same place on screen.
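A minimal RoPE sketch, showing its key property: after rotating queries and keys, their dot product depends only on the relative offset between positions. The frequency schedule follows the common base-10000 convention:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate feature-dimension pairs by angles proportional to position."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)     # one frequency per dim pair
    angles = np.outer(np.atleast_1d(pos), freqs)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(1, 8)), rng.normal(size=(1, 8))
# Same relative offset (2) at different absolute positions gives the same score.
d1 = (rope(q, 5) @ rope(k, 3).T).item()
d2 = (rope(q, 9) @ rope(k, 7).T).item()
```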
Sandwich: Temporal-Aligned RoPE (TA-RoPE)
- Top: Think of a shared metronome that both the band and the light show follow.
- Filling: TA-RoPE lines up audio and video positions on the same absolute timeline, while offsetting other dimensions to avoid overlap.
How:
- Give audio tokens time IDs that match video frame IDs.
- Offset audio’s other position dimensions so they don’t collide with video’s.
- Apply RoPE so attention respects this shared clock.
- Why it matters: Without TA-RoPE, sounds can drift from visuals or get position-confused.
- Bottom: The splash sound peaks right when the water visibly hits the rock.
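The ID scheme can be sketched as follows. The exact mapping and offsets in the paper may differ, but the idea is a shared absolute time axis plus non-colliding spatial IDs:

```python
def build_position_ids(n_frames, height, width, n_audio_tokens):
    """Sketch of TA-RoPE-style (time, height, width) position IDs.

    Video tokens tile the (t, h, w) grid. Audio tokens reuse the SAME
    absolute time axis (each audio token is mapped onto the video frame
    timeline), while their height/width IDs sit past the video grid so
    no audio token shares a full position with a video token.
    """
    video_ids = {(t, h, w) for t in range(n_frames)
                 for h in range(height) for w in range(width)}
    scale = (n_frames - 1) / max(n_audio_tokens - 1, 1)
    audio_ids = [(round(i * scale), height, width)   # offset h/w: no collision
                 for i in range(n_audio_tokens)]
    return video_ids, audio_ids

video_ids, audio_ids = build_position_ids(8, 2, 2, 16)
# Audio time IDs span exactly the video frame range: a shared metronome.
```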
Sandwich: Audio-Video Direct Preference Optimization (AV-DPO)
- Top: Like having judges pick the better of two clips, then teaching the model to make more like the winner.
- Filling: AV-DPO uses winner–loser pairs graded across audio quality, video quality, and alignment to push the model toward human favorites.
How:
- Generate multiple candidates per prompt.
- Score them with reward models (audio, video, sync) and normalize.
- Form pairs where one is clearly better across modalities.
- Train the model to prefer winners while staying stable.
- Why it matters: Without preference learning, the model may optimize numbers that don’t match what people enjoy.
- Bottom: With DPO, people pick the new outputs more often in blind tests.
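The core DPO objective is compact. For flow or diffusion models, the log-probabilities below are typically replaced by negative denoising losses, so treat this as the shape of the loss rather than the paper's exact implementation:

```python
import math

def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """DPO: grow the policy's winner-vs-loser margin beyond the reference's.

    Each argument is a (log-)score of the winner/loser sample under the
    trained policy or the frozen reference model.
    """
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# The loss drops as the policy widens its margin on the preferred sample.
loose = dpo_loss(policy_w=-1.0, policy_l=-1.0, ref_w=-1.0, ref_l=-1.0)
tight = dpo_loss(policy_w=-0.5, policy_l=-1.5, ref_w=-1.0, ref_l=-1.0)
```

Anchoring to a frozen reference model is what keeps training stable: the policy is rewarded for preferring winners, not for drifting arbitrarily far.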
03 Methodology
At a high level: Text prompt → tokenize and encode (text) → sample noisy audio/video latents (via frozen VAEs) → shared-attention DiT layers (audio+video tokens together) → split to modality-specific FFNs (MS-MoE) → TA-RoPE for frame-level timing → decode to waveform and frames → output synchronized audio-video.
Step-by-step (with what/why/examples):
- Inputs and encoders
- What: A text prompt is encoded by a frozen text encoder; audio and video will live in compact latent spaces thanks to frozen VAEs (video VAE from Wan2.1; audio VAE from AudioLDM2).
- Why: Latent spaces cut memory and compute, so we can handle many frames and long sounds.
- Example: “A turtle swims in turquoise water; birds chirp in the background.” Text features guide what to draw and what to sound like.
- Flow matching as the generator’s engine
- What: Use rectified flow (a diffusion-style training) to learn how to transform noise latents into clean audio and video latents over time.
- Why: Flow matching gives smooth, stable generation and pairs well with transformers.
- Example: Start from noisy audio/video codes and let the learned velocity field steer them into a clean turtle clip with watery gurgles.
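The training target is simple to state in code. This is a generic rectified-flow sketch, independent of the paper's exact latent shapes:

```python
import numpy as np

def rectified_flow_pair(x0_noise, x1_data, t):
    """One rectified-flow training pair.

    The model sees the straight-line interpolation x_t and must predict
    the constant velocity that carries noise to clean data.
    """
    x_t = (1.0 - t) * x0_noise + t * x1_data
    v_target = x1_data - x0_noise
    return x_t, v_target

rng = np.random.default_rng(0)
x0, x1 = rng.normal(size=4), rng.normal(size=4)
x_t, v = rectified_flow_pair(x0, x1, t=0.3)
# Following the target velocity for the remaining time lands exactly on data:
reconstructed = x_t + (1.0 - 0.3) * v
```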
- Shared attention for cross-modal context
- What: Concatenate audio and video tokens; run them through the same attention layers so they can attend to each other.
- Why: The model learns that visual splashes match watery sounds; cars go with engine roars.
- Example: A stick striking a drum in frame 17 attracts attention from nearby audio tokens that carry a “thwack” onset.
- Modality-Specific FFNs (MS-MoE)
- What: After attention, split tokens and pass them through separate audio and video FFNs.
- Why: Keeps each modality’s details crisp (e.g., video texture vs. audio timbre) without interference.
- Example: The audio FFN polishes the rippling water hiss; the video FFN sharpens bubbles and caustics.
- Temporal-Aligned RoPE (TA-RoPE)
- What: Align audio time IDs with video frame IDs and offset other audio position dimensions to avoid overlap with video positions.
- Why: Gives a clear, collision-free shared timeline so sounds hit when actions happen.
- Example: When a bear growls at time step i, both its frame tokens and the growl’s audio tokens share the same time position ID i.
- Training pipeline in three stages
- Stage A: Audio pretraining
- What: Train the audio FFN and audio bridge layers on 780K audio-text pairs.
- Why: Teach the model a rich audio vocabulary (rain, engines, birds) before mixing with video.
- Example: It learns bird chirps vs. thunder rumbles.
- Stage B: Audio-video supervised fine-tuning (SFT)
- What: Apply LoRA to the DiT and train jointly on 330K audio-video-text triplets; keep VAEs and base transformer weights frozen.
- Why: Add the skill of generating both together without forgetting video quality.
- Example: For “sports car,” it practices pairing sharp motion and engine rev ramps.
- Stage C: Audio-video DPO
- What: Collect about 25K preference pairs using reward models (AudioBox for audio aesthetics, VideoAlign for visual/motion, ImageBind for semantic matches, Synchformer for sync), normalize scores, form clear winners and losers, then train with AV-DPO.
- Why: Tune the model toward human tastes across audio quality, video quality, and synchrony.
- Example: Between two car clips, pick the one where the rev matches the acceleration curve and the scene looks crisper.
- Secret sauce
- Non-overlapping positional IDs with shared timing: Avoids confusion and boosts synchrony without extra inference cost.
- Shared-attention + modality FFNs: Maximum cross-talk where it helps, clean specialization where it counts.
- Modality-aware preference ranking: Prevents “good video but bad audio” from being labeled a winner; normalization keeps scores fair across reward models with different scales.
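The normalize-then-require-wins-everywhere idea behind the modality-aware ranking can be sketched like this. The scores are toy values and the paper's exact reward models, flips, and thresholds may differ:

```python
import statistics

def zscore(scores):
    """Normalize one reward model's outputs so scales are comparable."""
    mu = statistics.mean(scores)
    sd = statistics.pstdev(scores) or 1.0
    return [(s - mu) / sd for s in scores]

def is_clear_winner(cand_a, cand_b):
    """cand_a beats cand_b only if it wins on EVERY normalized reward axis."""
    return all(sa > sb for sa, sb in zip(cand_a, cand_b))

# Four candidates scored on (audio, video, sync) by different reward models.
raw = {"audio": [3.1, 2.8, 3.5, 2.9],
       "video": [0.61, 0.55, 0.72, 0.70],
       "sync":  [0.10, 0.40, 0.05, 0.30]}   # DeSync-style: lower is better
raw["sync"] = [-s for s in raw["sync"]]      # flip so higher always wins
norm = {k: zscore(v) for k, v in raw.items()}
cands = list(zip(norm["audio"], norm["video"], norm["sync"]))
```

Requiring a win on every axis is what keeps “great video, terrible audio” out of the winner pile.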
Concrete mini-run:
- Prompt: “A girl plays the piano.”
- Text encoder extracts features about a girl, instrument, gentle indoor scene.
- Sample noisy latents; TA-RoPE puts the key-press frames and their note onsets on the same time IDs.
- Shared attention binds hand motion cues to piano-note transients.
- Video FFN clarifies fingers and keys; audio FFN shapes note attacks and sustain.
- AV-DPO biases selection toward outputs where keystrokes and sounds align tightly.
- Decode to a short clip with matching piano audio.
What breaks without each step:
- No shared attention: Audio and video don’t agree on events; sync suffers.
- No modality-specific FFNs: Either audio or video (or both) lose detail.
- No TA-RoPE: The model guesses timing indirectly; onsets drift.
- No AV-DPO: Metrics may improve, but people like the results less in blind tests.
04 Experiments & Results
The tests: The team evaluated quality (how nice it looks/sounds), text consistency (does it match the prompt), audio–video consistency (do sound and picture fit the same story), and synchrony (do onsets line up). They measured with standard and modern metrics across 10,140 prompts (JavisBench) and used a smaller set for ablations.
Sandwich: Fréchet Video Distance (FVD)
- Top: Like checking if a new painting looks as good as a gallery of real paintings.
- Filling: FVD measures how close generated videos are to real ones in a learned feature space; lower is better.
How:
- Encode videos with a pretrained video network.
- Compare distributions (means/covariances) of real vs. generated features.
- Compute a distance score.
- Why it matters: It reflects overall visual realism and motion quality.
- Bottom: JavisDiT++ scores 141.5 vs. 204.1 (JavisDiT) and 194.2 (UniVerse-1), a clear win.
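The distance itself is just a formula over feature statistics. Here is a sketch for the diagonal-covariance case, where the matrix square root simplifies to an elementwise one; real FVD uses full covariances of features from a pretrained video network:

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances.

    General form: ||mu1 - mu2||^2 + Tr(C1 + C2 - 2*(C1 C2)^{1/2}).
    With diagonal covariances the matrix sqrt is an elementwise sqrt.
    """
    return (np.sum((mu1 - mu2) ** 2)
            + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))

real_mu, real_var = np.array([0.0, 0.0]), np.array([1.0, 1.0])
same = frechet_distance_diag(real_mu, real_var, real_mu, real_var)  # 0.0
far = frechet_distance_diag(real_mu, real_var,
                            np.array([3.0, 0.0]), np.array([1.0, 4.0]))
```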
Sandwich: Fréchet Audio Distance (FAD)
- Top: Like comparing the vibe of your song to a library of real songs.
- Filling: FAD checks how close generated audios are to real ones; lower is better.
How:
- Encode audios with a pretrained audio model.
- Compare distribution statistics.
- Output one distance.
- Why it matters: Captures perceptual audio quality.
- Bottom: JavisDiT++ gets 5.5 vs. 7.2 (JavisDiT), indicating cleaner, more natural sound.
Sandwich: JavisScore and DeSync
- Top: Imagine clapping exactly when a dancer lands—good timing feels right.
- Filling: JavisScore measures segment-by-segment cross-modal match (higher is better). DeSync estimates timing offset (lower is better).
How:
- Split clip into windows.
- Compare audio and video features per window.
- Aggregate to a score; separately estimate timing misalignment.
- Why it matters: Sync is the heart of believable audiovisuals.
- Bottom: JavisDiT++ achieves JavisScore 0.159 (higher than baselines) and DeSync 0.832 (lower/better than 1.039 for JavisDiT and 0.929 for UniVerse-1).
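The window-and-compare idea can be sketched as a mean of per-window cosine similarities. This is illustrative only; JavisScore's actual feature extractor and aggregation may differ:

```python
import numpy as np

def windowed_sync_score(audio_feats, video_feats, window=4):
    """Mean per-window cosine similarity between audio and video features."""
    n = min(len(audio_feats), len(video_feats))
    scores = []
    for start in range(0, n - window + 1, window):
        a = audio_feats[start:start + window].mean(axis=0)
        v = video_feats[start:start + window].mean(axis=0)
        scores.append(a @ v / (np.linalg.norm(a) * np.linalg.norm(v) + 1e-8))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
in_sync = windowed_sync_score(feats, feats)               # identical streams
shifted = windowed_sync_score(feats, np.roll(feats, 4, axis=0))
```

Shifting one stream by a window drops the score, which is exactly the sensitivity a sync metric needs.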
Competition and scoreboard (with context):
- Against JavisDiT and UniVerse-1 (both strong open-source baselines), JavisDiT++ wins on most metrics: better video (FVD), better audio (FAD), stronger text match, higher audio–video semantic fit (AV-IB/AVHScore), and tighter sync (JavisScore/DeSync). Think of it as moving from a solid B to a consistent A.
- Efficiency: Inference stays fast (about 10 seconds latency in their setup), near the Wan2.1 backbone, while two-stream methods are notably slower.
Surprising findings:
- TA-RoPE beats other sync tricks (like ST-Prior or extra frame-level cross-attention) while adding essentially zero inference cost. You get better sync for free.
- Preference alignment (AV-DPO) had modest metric gains but big human-perceived gains: in blind tests, the DPO-tuned model produced over 25% more preferred videos than the SFT-only model.
- Data matters doubly: Medium-quality, diverse data (330K) outperformed small high-quality-only or very large low-quality sets. Diversity plus solid quality was the best recipe.
Ablations highlight the design wins:
- Architecture: Shared attention plus modality FFNs (MS-MoE) preserves video quality and adds strong audio, beating LoRA-only or full finetune baselines on a shared backbone.
- Position encodings: Aligning time with non-overlapping position IDs is crucial. Interpolated or overlapping IDs hurt either audio quality or video quality and reduce sync.
- DPO rewards: Modality-aware, normalized ranking works best; mixing metrics naively or skipping normalization reduces the usefulness of preference signals.
05 Discussion & Limitations
Limitations:
- Data scale: About 1M public entries is efficient but smaller than proprietary giants. More diverse, high-quality pairs—especially with complex actions and rare sounds—should help.
- Model size: The 1.3B backbone (about 2.1B total with heads/experts) is compact; larger backbones might capture subtler correlations.
- Training strategy: Heavy use of parameter-efficient LoRA keeps things affordable but may leave a little performance on the table versus full finetuning.
- Controllability: Fine-grained control (e.g., exact beat timing, note pitches, or speech words) is not the focus here; adding precise control signals is an open path.
- Task breadth: This work targets text-to-audio-video; expanding to audio-to-video, video-to-audio, or mixed conditioning would test generality further.
Required resources:
- A modern GPU cluster for training (days on H100s were reported) and a few high-memory GPUs for inference.
- Frozen, high-quality audio and video VAEs and a strong text encoder.
- Curated reward models (AudioBox, VideoAlign, ImageBind, Synchformer) for preference data building.
When not to use:
- Very long videos or high resolutions beyond the tested 2–5 seconds and 240p–480p ranges, unless adapted.
- Tasks needing exact speech content or lyric alignment; a specialized speech/musical control model may be better.
- Real-time generation on edge devices with tight latency/compute budgets without further optimization.
Open questions:
- How far can TA-RoPE scale to longer sequences and higher frame rates?
- What’s the best balance of shared vs. modality-specific parameters as models scale up?
- Can we blend preference learning with richer human-in-the-loop feedback or interactive editing?
- How to add precise controllability (beats, phonemes, Foley cues) while keeping simplicity and speed?
- Can a single unified model handle T2AV, V2A, A2V, and multimodal prompts robustly?
06 Conclusion & Future Work
Three-sentence summary: JavisDiT++ is a simple, unified system that generates videos and matching sounds from text with high quality and tight timing. It shares attention across modalities, uses expert branches per modality, aligns timing with TA-RoPE, and learns human preferences with AV-DPO. Trained on about one million public examples, it outperforms prior open-source baselines across quality, consistency, and synchrony while staying efficient.
Main achievement: Showing that a clean architecture—shared attention plus modality-specific FFNs—combined with an explicit, zero-cost timing aligner (TA-RoPE) and modality-aware preference tuning (AV-DPO) can set a new open-source bar for joint audio-video generation.
Future directions:
- Scale data and backbone size to capture rarer sounds and subtler motions.
- Extend controllability for music, speech, and Foley timing.
- Generalize to any-to-any (text, audio, image, video) generation with a single model.
- Explore full finetuning and improved reward models for even stronger preference alignment.
Why remember this: It turns the “hard parts” of audiovisual generation—quality, sync, and taste—into a neat trio of solutions that are simple, efficient, and effective together. The timing trick (TA-RoPE) is especially elegant: better synchrony without slowing down. And the preference step (AV-DPO) reminds us that what people prefer is the real goal, not just higher scores.
Practical Applications
- Rapid prototyping for filmmakers: draft scenes with matching Foley sounds from text.
- Game development: generate ambient loops and matching environmental motion for quick world-building.
- Education: create short science clips (e.g., volcano eruptions with correct rumbles) to engage students.
- Advertising and social media: produce eye-catching, on-brand clips with synced sound effects.
- Accessibility: generate descriptive, synchronized audio cues that match visual events.
- Virtual production: pre-visualize scenes with timing-accurate temp audio before full sound design.
- VR/AR experiences: quickly mock up immersive scenes with believable audiovisual timing.
- Content localization: adapt audiovisual elements to new regions while keeping sync intact.
- Creative music videos: align camera motion and visual beats with automatically generated musical accents.
- Storyboarding with sound: turn shot lists into timed animatics with placeholder audio.