
DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation

Intermediate
Xu Guo, Fulong Ye, Qichao Sun et al. (2/12/2026)
arXiv

Key Summary

  • DreamID-Omni is one model that can create, edit, and animate human-centered videos with matching voices, all in sync.
  • It unifies three tasks—reference-based generation (R2AV), video editing (RV2AV), and audio-driven animation (RA2V)—by toggling which inputs you provide.
  • A Symmetric Conditional Diffusion Transformer cleanly separates 'who' (identity and voice timbre) from 'how' and 'when' (motion, layout, timing).
  • Dual-Level Disentanglement fixes common mix-ups where one person’s face gets another person’s voice, especially with multiple people.
  • Synchronized RoPE locks each person’s image and voice into the same attention ā€œslot,ā€ preventing cross-identity leaks.
  • Structured Captions use tags like <sub1> and <sub2> to map attributes and spoken lines to the correct person.
  • A progressive three-stage training plan builds a strong creative prior first, then adds stricter tasks, so the model doesn’t overfit and forget how to be flexible.
  • On a new benchmark (IDBench-Omni), DreamID-Omni matches or beats leading systems in video quality, audio quality, lip-sync, and identity-voice correctness.
  • It even lowers speaker confusion rates in multi-person scenes, a tough real-world problem.
  • This brings academic research closer to practical, commercial-grade video tools for creators, educators, and studios.

Why This Research Matters

Accurate, controllable video with matching voices helps educators, creators, and businesses produce content faster and more reliably. It reduces costly post-production fixes like manual lip-syncing or patching speaker mix-ups. By keeping face and voice tightly bound, it lowers the risk of misattribution in multi-person scenes, improving trust and safety. A single unified model simplifies production pipelines, cutting down tool-switching and error accumulation. It also opens doors to accessible tools: virtual presenters for remote learning, multilingual dubbing, and inclusive content. Finally, its clear structure (who, how, when) is a blueprint for future AI systems to stay controllable and transparent.

Detailed Explanation


01Background & Problem Definition

You know how movie-making uses lots of different jobs—actors, voice actors, editors—and if they don’t work together, the final film feels off? That’s what early AI video systems were like. We had impressive models that could make videos from text and others that could make audio, but they didn’t always cooperate well, especially when real human faces and real voices had to match. The world before this paper had two big streams of progress. First, powerful video generators could turn text into moving scenes. Second, audio models could make speech or sounds from text. Some newer systems even tried making both at the same time. But there were two missing pieces: precise control over which person appeared (identity control) and which voice they had (timbre control), and a single model that could do many related tasks without swapping tools.

šŸž Hook: Imagine trying to follow a class if your teacher reads a story but the pictures on the board don’t match the words. Confusing, right? 🄬 The Concept (Attention Mechanism): Attention in AI is a way for the model to focus on the most important parts of its input at each moment.

  • How it works: (1) Look at all inputs (words, frames, sounds). (2) Score each piece by how relevant it is. (3) Focus more on the high-scoring parts when deciding what to generate next. (4) Repeat at each step.
  • Why it matters: Without attention, the model treats everything as equally important—so it can miss key details and make mismatched audio and video. šŸž Anchor: When the prompt says, ā€œThe girl whispers,ā€ attention helps focus on ā€œgirlā€ and ā€œwhispers,ā€ so the mouth motion is small and the voice is soft.
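The score-then-focus loop above can be sketched in a few lines. This is a toy, self-contained version of scaled dot-product attention over tiny hand-made vectors; all numbers, token meanings, and the function name are illustrative, not from the paper:

```python
import math

def attend(query, keys, values):
    """Toy scaled dot-product attention.
    Scores each key by relevance to the query, softmaxes the scores
    into focus weights, and returns a weighted mix of the values."""
    d = len(query)
    # (2) Score each piece by how relevant it is (dot product, scaled).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax turns raw scores into focus weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # (3) Focus more on the high-scoring parts when producing the output.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# The token most similar to the query ("whispers") dominates the mix.
query = [1.0, 0.0]
keys = [[1.0, 0.0],   # toy token: "whispers"
        [0.0, 1.0]]   # toy token: "table"
values = [[10.0], [0.0]]
out = attend(query, keys, values)
```

Because the first key matches the query, its value gets most of the weight, so the output lands much closer to 10 than to 0.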

The problem researchers faced was practical: real uses need human-centric control. People want to: (1) generate new scenes using example faces and example voices (R2AV), (2) edit existing videos by swapping in a new face and voice (RV2AV), and (3) animate a person from a photo to match an input speech track (RA2V). Until now, these were treated as separate tasks with separate models. This made production clumsy: creators had to chain tools together, often losing quality or breaking sync along the way.

What did people try before? Video-only personalization (keeping a face consistent) worked, but it ignored voice. Talking-head animation followed an input speech track, but usually handled only one person well and struggled with multi-person conversations. Joint audio-video generators made nice demos from text, but they didn’t accept strong references (specific face/voice) and often failed to keep identity and voice locked together over time.

The gap was that everyone was solving slices of the same problem. The key observation in this paper is simple but powerful: all these tasks are really the same mapping—take a fixed identity (face and voice) and paint it onto a moving timeline (video layout, motions, and speech). If that’s true, then one model should handle all three tasks just by turning certain inputs on or off.

Why should anyone care? Because these capabilities map straight to daily life tools: safe and accurate dubbing, virtual presenters for classes, editing a team video so everyone looks and sounds consistent, and creating multi-person scenes without spending hours fixing lips or swapping audio in post-production. Also, when a model mixes up speakers in a conversation, it’s not just annoying; it can actually change the message or who is credited for saying what. Reducing that confusion makes AI video safer and more trustworthy.

02Core Idea

The 'Aha!' moment: Treat generation, editing, and animation as one problem—copy the right person (face + voice) into the right actions and words over time—then build one model that takes different combinations of inputs to do all of them.

Three analogies for the same idea:

  • Theater analogy: You have a cast list (who each actor is) and a script (what they say and do). Whether it’s a brand-new play, a rewrite of an old one, or dubbing a recorded performance, the job is to assign the right actor to the right lines at the right time.
  • Sticker-book analogy: Each person gets a clearly labeled sticker slot; you place the face sticker and the voice sticker into the same slot. Then you draw the background and motion around them.
  • Cooking analogy: Identities and voices are the ingredients; scene layout and timing are the recipe steps. Whether you start from scratch or remix leftovers, it’s still one kitchen and one chef.

šŸž Hook: Imagine your LEGO board has two lanes—one for pictures and one for sounds—and you want the same minifigure to appear in both lanes together. 🄬 The Concept (Symmetric Conditional Diffusion Transformer): It’s a two-lane (video and audio) generator that takes identity references (face images and voice clips) and structural guides (like source video or driving audio) in a balanced, mirrored way.

  • How it works: (1) Encode video and audio into latents. (2) Concatenate reference identities (face features for video, voice features for audio) to the noisy targets so the model can learn who is who. (3) Add structural conditions (source video layout or driving audio timing) directly to the latents to guide motion and timing. (4) Use cross-attention between lanes so lips and voices stay in step. (5) Switch tasks just by turning inputs on/off—no architecture change.
  • Why it matters: One model replaces many, reducing errors and keeping identity and lip-sync consistent across tasks. šŸž Anchor: If you give face+voice refs with no source video, you get R2AV; if you add a source video, you get RV2AV; if you add a driving audio instead, you get RA2V.
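The input-toggling idea can be sketched as one small function. This is a toy, list-based stand-in for the real latent assembly; the function name, zero-padding, and task-selection logic are illustrative assumptions, not the paper's code:

```python
def assemble(noisy_video, noisy_audio, face_refs, voice_refs,
             source_video=None, driving_audio=None):
    """Toy sketch of symmetric conditional assembly.
    Identity references are CONCATENATED so the model learns 'who';
    structural conditions are ADDED element-wise so layout/timing
    guide 'how/when'. The task is implied purely by which optional
    inputs are present -- no architecture change."""
    # Concatenate identity references onto the noisy targets.
    video_seq = noisy_video + face_refs
    audio_seq = noisy_audio + voice_refs
    # Add structural guidance directly onto the latents, if given.
    if source_video is not None:   # editing: source layout guides motion
        padded = source_video + [0.0] * len(face_refs)
        video_seq = [x + s for x, s in zip(video_seq, padded)]
    if driving_audio is not None:  # animation: driving audio guides timing
        padded = driving_audio + [0.0] * len(voice_refs)
        audio_seq = [x + s for x, s in zip(audio_seq, padded)]
    if source_video is None and driving_audio is None:
        task = "R2AV"
    elif source_video is not None:
        task = "RV2AV"
    else:
        task = "RA2V"
    return task, video_seq, audio_seq

# Face+voice refs with no source video -> R2AV, exactly as in the Anchor.
task, v, a = assemble([0.1, 0.2], [0.3], face_refs=[9.0], voice_refs=[8.0])
```

Passing a `source_video` to the same call flips the task to RV2AV, and passing a `driving_audio` instead flips it to RA2V, with no other change.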

šŸž Hook: Think of karaoke where the music (signal-level) and the on-screen lyrics (semantic-level) must match the singer. 🄬 The Concept (Dual-Level Disentanglement): It separates two kinds of mixing problems—signal binding (matching face and voice) and meaning binding (matching who says which line).

  • How it works: (1) Signal level: Synchronized RoPE assigns each person their own attention ā€œslot,ā€ shared by their face and voice. (2) Semantic level: Structured Captions tag each person (<sub1>, <sub2>, …) and tie attributes and dialogue lines to the right tag.
  • Why it matters: Without it, models often give Person A the voice or lines of Person B, especially in multi-person scenes. šŸž Anchor: In a meeting video with three people, each person keeps their own face, their own voice, and their own speaking turns—no more swaps.
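The signal-level half of this idea can be sketched as a position-range assignment: each identity gets one non-overlapping range, shared by its face tokens and its voice tokens. The slot width and starting offset below are illustrative assumptions (they echo the 150–299 example used later in this article), not the paper's actual layout:

```python
def syn_rope_slots(num_people, slot_size=150, start=150):
    """Toy sketch of Synchronized RoPE slot assignment: one
    non-overlapping positional range per identity, SHARED by that
    person's face tokens (video lane) and voice tokens (audio lane),
    so attention can bind them together and cannot leak across people."""
    slots = {}
    for i in range(1, num_people + 1):
        lo = start + (i - 1) * slot_size
        slots[f"sub{i}"] = range(lo, lo + slot_size)
    return slots

slots = syn_rope_slots(2)
# <sub1>'s face AND voice share 150-299; <sub2>'s share 300-449.
```

Because the ranges never overlap, a query attending inside <sub1>'s slot simply cannot land on <sub2>'s face or voice positions.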

šŸž Hook: You know how subtitles and a movie scene should match? If the explosion happens when the word ā€œboomā€ appears, it feels right. 🄬 The Concept (Bidirectional Cross-Attention): The video stream looks at the audio stream, and the audio stream looks at the video stream, so they keep each other aligned.

  • How it works: (1) The video lane pays attention to audio cues to shape mouth motions. (2) The audio lane pays attention to visual cues to time speech and tone. (3) They update each other at each layer.
  • Why it matters: This keeps lip shapes, timing, and emotion consistent across both streams. šŸž Anchor: When the line is ā€œI’m whispering,ā€ you see small mouth movements and hear a soft voice at the same moment.
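One such mutual-update layer can be sketched with hand-made two-dimensional tokens. This toy version runs a plain cross-attention pass in each direction; everything here is illustrative, not the model's real layer:

```python
import math

def cross_attend(queries, context):
    """One toy cross-attention pass: each query token mixes in the
    context tokens, weighted by softmaxed dot-product similarity."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) for c in context]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        t = sum(w)
        w = [x / t for x in w]
        out.append([sum(wi * c[i] for wi, c in zip(w, context)) for i in range(len(q))])
    return out

# Bidirectional: the video lane looks at audio, and audio looks at video.
video = [[1.0, 0.0]]                  # toy video-lane token
audio = [[1.0, 0.0], [0.0, 1.0]]      # toy audio-lane tokens
video_updated = cross_attend(video, audio)  # lip shapes follow audio cues
audio_updated = cross_attend(audio, video)  # speech timing follows visuals
```

In the real model this two-way exchange happens at every layer, which is what keeps lips and voices locked together.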

šŸž Hook: At a party, name tags stop you from mixing up people with similar hair or clothes. 🄬 The Concept (Structured Captions): A neatly organized script that uses anchor tokens like <sub1>, <sub2> to tie each description and spoken line to the exact person.

  • How it works: (1) An MLLM writes four parts: reference image captions, a target video caption, a target audio caption, and a joint caption with dialogue. (2) Every mention of a person uses the same tag. (3) Attributes and quotes stick to that tag everywhere.
  • Why it matters: The model no longer guesses who says what—it’s told explicitly. šŸž Anchor: ā€œ<sub1> smiles and says, ā€˜Let’s start.’ <sub2> replies, ā€˜Great idea!ā€™ā€ ensures the smile and the first line go to <sub1>, and the reply to <sub2>.
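A toy parser shows how anchor tags remove the guesswork: each quoted line is explicitly owned by exactly one tag. The caption format below is a simplification (the paper's captions come from an MLLM and have four parts); the regex and function name are illustrative:

```python
import re

def parse_structured_caption(caption):
    """Toy parser: map each <subN> anchor tag to the quoted line
    that follows it, so every spoken line has one explicit owner."""
    return dict(re.findall(r'(<sub\d+>)[^"]*"([^"]+)"', caption))

caption = '<sub1> smiles and says, "Let\'s start." <sub2> replies, "Great idea!"'
lines = parse_structured_caption(caption)
```

The model never has to guess who said what: the mapping is spelled out before generation begins.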

šŸž Hook: Think of a school swimming class: first shallow water, then deeper, then full strokes. 🄬 The Concept (Multi-Task Progressive Training): A step-by-step training plan that first builds creativity, then adds strict controls, then fine-tunes all tasks together.

  • How it works: (1) Stage 1 In-pair Reconstruction: learn to recreate samples using their own face/voice refs, with masks so it can’t just copy-paste. (2) Stage 2 Cross-pair Disentanglement: learn to combine face and voice from different clips, forcing true identity/voice understanding. (3) Stage 3 Omni-Task Fine-tuning: train all tasks (R2AV, RV2AV, RA2V) together.
  • Why it matters: Jumping straight to strict tasks makes the model rigid and less creative; the curriculum avoids overfitting and keeps balance. šŸž Anchor: After learning to invent scenes well, the model can obey precise edits without losing its imagination.

Why it works (intuition): The model separates ā€œwhoā€ from ā€œhow/when.ā€ Reference concatenation teaches stable identities and timbres, while structural addition provides motion and timing. Synchronized RoPE keeps each person’s face and voice in the same attention slot, cutting cross-talk. Structured Captions clear up any text ambiguity. Training progresses from flexible to strict so the model stays capable in all modes.

Building blocks (at a glance): Symmetric Conditional DiT, Dual-Level Disentanglement (Synchronized RoPE + Structured Captions), Bidirectional Cross-Attention, and Multi-Task Progressive Training—plus guided sampling at inference.

03Methodology

At a high level: Inputs (text + reference faces + reference voices ± source video ± driving audio) → encode into latents → assemble symmetric sequences (concat refs; add structure) → pass through stacked DiT blocks with bidirectional cross-attention and Syn-RoPE → predict denoised latents → decode to video and audio outputs.

Step-by-step (what, why, example):

  1. Prepare inputs
  • What: Collect a text prompt; one or more reference face images; one or more reference voice timbres; optionally a source video (for editing) or a driving audio (for animation).
  • Why: This set lets the same model cover three tasks by toggling inputs.
  • Example: Two-person cafe scene. You provide <sub1> face image, <sub2> face image, their voice clips, and a script in the text prompt. No source video → R2AV.
  2. Encode to latents
  • What: A video VAE encodes frames into a compact video latent; an audio VAE encodes the waveform into an audio latent.
  • Why: Diffusion works best in a smooth, lower-dimensional space.
  • Example: 8 seconds of 16 fps video and 8 seconds of speech become two latent sequences with matched timelines.
  3. Symmetric conditional assembly
  • What: Build two sequences. For the video lane: [noisy target video; encoded reference faces] and add the source video structure if provided. For the audio lane: [noisy target audio; encoded reference voices] and add driving audio timing if provided.
  • Why: Concatenation of references teaches identity/timbre; addition of structure preserves layout and timing. The symmetry makes the audio and video lanes work the same way.
  • Example: If you’re editing, you add the masked source video layout to guide where people stand and move; if you’re animating, you add the driving audio to guide timing and prosody.
  4. Synchronized RoPE (signal-level binding)
  • What: Assign non-overlapping positional segments (attention slots) to each identity’s face and voice; align audio/video sequence lengths via frequency scaling.
  • Why: It prevents cross-identity attention and makes each person’s face and voice ā€œsnap togetherā€ in the same slot.
  • Example: <sub1> uses positions 150–299 for both face and voice, <sub2> uses 300–449, so the model avoids mixing them.
  5. Bidirectional cross-attention (audio↔video)
  • What: The video lane attends to audio to get correct lip shapes and pace; the audio lane attends to video to stay temporally and emotionally aligned.
  • Why: Mutual checking keeps sync tight and consistent.
  • Example: When the line is surprised, the mouth opens and the voice rises at the same time.
  6. Structured Captions (semantic-level binding)
  • What: Use an MLLM to create four parts: (a) short per-image identity captions with tags; (b) a target video caption (environment, appearance, actions); (c) a target audio caption (ambience and voice feel); (d) a joint caption that assigns each quoted line to a tag.
  • Why: It gives explicit subject-to-attribute and subject-to-line links; no guessing.
  • Example: ā€œ<sub1> says, ā€˜I was hoping…’ <sub2> replies, ā€˜Me tooā€¦ā€™ā€ ensures mouths and voices are assigned correctly.
  7. Denoising with DiT blocks
  • What: Stacked transformer blocks predict and remove noise step by step until clean latents emerge.
  • Why: Diffusion gradually refines details, which is stable for complex video+audio.
  • Example: Early steps outline positions and rhythm; later steps sharpen faces, textures, and voice clarity.
  8. Decode to waveforms and pixels
  • What: The VAEs decode the final latents back into video frames and audio.
  • Why: We need human-viewable/listenable outputs.
  • Example: You get an 8-second clip where <sub1> and <sub2> appear, speak, and act as scripted.
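The denoising step in the pipeline above can be sketched as a loop that repeatedly predicts noise and removes a fraction of it, so coarse structure settles first and detail sharpens later. This toy version uses a made-up noise predictor and schedule; the real model uses stacked DiT blocks and a learned diffusion schedule:

```python
def denoise(x_t, predict_noise, steps=4):
    """Toy deterministic denoising loop: each step, predict the noise
    in the latent and remove a fraction of it, gradually refining
    the latent toward a clean sample."""
    for step in range(steps, 0, -1):
        eps = predict_noise(x_t, step)              # DiT blocks would do this
        x_t = [x - e / steps for x, e in zip(x_t, eps)]
    return x_t

# Pretend the "noise" is simply the distance from a clean target latent.
clean = [1.0, -2.0]
pred = lambda x, t: [xi - ci for xi, ci in zip(x, clean)]
out = denoise([5.0, 3.0], pred)
```

Each pass pulls the latent a fixed fraction closer to the clean target, mirroring how early steps fix positions and rhythm while later steps sharpen faces and voice clarity.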

Training (the recipe for skills): šŸž Hook: Like learning to draw—first trace (guided), then sketch from memory (harder), then draw anything (flexible and precise). 🄬 The Concept (Weakly-Constrained Generative Priors): A flexible base skill the model learns when it’s guided but not over-controlled.

  • How it works: (1) Use references and text but don’t provide exact structure early on. (2) Mask parts of training targets so the model can’t just copy; it must genuinely generate. (3) Later, use cross-pairing to force real understanding of identity and timbre.
  • Why it matters: This base creativity prevents the final model from becoming rigid. šŸž Anchor: The model can make the same person perform new actions in new places while staying recognizable.

šŸž Hook: Riding a bike with training wheels first, then without. 🄬 The Concept (Multi-Task Progressive Training): A curriculum of three stages.

  • How it works: (1) Stage 1 In-pair Reconstruction: Use each clip’s own face/voice as refs; compute loss mainly outside the ref zones (masked) to avoid copying. (2) Stage 2 Cross-pair Disentanglement: Mix face and voice from different clips; remove masks so the model must fully synthesize. (3) Stage 3 Omni-Task Fine-tuning: Mix R2AV, RV2AV, RA2V data so the model learns to switch modes by inputs.
  • Why it matters: It avoids overfitting to the easy, strongly-constrained tasks and keeps R2AV quality high. šŸž Anchor: After stages 1–2 build skill, stage 3 teaches the model to obey exact edits and audio drives without losing identity or voice consistency.
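Stage 1's masked-loss trick, scoring the model only outside the reference zones so that copy-pasting the references earns nothing, can be sketched like this (latents as plain lists; the masking scheme and function name are illustrative, not the paper's implementation):

```python
def masked_mse(pred, target, ref_mask):
    """Toy Stage-1 loss: mean squared error computed ONLY where
    ref_mask == 0 (outside the reference zones), so the model cannot
    score well just by copying the reference tokens it was given."""
    terms = [(p - t) ** 2 for p, t, m in zip(pred, target, ref_mask) if m == 0]
    return sum(terms) / len(terms)

# Perfectly copying the reference region (positions 2-3) earns nothing:
# the loss comes entirely from the generated region (positions 0-1).
loss = masked_mse([0.0, 0.0, 9.0, 9.0], [1.0, 1.0, 9.0, 9.0], [0, 0, 1, 1])
```

In Stage 2 the masks are removed and face/voice come from different clips, so the model must fully synthesize rather than reconstruct.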

Inference (steering the output): šŸž Hook: A teacher grades with a rubric—content points, style points—then combines them into a final score. 🄬 The Concept (Multi-Condition Classifier-Free Guidance): Blend three predictions—no condition, text-only, and text+reference—using adjustable weights per stream (video and audio).

  • How it works: (1) Compute a base prediction with empty conditions. (2) Add a text-guidance push. (3) Add a face/voice-guidance push. (4) Tune weights to emphasize identity or creativity.
  • Why it matters: It stabilizes results and lets users dial in more identity fidelity or more variety. šŸž Anchor: If you raise the identity weight, the generated face and timbre match the references even more tightly.
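The three-way blend can be sketched directly from the recipe above: start from the unconditional prediction, add a text-guidance push, then add a reference-guidance push. The weight values are made-up knobs, not the paper's settings:

```python
def multi_cond_cfg(eps_uncond, eps_text, eps_ref, w_text=5.0, w_ref=2.0):
    """Toy multi-condition classifier-free guidance: blend three noise
    predictions. Raising w_ref pushes the output toward the face/voice
    references; raising w_text pushes it toward the prompt."""
    return [u + w_text * (t - u) + w_ref * (r - t)
            for u, t, r in zip(eps_uncond, eps_text, eps_ref)]

guided = multi_cond_cfg([0.0], [1.0], [2.0])
```

With a separate weight pair per stream (video and audio), a user can dial identity fidelity up or down independently of prompt-following.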

The secret sauce: (1) Symmetric conditional injection (concat refs, add structure) cleanly separates identity/timbre from layout/timing. (2) Syn-RoPE slots eliminate cross-identity leaks. (3) Structured Captions remove semantic guesswork. (4) Progressive training protects creative ability while adding precise control.

04Experiments & Results

The test: The authors introduce IDBench-Omni—a 200-case benchmark covering the three tasks: 100 for reference-based generation (R2AV), 50 for video editing (RV2AV), and 50 for audio-driven animation (RA2V). These include multi-person dialogues, varied identities and timbres, and in-the-wild recordings. They measure video quality and text alignment (AES, ViCLIP), identity similarity (ID-Sim with ArcFace), audio quality (PQ), semantic audio alignment (CLAP), speech accuracy (WER via Whisper), timbre similarity (speaker embedding cosine), and audio-visual sync (SyncNet confidence/distance), plus a tough metric: Speaker Confusion.

The competition: They compare to strong open-source pipelines (Qwen-Image + LTX-2, Qwen-Image + Ovi), leading video personalization/editing systems (Phantom, VACE, HunyuanCustom), and a top commercial model (Wan 2.6). Note that many baselines do not generate audio at all, or cannot take voice references, which makes DreamID-Omni’s full-stack performance notable.

The scoreboard with context (R2AV): DreamID-Omni reaches AESā‰ˆ0.618 (strong video quality), ViCLIPā‰ˆ13.911 (good text-video following), and ID-Simā‰ˆ0.674/0.603 (single/multi-person identity match), while also delivering competitive audio (PQā‰ˆ6.290, CLAPā‰ˆ0.278), very low WERā‰ˆ0.052 (clear speech), solid timbre similarity, and better lip-sync (Sync-C high, Sync-D low). Think of it as getting an overall A when others get a mix of B’s and C’s—especially because DreamID-Omni is handling both pictures and sound at once and respecting references.

Editing (RV2AV): Against VACE and HunyuanCustom, DreamID-Omni posts higher video alignment (ViCLIPā‰ˆ14.832) and identity similarity (ID-Simā‰ˆ0.635), and it uniquely adds high-quality audio: low WER (~0.065), strong timbre similarity, and good sync. That’s like acing both the video and audio parts of a test when others only attempted the video section.

Animation (RA2V): Against Humo and HunyuanCustom, DreamID-Omni hits top-tier video metrics (e.g., ViCLIPā‰ˆ16.618) and competitive lip-sync, while reducing speaker misattribution in multi-person scenes thanks to Structured Captions and Syn-RoPE. In crowded conversations, that’s akin to keeping every character’s lines and lips perfectly aligned when rivals sometimes mix up who is speaking.

Surprising findings:

  • The progressive curriculum matters a lot: starting with weakly-constrained generation builds a creative prior so the final unified model stays flexible and doesn’t overfit to strict tasks. Naive multi-task-from-scratch performed worse on text-following and generalization.
  • Structured Captions dramatically reduce Speaker Confusion: in the ablation, the confusion rate roughly triples when they are removed. Just making the script explicit with <sub1>/<sub2> tags made a big difference.
  • Syn-RoPE simultaneously helped timbre correctness and lip-sync, showing that smarter positional ā€œslotsā€ clear up both who-speaks-when and how they sound.

User study: Professional creators rated DreamID-Omni best or near-best across text-video alignment, identity consistency, video quality, text-audio alignment, timbre similarity, audio quality, and lip-sync—evidence that the gains show up to human eyes and ears, not just metrics.

05Discussion & Limitations

Limitations:

  • Data hunger: The model benefits from large, diverse audio-video datasets with accurate diarization and clean references. Small or biased data can limit identity/timbre generalization.
  • Compute and speed: Dual-stream diffusion with cross-attention is heavy. Real-time applications or mobile devices may struggle without further optimization or distillation.
  • Caption dependency: Structured Captions rely on an MLLM; weak or ambiguous captions can hurt performance. Building robust captions for noisy, crowded scenes remains challenging.
  • Edge cases: Extreme lighting, heavy occlusion, fast camera motion, or overlapping speakers can still cause drift in identity or sync.
  • Domain shift: Highly unusual voices (e.g., strong accents, whisper-only speech) or non-human characters may need extra adaptation.

Required resources:

  • Multiple high-memory GPUs for training, plus storage for large datasets.
  • A captioning MLLM to produce consistent Structured Captions during training/inference.
  • Good front-end tools for diarization, face detection/cropping, and audio cleanup when preparing data.

When NOT to use:

  • Live, ultra-low-latency broadcasting where milliseconds matter.
  • Settings without consent or where identity/voice manipulation could violate privacy or local laws.
  • Music or singing-heavy tasks—timbre preservation may help, but expressive singing poses extra challenges.
  • Non-human characters or stylized cartoons without additional tuning.

Open questions:

  • Can we make this real-time via model compression, distillation, or sparsity without losing identity/timbre fidelity?
  • How to build safer systems: watermarking, provenance tracking, and robust detection against misuse?
  • Can we reduce dependence on MLLM captions, perhaps learning the structure implicitly while keeping the same gains?
  • How far can we scale to more speakers, longer scenes, and scene cuts while keeping zero Speaker Confusion?
  • Can this framework extend to 3D avatars, AR/VR, or multi-camera scenarios with the same identity/timbre guarantees?

06Conclusion & Future Work

Three-sentence summary: DreamID-Omni unifies three human-centric tasks—R2AV generation, RV2AV editing, and RA2V animation—inside a single, symmetric diffusion transformer. It keeps faces and voices correctly bound using Dual-Level Disentanglement (Synchronized RoPE + Structured Captions) and learns a balanced skill set through a progressive training curriculum. Across a new benchmark, it achieves state-of-the-art performance in video, audio, and audio-visual consistency, even lowering speaker confusion in multi-person scenes.

Main achievement: Showing that one carefully designed model can control who appears, who speaks, and how they move, all at once, by separating identity+timbre (who) from structure+timing (how/when) and then binding them back together safely.

Future directions: Make it faster and lighter for real-time use; reduce reliance on external captioning; scale to more speakers and longer stories; extend to singing, background music mixing, 3D avatars, and multi-camera edits; integrate safety layers like watermarking and consent checks.

Why remember this: It transforms a messy toolbox of separate models into one coherent system that creators can steer with faces, voices, and scripts—reliably keeping the right person with the right voice at the right time. That’s a practical recipe for the next wave of trustworthy AI video production.

Practical Applications

  • Create branded virtual presenters that consistently keep the same face and voice across many videos.
  • Edit a team meeting video to replace a speaker’s face and voice while preserving the original scene layout.
  • Animate a portrait photo to speak a podcast or narration with precise lip-sync.
  • Produce multilingual versions of training videos by swapping in translated audio and matching mouth movements.
  • Generate multi-person dialogue scenes where each character’s face and voice are preserved and correctly attributed.
  • Rapidly prototype commercials and social media clips with consistent on-screen talent and voiceovers.
  • Assist film post-production by cleanly replacing dialogue (ADR) with accurate lip-sync and matching timbre.
  • Create educational explainers with consistent teacher avatars narrating lessons.
  • Build safer corporate communications by reducing speaker mix-ups in multi-speaker announcements.
  • Personalize customer support avatars that match a company’s identity and approved voice.
#audio-video generation #diffusion transformer #identity preservation #voice timbre control #lip-sync #cross-attention #rotary positional embeddings #structured captions #video editing #audio-driven animation #multi-task training #classifier-free guidance #dual-stream architecture #speaker confusion