Mode Seeking meets Mean Seeking for Fast Long Video Generation
Key Summary
- Short videos are easy for AI to make sharp and lively, but long videos need stories and memory, and there isn't much training data for that.
- This paper splits the job in two: one part learns long-term story flow (mean-seeking), and another part keeps close-up details crisp (mode-seeking).
- They use one shared "brain" (encoder) with two small "mouths" (heads): a Flow Matching head for global coherence and a Distribution Matching head that copies a short-video expert locally.
- The local head learns by sliding along the long video in short windows and matching a frozen short-video teacher using a mode-seeking reverse-KL trick (DMD/VSD).
- The global head learns long-range narrative by supervised flow matching on scarce but real long videos.
- At test time, they only use the local head as a fast, few-step sampler, so long videos render quickly and stay sharp.
- Results show better motion, sharpness, and consistency over minutes than strong baselines like mixed-length SFT and autoregressive rollouts.
- An ablation study shows all parts matter: decoupled heads, sliding-window teacher matching, and long-video SFT each add crucial gains.
- This closes the "fidelity-horizon" gap: the model keeps short-clip realism while scaling to minute-long, coherent stories.
- The approach needs a good short-video teacher and some long-video data, but it's modular and can pair with autoregressive methods in the future.
Why This Research Matters
Long, coherent, high-quality videos power storytelling, education, simulation, and games by keeping characters and scenes consistent over time. This method makes such videos faster to generate and more realistic by combining global story learning with local detail preservation. Creators can prototype films or ads that stay sharp from start to finish, while students can watch science explainers that visually remain stable. Game and robotics simulations benefit from believable worlds that don't drift or glitch after a few seconds. By needing only a modest set of long videos and a good short-video teacher, this approach is practical for many teams, not just the largest labs.
Detailed Explanation
01 Background & Problem Definition
You know how making a 5-second clip of a bouncing ball is easy, but filming a whole soccer game is harder because the players, score, and storyline must stay consistent for a long time? That's the main challenge in AI video: short clips are easy to learn and look great; long videos need memory and structure.
The World Before: AI got really good at short video generation thanks to tons of web data with a few seconds per clip. With that abundance, models learned to make crisp textures, punchy motion, and high fidelity. People often scaled image/video models by mixing data of different sizes or resolutions. For images, higher resolution is mostly an interpolation of the same local patches (a 256×256 and a 1024×1024 photo share similar local textures). This idea tempted researchers to treat time like pixels: just mix short and long clips and let the model "figure it out."
The Problem: Time isn't like resolution. A minute-long video isn't just a zoomed-in 5-second clip; it's an extrapolation: new events happen, characters change places, and causes lead to effects. Long videos demand narrative structure and stable identities over many seconds. But high-quality long videos are rare and expensive to collect and clean. When a single model tries to learn everything at once from scarce long videos, it often forgets how to make the sharp local details that short-video experts mastered. Outputs become soft, less detailed, and less lively.
Failed Attempts:
- Mixed-length supervised finetuning (SFT): Train on a soup of different clip durations. It helps somewhat with longer context but often blurs local detail because the objective averages over uncertainty, especially when long-video data is scarce.
- Teacher-only approaches with autoregressive rollouts: Use a great short-video teacher to generate longer videos step by step. They keep local sharpness at first but drift over time (small errors compound), get over-saturated, or turn too static to avoid mistakes. They also lack true long-range grounding, since the teacher never learned minute-scale structure.
- Training-free horizon hacks (e.g., noise schedules, positional tweaks): Extend length without retraining. Handy, but they can't teach real story memory or fix long-range identity consistency.
The Gap: We need a way to keep the short-clip excellence (crisp looks and lively motion) while truly learning the big-picture story over minutes, without demanding a mountain of long-video data. In other words, decouple local fidelity from long-term coherence, so each can be learned with its best available signal.
Real Stakes: Imagine a kid's cartoon where the main character's hat changes color randomly after 20 seconds, or a travel video where the city background suddenly hops across the globe. Long, coherent videos matter for:
- Storytelling and film prototypes where characters and scenes persist.
- Games and simulation where the world should evolve logically.
- Education and documentaries that need steady facts across time.
- Creative editing and animation that keep style and identity intact.
- Robot or agent world models that require long-term memory and causality.
This paper's answer is simple but powerful: split the training into two complementary goals, each learned with the best teacher available, then stitch them together in one model that runs fast at inference time.
02 Core Idea
Aha! Moment in one sentence: Teach one model to do two jobs at once, mean-seeking for long-range story flow and mode-seeking for close-up realism, by giving them separate heads that share the same long-context understanding.
Let's explain the key ideas in kid-friendly bites using the Sandwich pattern, in the right order.
Hook: You know how if you average everyone's doodle of a cat, you get a blurry cat? But if you pick the most popular, well-drawn style, it looks crisp. Mean Seeking (what/how/why): Mean seeking is a training style that predicts the average best guess when things are uncertain. How: the model sees noisy inputs and learns to output a balanced, safe estimate that fits many possibilities. Why it matters: it stabilizes long-range structure, like keeping the same character and scene over time, but by itself it can wash out fine details. Anchor: In a mystery story, mean seeking helps the plot stay on track, but some tiny clues might look a bit faded.
Hook: Imagine choosing a flavor at an ice-cream shop: you pick a specific, yummy scoop, not a mix of all flavors. Mode Seeking: Mode seeking aims for the most likely, high-quality options instead of averaging everything. How: it pushes the model toward sharp, realistic outcomes that the data shows most often. Why it matters: without it, faces, textures, and fast motions can look blurry or dull. Anchor: In a skateboarding video, mode seeking makes the board edges crisp and the motion snappy.
Hook: Think of a school project where one teammate plans the whole storyline (big picture) and another perfects each slide's details (local quality). Decoupled Diffusion Transformer (DDT): DDT is one shared long-context encoder with two small decoder heads, one for global mean-seeking and one for local mode-seeking. How: the encoder reads the whole (noisy) video and builds a unified representation; then the Flow Matching head learns long-range coherence from real long videos, while the Distribution Matching head learns local realism from a frozen short-video teacher. Why it matters: if one head tried to do both jobs, their goals would clash; splitting them avoids interference. Anchor: It's like having a director (story flow) and a camera expert (crisp shots) who share the same script.
Hook: Picture a flipbook animation where each page change should feel smooth, not jumpy. Flow Matching: Flow matching teaches how to smoothly move from noise to a clean video, focusing on the average best path. How: the model sees noised versions of real long videos and learns the "velocity" to denoise them step by step. Why it matters: without it, the big-picture timeline (who's where, what happens next) falls apart over minutes. Anchor: In a soccer match video, flow matching helps the team positions and camera movement make sense over time.
Hook: Imagine checking short scenes through a magnifying glass to ensure each looks like a real mini-clip a pro would shoot. Distribution Matching: Distribution matching makes each short window of the long video resemble the short-video teacher's output distribution. How: slide a small window across the long video, and for each, nudge the student to act like the teacher locally. Why it matters: without it, close-ups lose texture, motion looks mushy, and faces soften. Anchor: A 5-second street view stays sharp with lively traffic, just like the teacher's short clips.
Hook: Think of comparing two opinions: not "how do I cover all views?" but "how do I focus on the strongest, most convincing one?" Reverse-KL Divergence: Reverse-KL is a way to compare two distributions that favors concentrating on the teacher's best, high-fidelity modes. How: it heavily penalizes the student for placing probability where the teacher assigns almost none, so instead of spreading out to cover everything, the student locks onto one sharp peak. Why it matters: it avoids blurry averages and preserves crisp details. Anchor: When copying a star baker, you try to match their signature cake exactly, not an average of all desserts.
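The mean-seeking vs. mode-seeking contrast can be made concrete with a toy one-dimensional example (my own illustration, not from the paper): fit a single-Gaussian "student" to a two-peaked "teacher" distribution. Minimizing forward KL lands the student between the peaks (a blurry average), while minimizing reverse KL locks it onto one peak.

```python
import numpy as np

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
p = 0.5 * gauss(x, -2.0, 0.5) + 0.5 * gauss(x, 2.0, 0.5)  # bimodal "teacher"

def kl(a, b):
    # Numerically integrated KL(a || b), with floors to avoid log(0).
    af, bf = np.maximum(a, 1e-300), np.maximum(b, 1e-300)
    return float(np.sum(a * np.log(af / bf)) * dx)

mus = np.linspace(-4.0, 4.0, 801)
fwd = [kl(p, gauss(x, m, 0.5)) for m in mus]  # mean-seeking: KL(teacher || student)
rev = [kl(gauss(x, m, 0.5), p) for m in mus]  # mode-seeking: KL(student || teacher)
best_fwd = float(mus[np.argmin(fwd)])  # near 0: a blurry average between the modes
best_rev = float(mus[np.argmin(rev)])  # near -2 or +2: one sharp mode
```

Forward KL places the fixed-width student at the teacher's overall mean, exactly the "blurry cat" average; reverse KL settles on one of the two sharp peaks.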
Multiple Analogies for the whole idea:
- Movie set: the director ensures the story makes sense scene to scene (mean-seeking), while the cinematographer keeps every shot crisp and well-lit (mode-seeking). Sharing the same script keeps them aligned.
- Orchestra: the conductor maintains the piece's structure across movements; section leaders polish local passages so they sparkle. Same score, complementary roles.
- Cooking: the recipe ensures the meal flows course by course; the taste-tests keep each bite delicious. One kitchen, two checks.
Before vs After:
- Before: Mixing durations blurred details; teacher-only methods drifted or went static; training-free tricks lacked real memory.
- After: A single model keeps short-clip realism and learns long-range coherence, then generates long videos quickly with a few steps.
Why It Works (intuition): The best signal for minute-scale structure is scarce long videos, which is perfect for mean-seeking flow matching. The best signal for crisp local realism is the abundant knowledge inside a frozen short-video teacher, which is perfect for mode-seeking reverse-KL on sliding windows. Sharing one encoder lets both lessons shape the same long-context understanding without mixing their gradients in one head.
Building Blocks:
- Shared long-context encoder: one representation for both goals.
- Flow Matching head (mean-seeking): learns narrative and stability from long videos.
- Distribution Matching head (mode-seeking): learns local sharpness/motion from the teacher via reverse-KL.
- Sliding windows: apply the teacher check everywhere along the long video.
- Few-step sampler: the local head becomes fast at inference, enabling quick minute-long videos.
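The building blocks above can be sketched as a tiny skeleton. Everything here is a hypothetical stand-in (toy linear layers, made-up dimensions, not the paper's code), meant only to show the shape of the design: one shared encoder feeding two separate heads.

```python
import numpy as np

class DecoupledDDT:
    """Minimal toy skeleton of the decoupled design: shared encoder, two heads."""
    def __init__(self, dim, seed=0):
        r = np.random.default_rng(seed)
        self.enc = r.standard_normal((dim, dim)) / np.sqrt(dim)  # shared long-context encoder
        self.fm = r.standard_normal((dim, dim)) / np.sqrt(dim)   # mean-seeking FM head
        self.dm = r.standard_normal((dim, dim)) / np.sqrt(dim)   # mode-seeking DM head

    def features(self, x):
        # One encoding of the whole (noisy) long-video latent, shared by both heads.
        return x @ self.enc

    def fm_velocity(self, x):
        # Used only during training, supervised on real long videos.
        return self.features(x) @ self.fm

    def dm_velocity(self, x):
        # Used for teacher matching during training and for few-step inference.
        return self.features(x) @ self.dm

model = DecoupledDDT(dim=16)
rng = np.random.default_rng(1)
x = rng.standard_normal((240, 16))          # toy long-video latent sequence
v_fm, v_dm = model.fm_velocity(x), model.dm_velocity(x)
assert v_fm.shape == v_dm.shape == x.shape  # same features, two specialized outputs
```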
03 Methodology
At a high level: Text prompt (and optional image) → add noise to a long video latent → shared long-context encoder builds features → two heads learn different things during training → at test time, use only the local Distribution Matching head to generate the video in a few steps.
Step 0. The stage and actors (data and latents)
- What happens: Videos live in a compressed "latent" space using a video VAE, so training and generation are efficient. We have two data sources: scarce long, coherent videos (tens of seconds to a minute), and a powerful short-video teacher model (≈5-second expertise) that we never retrain.
- Why this step exists: Latents make compute tractable for minuteâlong contexts.
- Example: A 30-second, 8-fps video becomes a latent tensor with 240 time steps; short windows might be 5 seconds (40 frames) long.
Step 1. Add noise along a simple path (rectified flow setup)
- What happens: We create a noising path from a clean video latent toward random noise by linearly mixing them. The model's job is to learn the "velocity" that brings noisy states back to the clean video.
- Why it matters: This sets up a stable training task where learning the velocity field equals learning how to denoise over time.
- Example: Take a real long video latent and a random latent, mix them by a fraction t (0 to 1), and ask the model for the correct velocity to move back to the clean video.
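A minimal sketch of this noising setup, under one common rectified-flow convention (the paper's exact sign and time conventions may differ): interpolate linearly between a clean latent and noise, and define the target velocity as the constant difference along that straight path.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shape (T, C, H, W) for a long-video latent.
x0 = rng.standard_normal((240, 4, 8, 8))  # stand-in for a clean long-video latent
x1 = rng.standard_normal((240, 4, 8, 8))  # pure Gaussian noise

t = 0.3                           # noise fraction in [0, 1]
xt = (1.0 - t) * x0 + t * x1      # linear (rectified-flow) interpolation
v_target = x1 - x0                # constant velocity along the straight path

# Sanity check: stepping from xt back to t=0 with the true velocity recovers x0.
x0_rec = xt - t * v_target
assert np.allclose(x0_rec, x0)
```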
Step 2. Shared long-context encoder reads the whole scene
- What happens: The encoder looks at the noisy long video latent, the timestep t, and conditioning (like text) to build a unified spatiotemporal representation that spans the entire duration.
- Why it matters: Longârange dependencies (who/what/where over time) need a big receptive field; this encoder is the common backbone for both heads.
- Example: For "a child builds a sandcastle, then waves as the tide comes in," the encoder must carry the idea that the same child, beach, and castle persist and evolve.
Step 3. Two lightweight decoder heads with different jobs
A) Flow Matching (FM) head (mean-seeking global anchor)
- What happens: On real long videos, the FM head learns to predict the velocity that maps noisy states back to ground truth. It is trained with supervised flow matching on fullâlength clips.
- Why it matters: Without FM SFT, the model has no reliable signal about true minuteâscale narrative and identity persistence.
- Example: If a dog runs from left to right and later returns, FM anchors those events in the right order without teleporting.
B) Distribution Matching (DM) head (mode-seeking local sharpness)
- What happens: We slide a window (≈5 s) across the student's own generated long video and query the frozen short-video teacher on those windows. Using a reverse-KL style gradient (DMD/VSD), we nudge the student to match the teacher's high-fidelity modes locally.
- Why it matters: Without this, local textures blur and motion loses snap; pure long-video SFT can't teach short-clip finesse.
- Example: As the window passes a city street, the DM head learns from the teacher how to keep car edges crisp and pedestriansâ motion lively.
Step 4. Slidingâwindow alignment details
- What happens: For each training iteration of the DM head, we generate a student long video rollout, crop many overlapping short windows, and compare the student's local "velocity/score" to the teacher's on the same noisy window. We backprop through the student windows (on-policy), not through the frozen teacher.
- Why it matters: On-policy windows ensure we fix the student's actual mistakes in its own contexts, not just mismatches on random data.
- Example: If the student tends to blur faces after 20 seconds, windows near 20s will get strong corrective signals from the teacher.
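A rough sketch of the sliding-window loop, with made-up shapes and stub networks in place of the real student and teacher; it only illustrates how windows are cropped from the student's own rollout and where the DMD/VSD-style student-minus-teacher signal comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

def student_velocity(window):   # hypothetical student query on a noisy window
    return 0.9 * window         # toy stand-in for the real network

def teacher_velocity(window):   # hypothetical frozen short-video teacher
    return window               # toy stand-in for the real network

T, W, stride = 240, 40, 20      # long latent length, ~5 s window, overlap stride
rollout = rng.standard_normal((T, 4, 8, 8))  # student's own rollout (on-policy)

t = 0.5
grads = []
for start in range(0, T - W + 1, stride):
    clean = rollout[start:start + W]  # crop one short window
    noisy = (1 - t) * clean + t * rng.standard_normal(clean.shape)
    # DMD/VSD-style surrogate signal: difference between the student's and the
    # frozen teacher's predictions on the same noisy window.
    grads.append(student_velocity(noisy) - teacher_velocity(noisy))

# In real training this signal is backpropagated through the student windows
# only; the teacher receives no gradient.
```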
Step 5. Boundary fix for windows (practical trick)
- What happens: Some models expect the first latent of a clip to be "image-like." When cropping a window from the middle of a long latent, we reconstruct and re-encode a preceding frame as an image latent to prepend, then mask it so we don't train the VAE.
- Why it matters: Prevents semantic mismatch at window starts, which would confuse the teacher comparison.
- Example: Cropping a window starting at frame 120, we decode frame 119 to RGB, re-encode it as the first latent for the window, and mask its gradients.
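The boundary trick might look roughly like this, with hypothetical VAE stand-ins (the real encode/decode are neural networks) and made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_decode(latent):            # hypothetical VAE decoder stand-in
    return latent.repeat(2, axis=-1)

def vae_encode(frame):             # hypothetical VAE image-encoder stand-in
    return frame[..., ::2]

long_latent = rng.standard_normal((240, 4, 8, 8))
start, W = 120, 40

window = long_latent[start:start + W]
# Decode the frame just before the window and re-encode it as an "image-like"
# first latent, matching what the short-video teacher expects at a clip start.
prev_rgb = vae_decode(long_latent[start - 1:start])
img_latent = vae_encode(prev_rgb)
window = np.concatenate([img_latent, window], axis=0)

# Mask: the prepended latent provides context but receives no training gradient.
grad_mask = np.ones(window.shape[0], dtype=bool)
grad_mask[0] = False
```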
Step 6. Joint training without head conflicts
- What happens: We combine two losses: (i) the FM head gets supervised long-video flow matching; (ii) the DM head gets the reverse-KL DMD/VSD surrogate on sliding windows. Both update the shared encoder; each head only sees its own gradients.
- Why it matters: If one head tried to satisfy both mean-seeking and mode-seeking at once, the gradients would fight (averaging vs. sharpening). Decoupling avoids interference.
- Example: The shared encoder learns that "this is the same character at the beach," while the DM head, guided by the teacher, ensures hair strands and wave foam stay detailed.
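In pseudocode terms, the joint objective routes each loss to its own head. The sketch below uses toy linear stand-ins and, purely for illustration, replaces the DMD/VSD surrogate with a plain regression to the teacher's velocity; the point is only the routing, not the losses themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy parameters for the shared encoder and the two heads.
enc_w, fm_w, dm_w = 1.0, 1.0, 1.0

def encoder(x):  return enc_w * x   # shared long-context features
def fm_head(h):  return fm_w * h    # mean-seeking velocity prediction
def dm_head(h):  return dm_w * h    # mode-seeking velocity prediction

x = rng.standard_normal(16)
v_gt = rng.standard_normal(16)       # ground-truth velocity from a real long video
v_teacher = rng.standard_normal(16)  # frozen teacher's velocity on a short window

h = encoder(x)
loss_fm = np.mean((fm_head(h) - v_gt) ** 2)       # supervised flow matching (FM head)
loss_dm = np.mean((dm_head(h) - v_teacher) ** 2)  # teacher-matching surrogate (DM head)

# Both losses update the shared encoder, but each head sees only its own loss,
# so mean-seeking and mode-seeking gradients never mix inside one head.
total = loss_fm + loss_dm
```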
Step 7. Fast inference with the DM head
- What happens: At test time, we drop the FM head and use the DM head as a few-step sampler to generate the whole long video end-to-end (bidirectionally). The encoder still provides long-context features shaped by FM training.
- Why it matters: We get the best of both worlds: long-range coherence (learned by FM) and crisp local realism (learned by DM), and we render quickly.
- Example: A 30-second prompted video comes out with sharp subjects and stable story beats in just a handful of denoising steps.
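A few-step Euler sampler over the learned velocity field could look like the sketch below. The DM head is replaced by a stand-in whose rectified-flow target is the all-zero latent, so under the straight-path convention four steps land exactly on that target; a real model would condition on text and take similarly few steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def dm_velocity(x, t):
    # Hypothetical DM-head query. This stand-in follows the rectified-flow
    # path x_t = t * noise when the clean target is the zero latent: v = x / t.
    return x / t

steps = 4                              # a handful of denoising steps
ts = np.linspace(1.0, 0.0, steps + 1)  # t=1 is pure noise, t=0 the clean video

x = rng.standard_normal((240, 4, 8, 8))  # noise for the whole long-video latent
for t_hi, t_lo in zip(ts[:-1], ts[1:]):
    x = x + (t_lo - t_hi) * dm_velocity(x, t_hi)  # one Euler step toward t_lo
```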
The Secret Sauce:
- Decoupling objectives: mean-seeking (global FM) and mode-seeking (local DM) live in separate heads, so they cooperate instead of collide.
- Sliding-window reverse-KL (DMD/VSD): uses the teacher only where it's strongest (short windows) across the entire long video.
- On-policy updates: fix errors the student actually makes during its own rollouts.
- Few-step generation: the DM head becomes a fast sampler, turning long video creation from slow to snappy.
04 Experiments & Results
The Test: The authors asked, "Can our model keep both the close-up sparkle and the long-range story over half a minute?" They generated 30-second videos for 200 long-form prompts and measured:
- Subject consistency: Does the main character stay the same?
- Background consistency: Does the scene remain plausible over time?
- Motion smoothness and temporal flicker: Does movement feel natural without jitter?
- Dynamic degree: Is the video lively rather than frozen?
- Aesthetic and image quality: Do frames look pleasing and detailed?
- VLM-based consistency (Gemini-3-Pro): A smart model's score for semantic consistency, with penalties if the video is just a still image.
The Competition (Baselines):
- Long-context SFT: Finetune only on long videos.
- Mixed-length SFT: Train on a mixture of short and long videos (the industry standard).
- CausVid, Self-Forcing: Teacher-driven autoregressive rollouts.
- InfinityRoPE: A strong length-extension trick that can go long but may get too static.
The Scoreboard (contextualized):
- The proposed method scored top-tier across the board and especially shined by balancing sharpness (image quality 0.6982) and motion smoothness (0.9863) with strong consistency (subject 0.9682, background 0.9548). Think of this as getting an A in story memory and an A in close-up cinematography, while many baselines got an A in one column but a B- or C in the other.
- The dynamic degree stayed high (0.9453), indicating the videos were not just still frames pretending to be consistent. Aesthetic quality (0.5735) was competitive while maintaining motion and identity, which matters because some baselines looked pretty but drifted, or stayed consistent by becoming too still.
Surprising/Notable Findings:
- AR teacher-only methods often looked punchy at first but drifted or became over-saturated and less dynamic as time went on. This boosted some consistency metrics artificially (because not much moved), but the proposed method still outperformed them overall.
- Mixed-length SFT helped coherence, yet tended to wash out crisp textures, confirming that a single mean-seeking objective under scarce long-video data blurs details.
Ablations (what parts matter?):
- No decoupled dual heads: Biggest drop; confirms that mixing mean-seeking and mode-seeking into one head creates gradient conflicts.
- No sliding-window DMD (teacher matching): Model reverts to SFT-only; coherence is okay, but local sharpness and motion suffer.
- No SFT (only teacher windows): Motion can be decent, but global story/consistency drops; short-clip teachers are blind to minute-scale narrative.
- Full model: Best overall; each ingredient is necessary to close the fidelity-horizon gap.
Takeaway from the numbers: The new model simultaneously improves near-frame quality and long-range coherence, while keeping videos dynamic. That combination is exactly what long-form generation needs and what prior approaches struggled to achieve together.
05 Discussion & Limitations
Limitations:
- Needs a strong short-video teacher: If the teacher is weak or out-of-domain (e.g., unusual art styles), local sharpness learning may be limited.
- Long-video data scarcity remains: Although minimized, the method still requires some high-quality long videos for supervised flow matching.
- Compute and memory: Training with long contexts and sliding windows is non-trivial, even with efficient latents and parallelism.
- Domain shifts: Extremely fast scene changes, complex multi-shot edits, or rare events not seen by either the teacher or the long-video set can still challenge coherence.
Required Resources:
- A capable short-video diffusion/flow teacher with accessible velocity/score queries.
- A curated long-video set (tens of seconds to a minute) covering narrative structures of interest.
- GPUs with sufficient memory; sequence/attention optimizations (e.g., DeepSpeed Ulysses) help.
When NOT to Use:
- If you only need short clips (≤5 s) with top fidelity, direct short-video models may be simpler and faster.
- If there is zero long-video data in your domain, the method can preserve local realism but won't learn true minute-scale narrative.
- If your output must be strictly causal at inference (frame-by-frame streaming), you might prefer AR designs or adapt this model with causal masks.
Open Questions:
- How best to merge this bidirectional fast generator with causal AR rollouts for interactive or streaming scenarios?
- Can we distill the learned longâcontext knowledge into even fewer steps without quality loss?
- How robust is the approach to style transfer, multi-shot storytelling, or cross-modal controls (audio, actions)?
- Can retrieval or memory modules further strengthen long-range identity and scene persistence beyond a minute?
06 Conclusion & Future Work
3-Sentence Summary: This paper teaches one model to do two complementary jobs: mean-seeking flow matching learns minute-scale story structure from scarce long videos, while mode-seeking distribution matching copies crisp local realism from a frozen short-video teacher. It keeps these goals in separate decoder heads that share a long-context encoder, preventing their gradients from fighting. At test time, the local head generates long videos in a few steps, yielding outputs that are both sharp and coherent.
Main Achievement: Closing the fidelity-horizon gap (preserving short-clip sharpness and motion while scaling to minute-long coherence) via a simple, effective decoupling of objectives inside a single Decoupled Diffusion Transformer.
Future Directions: Combine this fast bidirectional model with causal autoregressive streaming; integrate retrieval/memory for even longer horizons; extend to multi-shot narratives and audio-visual controls; and explore distilling to ultra-few-step samplers.
Why Remember This: It's a clean recipe: teach long-range story with mean-seeking, teach close-up sparkle with mode-seeking, let one encoder tie them together, and run fast at inference. That simple split solves a stubborn problem others tried to brute-force, making long, lively, and consistent videos much more practical.
Practical Applications
- Rapid prototyping of minute-long storyboards and animatics that stay coherent and sharp.
- Educational videos where diagrams and labels remain consistent across long explanations.
- Marketing and product demos with stable branding and style over extended sequences.
- Previsualization for film and TV that preserves character identity across scenes.
- Game trailer or cut-scene generation with lively motion and long-range narrative flow.
- Long-horizon simulation clips for robotics or autonomous systems world models.
- Social media content creation where longer edits keep local detail and avoid drift.
- Sports highlight synthesis that keeps players, uniforms, and fields consistent across plays.
- Travel or nature videos with steady scenes and crisp textures over many seconds.
- Interactive editing/animation pipelines that require fast, few-step long-video rendering.