Mode Seeking meets Mean Seeking for Fast Long Video Generation
Key Summary
- Short videos are easy for AI to make sharp and lively, but long videos need stories and memory, and there isn't much training data for that.
- This paper splits the job in two: one part learns long-term story flow (mean-seeking), and another part keeps close-up details crisp (mode-seeking).
- They use one shared "brain" (encoder) with two small "mouths" (heads): a Flow Matching head for global coherence and a Distribution Matching head that copies a short-video expert locally.
- The local head learns by sliding along the long video in short windows and matching a frozen short-video teacher using a mode-seeking reverse-KL trick (DMD/VSD).
- The global head learns long-range narrative by supervised flow matching on scarce but real long videos.
- At test time, they only use the local head as a fast, few-step sampler, so long videos render quickly and stay sharp.
- Results show better motion, sharpness, and consistency over minutes than strong baselines like mixed-length SFT and autoregressive rollouts.
- An ablation study shows all parts matter: decoupled heads, sliding-window teacher matching, and long-video SFT each add crucial gains.
- This closes the "fidelity-horizon" gap: the model keeps short-clip realism while scaling to minute-long, coherent stories.
- The approach needs a good short-video teacher and some long-video data, but it's modular and can pair with autoregressive methods in the future.
Why This Research Matters
Long, coherent, high-quality videos power storytelling, education, simulation, and games by keeping characters and scenes consistent over time. This method makes such videos faster to generate and more realistic by combining global story learning with local detail preservation. Creators can prototype films or ads that stay sharp from start to finish, while students can watch science explainers that visually remain stable. Game and robotics simulations benefit from believable worlds that don't drift or glitch after a few seconds. By needing only a modest set of long videos and a good short-video teacher, this approach is practical for many teams, not just the largest labs.
Detailed Explanation
01 Background & Problem Definition
You know how making a 5-second clip of a bouncing ball is easy, but filming a whole soccer game is harder because the players, score, and storyline must stay consistent for a long time? That's the main challenge in AI video: short clips are easy to learn and look great; long videos need memory and structure.
The World Before: AI got really good at short video generation thanks to tons of web data with a few seconds per clip. With that abundance, models learned to make crisp textures, punchy motion, and high fidelity. People often scaled image/video models by mixing data of different sizes or resolutions. For images, higher resolution is mostly an interpolation of the same local patches (a 256×256 and a 1024×1024 photo share similar local textures). This idea tempted researchers to treat time like pixels: just mix short and long clips and let the model "figure it out."
The Problem: Time isn't like resolution. A minute-long video isn't just a zoomed-in 5-second clip; it's an extrapolation: new events happen, characters change places, and causes lead to effects. Long videos demand narrative structure and stable identities over many seconds. But high-quality long videos are rare and expensive to collect and clean. When a single model tries to learn everything at once from scarce long videos, it often forgets how to make the sharp local details that short-video experts mastered. Outputs become soft, less detailed, and less lively.
Failed Attempts:
- Mixed-length supervised finetuning (SFT): Train on a soup of different clip durations. It helps somewhat with longer context but often blurs local detail because the objective averages over uncertainty, especially when long-video data is scarce.
- Teacher-only approaches with autoregressive rollouts: Use a great short-video teacher to generate longer videos step by step. They keep local sharpness at first but drift over time (small errors compound), get over-saturated, or turn too static to avoid mistakes. They also lack true long-range grounding, since the teacher never learned minute-scale structure.
- Training-free horizon hacks (e.g., noise schedules, positional tweaks): Extend length without retraining. Handy, but they can't teach real story memory or fix long-range identity consistency.
The Gap: We need a way to keep the short-clip excellence (crisp looks and lively motion) while truly learning the big-picture story over minutes, without demanding a mountain of long-video data. In other words, decouple local fidelity from long-term coherence, so each can be learned with its best available signal.
Real Stakes: Imagine a kid's cartoon where the main character's hat changes color randomly after 20 seconds, or a travel video where the city background suddenly hops across the globe. Long, coherent videos matter for:
- Storytelling and film prototypes where characters and scenes persist.
- Games and simulation where the world should evolve logically.
- Education and documentaries that need steady facts across time.
- Creative editing and animation that keep style and identity intact.
- Robot or agent world models that require long-term memory and causality.
This paper's answer is simple but powerful: split the training into two complementary goals, each learned with the best teacher available, then stitch them together in one model that runs fast at inference time.
02 Core Idea
Aha! Moment in one sentence: Teach one model to do two jobs at once, mean-seeking for long-range story flow and mode-seeking for close-up realism, by giving them separate heads that share the same long-context understanding.
Let's explain the key ideas in kid-friendly bites using the Sandwich pattern, in the right order.
Hook: You know how if you average everyone's doodle of a cat, you get a blurry cat? But if you pick the most popular, well-drawn style, it looks crisp. Mean Seeking (what/how/why): Mean seeking is a training style that predicts the average best guess when things are uncertain. How: the model sees noisy inputs and learns to output a balanced, safe estimate that fits many possibilities. Why it matters: it stabilizes long-range structure, like keeping the same character and scene over time, but by itself it can wash out fine details. Anchor: In a mystery story, mean seeking helps the plot stay on track, but some tiny clues might look a bit faded.
Hook: Imagine choosing a flavor at an ice-cream shop: you pick a specific, yummy scoop, not a mix of all flavors. Mode Seeking: Mode seeking aims for the most likely, high-quality options instead of averaging everything. How: it pushes the model toward sharp, realistic outcomes that the data shows most often. Why it matters: without it, faces, textures, and fast motions can look blurry or dull. Anchor: In a skateboarding video, mode seeking makes the board edges crisp and the motion snappy.
Hook: Think of a school project where one teammate plans the whole storyline (big picture) and another perfects each slide's details (local quality). Decoupled Diffusion Transformer (DDT): DDT is one shared long-context encoder with two small decoder heads, one for global mean-seeking and one for local mode-seeking. How: the encoder reads the whole (noisy) video and builds a unified representation; then the Flow Matching head learns long-range coherence from real long videos, while the Distribution Matching head learns local realism from a frozen short-video teacher. Why it matters: if one head tried to do both jobs, their goals would clash; splitting them avoids interference. Anchor: It's like having a director (story flow) and a camera expert (crisp shots) who share the same script.
Hook: Picture a flipbook animation where each page change should feel smooth, not jumpy. Flow Matching: Flow matching teaches how to smoothly move from noise to a clean video, focusing on the average best path. How: the model sees noised versions of real long videos and learns the "velocity" to denoise them step by step. Why it matters: without it, the big-picture timeline (who's where, what happens next) falls apart over minutes. Anchor: In a soccer match video, flow matching helps the team positions and camera movement make sense over time.
Hook: Imagine checking short scenes through a magnifying glass to ensure each looks like a real mini-clip a pro would shoot. Distribution Matching: Distribution matching makes each short window of the long video resemble the short-video teacher's output distribution. How: slide a small window across the long video, and for each, nudge the student to act like the teacher locally. Why it matters: without it, close-ups lose texture, motion looks mushy, and faces soften. Anchor: A 5-second street view stays sharp with lively traffic, just like the teacher's short clips.
Hook: Think of comparing two opinions: not "how do I cover all views?" but "how do I focus on the strongest, most convincing one?" Reverse-KL Divergence: Reverse-KL is a way to compare two distributions that favors concentrating on the teacher's best, high-fidelity modes. How: it heavily penalizes the student for placing probability where the teacher assigns almost none, so instead of spreading out to cover everything, the student locks onto one sharp peak. Why it matters: it avoids blurry averages and preserves crisp details. Anchor: When copying a star baker, you try to match their signature cake exactly, not an average of all desserts.
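The mean-seeking vs. mode-seeking contrast can be made concrete with a toy one-dimensional example (my own illustration, not from the paper): fit a single-Gaussian "student" to a two-peaked "teacher" distribution. Minimizing forward KL lands the student between the peaks (a blurry average), while minimizing reverse KL locks it onto one peak.

```python
import numpy as np

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
p = 0.5 * gauss(x, -2.0, 0.5) + 0.5 * gauss(x, 2.0, 0.5)  # bimodal "teacher"

def kl(a, b):
    # Numerically integrated KL(a || b), with floors to avoid log(0).
    af, bf = np.maximum(a, 1e-300), np.maximum(b, 1e-300)
    return float(np.sum(a * np.log(af / bf)) * dx)

mus = np.linspace(-4.0, 4.0, 801)
fwd = [kl(p, gauss(x, m, 0.5)) for m in mus]  # mean-seeking: KL(teacher || student)
rev = [kl(gauss(x, m, 0.5), p) for m in mus]  # mode-seeking: KL(student || teacher)
best_fwd = float(mus[np.argmin(fwd)])  # near 0: a blurry average between the modes
best_rev = float(mus[np.argmin(rev)])  # near -2 or +2: one sharp mode
```

Forward KL places the fixed-width student at the teacher's overall mean, exactly the "blurry cat" average; reverse KL settles on one of the two sharp peaks.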
Multiple Analogies for the whole idea:
- Movie set: the director ensures the story makes sense scene to scene (mean-seeking), while the cinematographer keeps every shot crisp and well-lit (mode-seeking). Sharing the same script keeps them aligned.
- Orchestra: the conductor maintains the piece's structure across movements; section leaders polish local passages so they sparkle. Same score, complementary roles.
- Cooking: the recipe ensures the meal flows course by course; the taste-tests keep each bite delicious. One kitchen, two checks.
Before vs After:
- Before: Mixing durations blurred details; teacher-only methods drifted or went static; training-free tricks lacked real memory.
- After: A single model keeps short-clip realism and learns long-range coherence, then generates long videos quickly with a few steps.
Why It Works (intuition): The best signal for minute-scale structure is scarce long videos, which is perfect for mean-seeking flow matching. The best signal for crisp local realism is the abundant knowledge inside a frozen short-video teacher, which is perfect for mode-seeking reverse-KL on sliding windows. Sharing one encoder lets both lessons shape the same long-context understanding without mixing their gradients in one head.
Building Blocks:
- Shared long-context encoder: one representation for both goals.
- Flow Matching head (mean-seeking): learns narrative and stability from long videos.
- Distribution Matching head (mode-seeking): learns local sharpness/motion from the teacher via reverse-KL.
- Sliding windows: apply the teacher check everywhere along the long video.
- Few-step sampler: the local head becomes fast at inference, enabling quick minute-long videos.
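The building blocks above can be sketched as a tiny skeleton. Everything here is a hypothetical stand-in (toy linear layers, made-up dimensions, not the paper's code), meant only to show the shape of the design: one shared encoder feeding two separate heads.

```python
import numpy as np

class DecoupledDDT:
    """Minimal toy skeleton of the decoupled design: shared encoder, two heads."""
    def __init__(self, dim, seed=0):
        r = np.random.default_rng(seed)
        self.enc = r.standard_normal((dim, dim)) / np.sqrt(dim)  # shared long-context encoder
        self.fm = r.standard_normal((dim, dim)) / np.sqrt(dim)   # mean-seeking FM head
        self.dm = r.standard_normal((dim, dim)) / np.sqrt(dim)   # mode-seeking DM head

    def features(self, x):
        # One encoding of the whole (noisy) long-video latent, shared by both heads.
        return x @ self.enc

    def fm_velocity(self, x):
        # Used only during training, supervised on real long videos.
        return self.features(x) @ self.fm

    def dm_velocity(self, x):
        # Used for teacher matching during training and for few-step inference.
        return self.features(x) @ self.dm

model = DecoupledDDT(dim=16)
rng = np.random.default_rng(1)
x = rng.standard_normal((240, 16))          # toy long-video latent sequence
v_fm, v_dm = model.fm_velocity(x), model.dm_velocity(x)
assert v_fm.shape == v_dm.shape == x.shape  # same features, two specialized outputs
```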
03 Methodology
At a high level: Text prompt (and optional image) → add noise to a long video latent → shared long-context encoder builds features → two heads learn different things during training → at test time, use only the local Distribution Matching head to generate the video in a few steps.
Step 0. The stage and actors (data and latents)
- What happens: Videos live in a compressed "latent" space using a video VAE, so training and generation are efficient. We have two data sources: scarce long, coherent videos (tens of seconds to a minute), and a powerful short-video teacher model (≈5-second expertise) that we never retrain.
- Why this step exists: Latents make compute tractable for minuteâlong contexts.
- Example: A 30-second, 8-fps video becomes a latent tensor with 240 time steps; short windows might be 5 seconds (40 frames) long.
Step 1. Add noise along a simple path (rectified flow setup)
- What happens: We create a noising path from a clean video latent toward random noise by linearly mixing them. The model's job is to learn the "velocity" that brings noisy states back to the clean video.
- Why it matters: This sets up a stable training task where learning the velocity field equals learning how to denoise over time.
- Example: Take a real long video latent and a random latent, mix them by a fraction t (0 to 1), and ask the model for the correct velocity to move back to the clean video.
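A minimal sketch of this noising setup, under one common rectified-flow convention (the paper's exact sign and time conventions may differ): interpolate linearly between a clean latent and noise, and define the target velocity as the constant difference along that straight path.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shape (T, C, H, W) for a long-video latent.
x0 = rng.standard_normal((240, 4, 8, 8))  # stand-in for a clean long-video latent
x1 = rng.standard_normal((240, 4, 8, 8))  # pure Gaussian noise

t = 0.3                           # noise fraction in [0, 1]
xt = (1.0 - t) * x0 + t * x1      # linear (rectified-flow) interpolation
v_target = x1 - x0                # constant velocity along the straight path

# Sanity check: stepping from xt back to t=0 with the true velocity recovers x0.
x0_rec = xt - t * v_target
assert np.allclose(x0_rec, x0)
```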
Step 2. Shared long-context encoder reads the whole scene
- What happens: The encoder looks at the noisy long video latent, the timestep t, and conditioning (like text) to build a unified spatiotemporal representation that spans the entire duration.
- Why it matters: Longârange dependencies (who/what/where over time) need a big receptive field; this encoder is the common backbone for both heads.
- Example: For "a child builds a sandcastle, then waves as the tide comes in," the encoder must carry the idea that the same child, beach, and castle persist and evolve.
Step 3. Two lightweight decoder heads with different jobs
A) Flow Matching (FM) head (mean-seeking global anchor)
- What happens: On real long videos, the FM head learns to predict the velocity that maps noisy states back to ground truth. It is trained with supervised flow matching on fullâlength clips.
- Why it matters: Without FM SFT, the model has no reliable signal about true minuteâscale narrative and identity persistence.
- Example: If a dog runs from left to right and later returns, FM anchors those events in the right order without teleporting.
B) Distribution Matching (DM) head (mode-seeking local sharpness)
- What happens: We slide a window (≈5 s) across the student's own generated long video and query the frozen short-video teacher on those windows. Using a reverse-KL style gradient (DMD/VSD), we nudge the student to match the teacher's high-fidelity modes locally.
- Why it matters: Without this, local textures blur and motion loses snap; pure long-video SFT can't teach short-clip finesse.
- Example: As the window passes a city street, the DM head learns from the teacher how to keep car edges crisp and pedestriansâ motion lively.
Step 4. Slidingâwindow alignment details
- What happens: For each training iteration of the DM head, we generate a student long video rollout, crop many overlapping short windows, and compare the student's local "velocity/score" to the teacher's on the same noisy window. We backprop through the student windows (on-policy), not through the frozen teacher.
- Why it matters: On-policy windows ensure we fix the student's actual mistakes in its own contexts, not just mismatches on random data.
- Example: If the student tends to blur faces after 20 seconds, windows near 20s will get strong corrective signals from the teacher.
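A rough sketch of the sliding-window loop, with made-up shapes and stub networks in place of the real student and teacher; it only illustrates how windows are cropped from the student's own rollout and where the DMD/VSD-style student-minus-teacher signal comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

def student_velocity(window):   # hypothetical student query on a noisy window
    return 0.9 * window         # toy stand-in for the real network

def teacher_velocity(window):   # hypothetical frozen short-video teacher
    return window               # toy stand-in for the real network

T, W, stride = 240, 40, 20      # long latent length, ~5 s window, overlap stride
rollout = rng.standard_normal((T, 4, 8, 8))  # student's own rollout (on-policy)

t = 0.5
grads = []
for start in range(0, T - W + 1, stride):
    clean = rollout[start:start + W]  # crop one short window
    noisy = (1 - t) * clean + t * rng.standard_normal(clean.shape)
    # DMD/VSD-style surrogate signal: difference between the student's and the
    # frozen teacher's predictions on the same noisy window.
    grads.append(student_velocity(noisy) - teacher_velocity(noisy))

# In real training this signal is backpropagated through the student windows
# only; the teacher receives no gradient.
```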
Step 5. Boundary fix for windows (practical trick)
- What happens: Some models expect the first latent of a clip to be "image-like." When cropping a window from the middle of a long latent, we reconstruct and re-encode a preceding frame as an image latent to prepend, then mask it so we don't train the VAE.
- Why it matters: Prevents semantic mismatch at window starts, which would confuse the teacher comparison.
- Example: Cropping a window starting at frame 120, we decode frame 119 to RGB, re-encode it as the first latent for the window, and mask its gradients.
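The boundary trick might look roughly like this, with hypothetical VAE stand-ins (the real encode/decode are neural networks) and made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_decode(latent):            # hypothetical VAE decoder stand-in
    return latent.repeat(2, axis=-1)

def vae_encode(frame):             # hypothetical VAE image-encoder stand-in
    return frame[..., ::2]

long_latent = rng.standard_normal((240, 4, 8, 8))
start, W = 120, 40

window = long_latent[start:start + W]
# Decode the frame just before the window and re-encode it as an "image-like"
# first latent, matching what the short-video teacher expects at a clip start.
prev_rgb = vae_decode(long_latent[start - 1:start])
img_latent = vae_encode(prev_rgb)
window = np.concatenate([img_latent, window], axis=0)

# Mask: the prepended latent provides context but receives no training gradient.
grad_mask = np.ones(window.shape[0], dtype=bool)
grad_mask[0] = False
```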
Step 6. Joint training without head conflicts
- What happens: We combine two losses: (i) the FM head gets supervised long-video flow matching; (ii) the DM head gets the reverse-KL DMD/VSD surrogate on sliding windows. Both update the shared encoder; each head only sees its own gradients.
- Why it matters: If one head tried to satisfy both mean-seeking and mode-seeking at once, the gradients would fight (averaging vs. sharpening). Decoupling avoids interference.
- Example: The shared encoder learns that "this is the same character at the beach," while the DM head, guided by the teacher, ensures hair strands and wave foam stay detailed.
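In pseudocode terms, the joint objective routes each loss to its own head. The sketch below uses toy linear stand-ins and, purely for illustration, replaces the DMD/VSD surrogate with a plain regression to the teacher's velocity; the point is only the routing, not the losses themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy parameters for the shared encoder and the two heads.
enc_w, fm_w, dm_w = 1.0, 1.0, 1.0

def encoder(x):  return enc_w * x   # shared long-context features
def fm_head(h):  return fm_w * h    # mean-seeking velocity prediction
def dm_head(h):  return dm_w * h    # mode-seeking velocity prediction

x = rng.standard_normal(16)
v_gt = rng.standard_normal(16)       # ground-truth velocity from a real long video
v_teacher = rng.standard_normal(16)  # frozen teacher's velocity on a short window

h = encoder(x)
loss_fm = np.mean((fm_head(h) - v_gt) ** 2)       # supervised flow matching (FM head)
loss_dm = np.mean((dm_head(h) - v_teacher) ** 2)  # teacher-matching surrogate (DM head)

# Both losses update the shared encoder, but each head sees only its own loss,
# so mean-seeking and mode-seeking gradients never mix inside one head.
total = loss_fm + loss_dm
```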
Step 7. Fast inference with the DM head
- What happens: At test time, we drop the FM head and use the DM head as a few-step sampler to generate the whole long video end-to-end (bidirectionally). The encoder still provides long-context features shaped by FM training.
- Why it matters: We get the best of both worlds: long-range coherence (learned by FM) and crisp local realism (learned by DM), and we render quickly.
- Example: A 30-second prompted video comes out with sharp subjects and stable story beats in just a handful of denoising steps.
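A few-step Euler sampler over the learned velocity field could look like the sketch below. The DM head is replaced by a stand-in whose rectified-flow target is the all-zero latent, so under the straight-path convention four steps land exactly on that target; a real model would condition on text and take similarly few steps.

```python
import numpy as np

rng = np.random.default_rng(0)

def dm_velocity(x, t):
    # Hypothetical DM-head query. This stand-in follows the rectified-flow
    # path x_t = t * noise when the clean target is the zero latent: v = x / t.
    return x / t

steps = 4                              # a handful of denoising steps
ts = np.linspace(1.0, 0.0, steps + 1)  # t=1 is pure noise, t=0 the clean video

x = rng.standard_normal((240, 4, 8, 8))  # noise for the whole long-video latent
for t_hi, t_lo in zip(ts[:-1], ts[1:]):
    x = x + (t_lo - t_hi) * dm_velocity(x, t_hi)  # one Euler step toward t_lo
```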
The Secret Sauce:
- Decoupling objectives: mean-seeking (global FM) and mode-seeking (local DM) live in separate heads, so they cooperate instead of collide.
- Sliding-window reverse-KL (DMD/VSD): uses the teacher only where it's strongest (short windows) across the entire long video.
- On-policy updates: fix errors the student actually makes during its own rollouts.
- Few-step generation: the DM head becomes a fast sampler, turning long video creation from slow to snappy.
04 Experiments & Results
The Test: The authors asked, "Can our model keep both the close-up sparkle and the long-range story over half a minute?" They generated 30-second videos for 200 long-form prompts and measured:
- Subject consistency: Does the main character stay the same?
- Background consistency: Does the scene remain plausible over time?
- Motion smoothness and temporal flicker: Does movement feel natural without jitter?
- Dynamic degree: Is the video lively rather than frozen?
- Aesthetic and image quality: Do frames look pleasing and detailed?
- VLM-based consistency (Gemini-3-Pro): A smart model's score for semantic consistency, with penalties if the video is just a still image.
The Competition (Baselines):
- Long-context SFT: Finetune only on long videos.
- Mixed-length SFT: Train on a mixture of short and long videos (the industry standard).
- CausVid, Self-Forcing: Teacher-driven autoregressive rollouts.
- InfinityRoPE: A strong length-extension trick that can go long but may get too static.
The Scoreboard (contextualized):
- The proposed method scored top-tier across the board and especially shined by balancing sharpness (image quality 0.6982) and motion smoothness (0.9863) with strong consistency (subject 0.9682, background 0.9548). Think of this as getting an A in story memory and an A in close-up cinematography, while many baselines got an A in one column but a B- or C in the other.
- The dynamic degree stayed high (0.9453), indicating the videos were not just still frames pretending to be consistent. Aesthetic quality (0.5735) was competitive while maintaining motion and identity, which matters because some baselines looked pretty but drifted, or stayed consistent by becoming too still.
Surprising/Notable Findings:
- AR teacher-only methods often looked punchy at first but drifted or became over-saturated and less dynamic as time went on. This boosted some consistency metrics artificially (because not much moved), but the proposed method still outperformed them overall.
- Mixed-length SFT helped coherence, yet tended to wash out crisp textures, confirming that a single mean-seeking objective under scarce long-video data blurs details.
Ablations (what parts matter?):
- No decoupled dual heads: Biggest drop; confirms that mixing mean-seeking and mode-seeking into one head creates gradient conflicts.
- No sliding-window DMD (teacher matching): Model reverts to SFT-only; coherence is okay, but local sharpness and motion suffer.
- No SFT (only teacher windows): Motion can be decent, but global story/consistency drops; short-clip teachers are blind to minute-scale narrative.
- Full model: Best overall; each ingredient is necessary to close the fidelity-horizon gap.
Takeaway from the numbers: The new model simultaneously improves near-frame quality and long-range coherence, while keeping videos dynamic. That combination is exactly what long-form generation needs and what prior approaches struggled to achieve together.
05 Discussion & Limitations
Limitations:
- Needs a strong short-video teacher: If the teacher is weak or out-of-domain (e.g., unusual art styles), local sharpness learning may be limited.
- Long-video data scarcity remains: Although minimized, the method still requires some high-quality long videos for supervised flow matching.
- Compute and memory: Training with long contexts and sliding windows is non-trivial, even with efficient latents and parallelism.
- Domain shifts: Extremely fast scene changes, complex multi-shot edits, or rare events not seen by either the teacher or the long-video set can still challenge coherence.
Required Resources:
- A capable short-video diffusion/flow teacher with accessible velocity/score queries.
- A curated long-video set (tens of seconds to a minute) covering narrative structures of interest.
- GPUs with sufficient memory; sequence/attention optimizations (e.g., DeepSpeed Ulysses) help.
When NOT to Use:
- If you only need short clips (≤5 s) with top fidelity, direct short-video models may be simpler and faster.
- If there is zero long-video data in your domain, the method can preserve local realism but won't learn true minute-scale narrative.
- If your output must be strictly causal at inference (frame-by-frame streaming), you might prefer AR designs or adapt this model with causal masks.
Open Questions:
- How best to merge this bidirectional fast generator with causal AR rollouts for interactive or streaming scenarios?
- Can we distill the learned longâcontext knowledge into even fewer steps without quality loss?
- How robust is the approach to style transfer, multi-shot storytelling, or cross-modal controls (audio, actions)?
- Can retrieval or memory modules further strengthen long-range identity and scene persistence beyond a minute?
06 Conclusion & Future Work
3-Sentence Summary: This paper teaches one model to do two complementary jobs: mean-seeking flow matching learns minute-scale story structure from scarce long videos, while mode-seeking distribution matching copies crisp local realism from a frozen short-video teacher. It keeps these goals in separate decoder heads that share a long-context encoder, preventing their gradients from fighting. At test time, the local head generates long videos in a few steps, yielding outputs that are both sharp and coherent.
Main Achievement: Closing the fidelity-horizon gap (preserving short-clip sharpness and motion while scaling to minute-long coherence) via a simple, effective decoupling of objectives inside a single Decoupled Diffusion Transformer.
Future Directions: Combine this fast bidirectional model with causal autoregressive streaming; integrate retrieval/memory for even longer horizons; extend to multi-shot narratives and audio-visual controls; and explore distilling to ultra-few-step samplers.
Why Remember This: It's a clean recipe: teach long-range story with mean-seeking, teach close-up sparkle with mode-seeking, let one encoder tie them together, and run fast at inference. That simple split solves a stubborn problem others tried to brute-force, making long, lively, and consistent videos much more practical.
Practical Applications
- Rapid prototyping of minute-long storyboards and animatics that stay coherent and sharp.
- Educational videos where diagrams and labels remain consistent across long explanations.
- Marketing and product demos with stable branding and style over extended sequences.
- Previsualization for film and TV that preserves character identity across scenes.
- Game trailer or cut-scene generation with lively motion and long-range narrative flow.
- Long-horizon simulation clips for robotics or autonomous systems world models.
- Social media content creation where longer edits keep local detail and avoid drift.
- Sports highlight synthesis that keeps players, uniforms, and fields consistent across plays.
- Travel or nature videos with steady scenes and crisp textures over many seconds.
- Interactive editing/animation pipelines that require fast, few-step long-video rendering.