Helios: Real Real-Time Long Video Generation Model
Key Summary
- Helios is a 14-billion-parameter video model that can make minute-long videos in real time at about 19.5 frames per second on a single NVIDIA H100 GPU.
- It keeps videos steady over time without using the common anti-drift tricks like self-forcing, error-banks, or keyframe sampling.
- It reaches real-time speed without popular accelerators like KV-cache, sparse/linear attention, or quantization.
- A new idea called Unified History Injection lets one model do text-to-video, image-to-video, and video-to-video by feeding history and noisy future together.
- Guidance Attention makes the model listen to clean history and focus denoising only on the noisy future section.
- Easy Anti-Drifting tackles three drift types (position, color, restoration) with Relative RoPE, a First-Frame Anchor, and Frame-Aware Corrupt.
- Deep Compression Flow cuts token counts and sampling steps using Multi-Term Memory Patchification and a Pyramid Unified Predictor-Corrector, dropping from 50 sampling steps to just 3 after distillation.
- Adversarial Hierarchical Distillation trains the fast version using only a strong autoregressive teacher and real data, avoiding expensive rollout training.
- Across short and long video tests, Helios matches or beats strong baselines in quality while being dramatically faster.
- The team also built HeliosBench, a 240-prompt test set to fairly measure both speed and long-horizon stability.
Why This Research Matters
Real-time, long, stable video generation unlocks new kinds of interactive creativity: you can steer a scene as it's being made, like co-directing a movie live. Educators can generate minute-long visual explanations on the fly, keeping students engaged with consistent, clear visuals. Game engines can synthesize background action in real time, changing scenery and stories as players make choices. Small studios and creators can iterate faster on storyboards and trailers without huge render farms. And social apps can let users turn a single photo into a smooth, minute-long story right on the spot. In short, Helios makes high-quality video generation feel immediate, steady, and useful for everyday creative work.
Detailed Explanation
01 Background & Problem Definition
You know how filming a school play is easy if it's just a 10-second clip, but keeping the camera steady, the colors right, and the story clear for several minutes is much harder? That's what video AIs have been facing. For a while, AI could make short, pretty clips, but longer videos went wobbly: objects drifted, colors slid, or the scene slowly turned blurry. And even short clips could take a long time to render, so "real-time" felt like a dream.
The world before: Video diffusion transformers became amazing at short clips. But making long, interactive, game-like videos at high quality and at interactive speeds was still out of reach. Most open systems topped out around 5-10 seconds and needed many minutes to render just that. When people tried to make videos much longer, the model would "forget" what it had made before, causing identity shifts (like a character changing looks), color shifts (like scenes becoming too bright or too dull), or restoration glitches (like getting blurrier over time).
The problem: How do we generate long, minute-scale videos that stay coherent while also being fast enough to feel live? Three roadblocks stood in the way. First, drifting: over time, errors pile up and the video slides off-course. Second, speed: big models produce better detail and motion but quickly become too slow. Third, practicality: training giant video models usually needs complex parallel setups and memory sharding, which makes them hard to build and share.
Failed attempts: Many teams fought drifting by copying the inference process during training (called self-forcing), keeping error libraries (error-banks), or carefully picking special frames (keyframes). These tricks helped some, but they were costly and often tied to how long you rolled out during training: if you trained on 5-second chunks, drift came roaring back past 5 seconds. Others tried switching to causal masking architectures, but that changed how the model reasons over frames and sometimes hurt quality. For speed, popular boosts like KV-cache, sparse attention, or quantization helped, but often came with trade-offs, extra complexity, or still weren't fast enough at large scale.
The gap: We needed a way to (1) prevent drifting without heavyweight tricks, (2) reach real-time speed without changing the model's brain too much or relying on special hardware tricks, and (3) train a large model with simple, image-model-like batch sizes and no fancy sharding. In other words, we wanted long, stable, fast videos from a big model, trained and run in a simpler way.
Real stakes: This matters for daily life. Imagine a game engine that can generate scenes on the fly as you play, or a live streamer who prompts a model to create background animations in real time. Think of teachers explaining science with live, adjustable visuals, filmmakers prototyping shots instantly, or social apps letting you remix a photo into a smooth, minute-long story as you watch. These all need long, stable, fast video generation. Helios steps in to meet that need.
Helios's claim: Build a 14B-parameter model that (a) runs at about 19.5 FPS on one H100 GPU, (b) keeps quality on minute-scale videos without the usual anti-drift recipes, and (c) is trained without parallelism/sharding, while fitting multiple copies in 80 GB by carefully compressing tokens and reducing sampling steps. The big idea is to treat long-video generation as endless continuation: keep a clean, compressed memory of what happened, add a small noisy window for the next chunk, and let the model focus on denoising only that window, guided by the clean memory and the text. Add simple training tricks that simulate drift and fix positional encodings, and then compress tokens and steps so the whole system can run live.
That's the story: before, long videos were slow and wobbly; after, Helios shows a path to fast, steady, minute-long videos from a single big model, without leaning on the standard crutches.
02 Core Idea
Aha in one sentence: Treat long video as infinite, guided continuation: keep a smart, compressed, clean memory of the past, denoise just a small future window, and make the past strongly guide the future while removing waste so it runs in real time.
Three analogies:
- Relay race: The baton (clean history) guides the next runner (noisy future) exactly where to go; you don't re-run the whole race, you just pass the baton well and sprint the next leg.
- Painting mural panels: You sketch the next panel using a small canvas while constantly looking at the already-finished panels for guidance; you don't repaint the whole wall each time.
- Cooking with leftovers: You taste the dish you just made (history) and then season the next pot (noisy future) to match; you don't start over every time.
Before vs. after:
- Before: Models drifted after their training horizon, and real-time speeds needed heavy accelerations or tiny models that lost detail.
- After: Helios controls drift at its roots (position, color, restoration) and runs a 14B model at ~19.5 FPS without KV-cache, sparse attention, or quantization, while matching or beating quality of strong baselines.
Why it works (intuition, no equations):
- If you always keep a clean, compressed snapshot of the past and force your network to listen to it, then you don't need to keep guessing the past again; you only polish the future window.
- If you simulate the kinds of problems that happen in long runs during training (like color or blur shifts), the model learns to resist them at test time.
- If you shrink the number of tokens you process (compress distant history more, and use low-res for early denoising steps), you spend compute only where it matters, which unlocks real-time speed.
Building blocks (each explained with the Sandwich pattern):
Hook: You know how a long movie needs to stay consistent from start to end?
Long-Video Generation: It means creating videos that last a long time while staying coherent. How: generate in chunks and make sure each new chunk matches the story so far. Why it matters: without it, characters, colors, or scenes will slowly drift and confuse viewers.
Anchor: A single-minute travel vlog where the scenery changes naturally but the traveler's look and colors stay consistent.
Hook: Imagine writing a story one paragraph at a time, always reading what you wrote before.
Autoregressive Diffusion Model: A model that makes the next part of a video by denoising a noisy window, guided by past frames. How: add noise to the next window, then repeatedly predict and remove noise while peeking at clean history. Why it matters: without using history, each chunk forgets what came before.
Anchor: Animating a walking dog: each new step uses the last steps to keep the gait consistent.
Hook: Think of building block towers and adding new layers on top.
Unified History Injection: A method to feed the clean past (history) and the noisy future window together into the model. How: concatenate history and noisy future, give history a special "clean" time signal, and let it guide the future. Why it matters: without injecting history, the future wanders off.
Anchor: Extending a city skyline painting by taping the finished part next to your new canvas so you can match buildings.
Hook: Like telling your GPS whether you're starting a trip (no history), mid-trip (last photo), or continuing a road you've been on (video history).
Representation Control: One input format unlocks T2V, I2V, and V2V: zeros for T2V, last frame for I2V, full past for V2V. How: pick which parts of history are zero or filled; the model auto-switches behavior. Why it matters: without it, you'd need separate models or hacks.
Anchor: Type a text to start a scene; drop in an image to extend it; or feed a clip to continue it, all with the same Helios interface.
Hook: You know how you listen to a coach more than crowd noise?
Guidance Attention: An attention design that boosts useful history signals to guide denoising of the noisy window. How: treat history as clean (timestep 0), amplify history keys per head, and apply text cross-attention only to the noisy part. Why it matters: without it, the model may blur or overreact to the wrong signals.
Anchor: Keeping a character's face stable across scenes while changing background motion smoothly.
Hook: Trains stay on tracks when the rails are set right.
Easy Anti-Drifting: A trio of tricks to stop three drifts: position (time indexing), color (appearance drift), and restoration (blur/noise). How: Relative RoPE for time positions, First-Frame Anchor to lock global look, and Frame-Aware Corrupt to simulate realistic history errors during training. Why it matters: without these, long videos wobble and degrade.
Anchor: A bird documentary where the bird stays the same bird, colors don't wash out, and motion remains sharp even after many seconds.
Hook: Studying far-away history like skimming headlines, and recent history like reading every word.
Multi-Term Memory Patchification: Compress distant history more than nearby history to keep the token budget fixed. How: split history into short/mid/long term and apply bigger compression kernels the farther back you go. Why it matters: without it, tokens explode and you run out of memory or time.
Anchor: Remembering the last 2 seconds in detail, the last 10 seconds in medium detail, and the minute before as a compact summary.
Hook: Artists sketch big shapes first and add details later.
Pyramid Unified Predictor-Corrector: Do early denoising at low resolution (cheap) and refine at higher resolution (quality). How: sample across stages from low to high res, with a UniPC-style corrector inside each stage. Why it matters: without it, you waste compute early and stay slow.
Anchor: Rough car and road first, crisp headlights and textures later.
Hook: A coach who first imitates a great player, then polishes with real matches.
Adversarial Hierarchical Distillation: Teach a fast 3-step student from a strong autoregressive teacher plus a GAN head on real data. How: train in stages, no long rollouts, match the teacher distribution, then add an adversarial boost. Why it matters: without it, you need 50 steps or expensive rollouts.
Anchor: Dropping from 50 to 3 steps while keeping natural motion and detail.
Hook: Packing a suitcase tightly so everything fits without wrinkling.
Deep Compression Flow: The overall plan to reduce tokens and steps to reach real time. How: compress history tokens, sample in pyramids, and distill steps down. Why it matters: without compression, a 14B model can't run live.
Anchor: 14B Helios hitting ~19.5 FPS on one H100.
Hook: Setting a ruler that always starts where you are now, not at zero miles.
Relative RoPE: Use relative time indices per section so positions don't drift or loop. How: keep history at 0..T_hist and future at T_hist..T_hist+T_noisy for every chunk. Why it matters: without it, you get periodic resets or cycles.
Anchor: A skateboarder keeps moving forward smoothly instead of snapping back to the start.
Hook: Pinning your first photo as the color reference.
First-Frame Anchor: Always keep the very first frame in history to stabilize global appearance. How: include it in history every time; it's a strong color/identity anchor. Why it matters: without it, color and identity drift over time.
Anchor: A character's jacket stays the same red across the whole video.
Hook: Practice dealing with messy inputs so test day is easy.
Frame-Aware Corrupt: Randomly add realistic noise/blur/exposure shifts to history frames during training. How: independently corrupt each historical frame with certain probabilities. Why it matters: without it, small errors stack up at inference.
Anchor: Slightly blurry past frames don't confuse the model later.
Hook: Testing cars on a track with timed laps and different distances.
HeliosBench: A 240-prompt benchmark with different durations to measure quality, drifting, and FPS. How: prompts across 81, 240, 720, 1440 frames with multiple metrics. Why it matters: without a fair track, you can't compare models honestly.
Anchor: Helios scores high while staying fast across all distances.
03 Methodology
At a high level: Text/Image/Video + History/Future Setup → Unified History Injection (Representation Control + Guidance Attention) → Easy Anti-Drifting (Relative RoPE + First-Frame Anchor + Frame-Aware Corrupt) → Deep Compression Flow (Multi-Term Memory Patchification + Pyramid Unified Predictor-Corrector) → Adversarial Hierarchical Distillation → Real-time Video Output.
Step A: Set up inputs with Unified History Injection
- What happens: We split the input into two parts: a clean, compressed historical context (what already happened) and a small, noisy future window (what we're about to generate). We feed both together so the past can guide the future.
- Why this step exists: If the model doesn't see clean history, it guesses from scratch and drifts.
- Example: For a 384×640 video, we might keep T_hist = 64 compressed frames and denoise T_noisy = 16 frames next. The history includes the very first frame and the most recent chunk.
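As a concrete (and heavily simplified) sketch of this setup, the following assumes flow-style noising and made-up shapes; `build_model_input` and all names here are illustrative, not from the Helios codebase:

```python
import numpy as np

# Hedged sketch of Unified History Injection: clean history and a noisy
# future window are concatenated along time, with a per-frame timestep
# signal marking history as "already clean" (t = 0).
def build_model_input(history, future, t_noise, rng):
    """history: (T_hist, C) clean latents; future: (T_noisy, C) clean latents."""
    noise = rng.standard_normal(future.shape)
    # corrupt only the future window; history stays untouched
    noisy_future = (1.0 - t_noise) * future + t_noise * noise
    tokens = np.concatenate([history, noisy_future], axis=0)
    timesteps = np.concatenate([
        np.zeros(len(history)),            # history: clean signal
        np.full(len(future), t_noise),     # window: current noise level
    ])
    return tokens, timesteps

rng = np.random.default_rng(0)
hist = rng.standard_normal((64, 8))   # T_hist = 64 compressed frames
fut = rng.standard_normal((16, 8))    # T_noisy = 16 frames to denoise
tokens, ts = build_model_input(hist, fut, t_noise=0.7, rng=rng)
```

The model then sees one sequence of 80 tokens but only learns to denoise the final 16.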
Inside A1: Representation Control (task switching)
- What: One input format supports T2V, I2V, and V2V. History all zeros ā T2V; last frame present ā I2V; full clip present ā V2V.
- Why: Avoid separate models and keep behavior consistent.
- Example: Start from text for 5 seconds; then drop in a still image to change style; then feed back the produced video to continue, with no model swap needed.
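A minimal sketch of how one buffer could encode all three tasks; the 64-frame buffer size and the `make_history` helper are assumptions for illustration:

```python
import numpy as np

T_HIST = 64  # illustrative fixed history budget

# Representation Control sketch: the fill pattern of one fixed-size
# history buffer selects the task, with no model switch.
def make_history(mode, past_frames=None, frame_dim=8):
    hist = np.zeros((T_HIST, frame_dim))       # all zeros -> T2V
    if mode == "i2v":
        hist[-1] = past_frames[-1]             # seed with the conditioning image
    elif mode == "v2v":
        n = min(T_HIST, len(past_frames))
        hist[-n:] = past_frames[-n:]           # fill with the most recent clip
    return hist

frames = np.ones((100, 8))                     # stand-in for encoded past frames
```

`make_history("t2v")` is all zeros, `"i2v"` carries only the last frame, and `"v2v"` carries the most recent 64 frames.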
Inside A2: Guidance Attention (who influences whom)
- What: Mark history as clean (t=0), amplify its keys per attention head, and apply text cross-attention only to the noisy window.
- Why: Prevent the model from re-denoising history or overreacting to noisy signals; make history strongly guide the new frames.
- Example: In a car chase, history locks the car's identity and position trends; the noisy window learns the next frames' motion.
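A single-head sketch of the key-amplification idea; the gain value, the absence of learned projections, and all names are assumptions (the real design works per attention head inside a transformer, with text cross-attention handled separately):

```python
import numpy as np

# Guidance Attention sketch: history keys are scaled up so the clean
# past attracts more attention mass from noisy-window queries.
def guidance_attention(q, k, v, T_hist, history_gain=1.5):
    k = k.copy()
    k[:T_hist] *= history_gain                  # boost history keys
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over all tokens
    return w @ v, w

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))   # queries: the 16 noisy-window tokens
k = rng.standard_normal((80, 8))   # keys over history + window
v = rng.standard_normal((80, 8))
out, weights = guidance_attention(q, k, v, T_hist=64)
```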
Step B: Easy Anti-Drifting during training
- What happens: We fix time indexing with Relative RoPE, lock appearance with the First-Frame Anchor, and simulate messy past with Frame-Aware Corrupt.
- Why this step exists: Long-range drift has three roots: wrong positions (repetition), color creep, and restoration errors. We train the model to resist each.
- Examples:
- Relative RoPE: Every chunk uses a local time ruler, preventing periodic resets.
- First-Frame Anchor: The original first frame stays in history forever, calming color stats.
- Frame-Aware Corrupt: Randomly add noise/blur/exposure to historical frames so the model learns to handle imperfect past.
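The first and third tricks can be sketched in a few lines; the index layout and corruption parameters are illustrative assumptions, and only additive noise is shown for Frame-Aware Corrupt (the paper also mentions blur and exposure shifts):

```python
import numpy as np

# Relative RoPE sketch: each chunk reuses the same local time ruler
# instead of ever-growing absolute indices, so positions never wrap.
def relative_time_indices(T_hist, T_noisy):
    return np.arange(T_hist + T_noisy)  # 0..T_hist-1 history, rest future

# Frame-Aware Corrupt sketch: independently perturb each history frame
# during training so the model tolerates an imperfect past at inference.
def frame_aware_corrupt(history, rng, p=0.5, sigma=0.1):
    out = history.copy()
    for i in range(len(out)):
        if rng.random() < p:                    # corrupt this frame?
            out[i] += sigma * rng.standard_normal(out[i].shape)
    return out

idx = relative_time_indices(64, 16)
rng = np.random.default_rng(0)
clean = np.zeros((64, 8))
corrupted = frame_aware_corrupt(clean, rng)
```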
Step C: Deep Compression Flow to reach real-time
C1) Multi-Term Memory Patchification (reduce history tokens)
- What happens: Split history into short/mid/long-term and compress more aggressively the farther back you go, keeping a near-constant token budget as history grows.
- Why: Without this, tokens blow up with history length, causing OOM or slowdowns.
- Example: Short-term history might use small kernels (t×h×w = 4×8×8), mid-term medium, long-term large; the total token count stays roughly fixed even as you extend context.
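A time-only sketch of this idea; the split sizes and the 1/2/4 pooling kernels are illustrative assumptions, and the actual method also compresses spatially:

```python
import numpy as np

# Multi-Term Memory Patchification sketch: recent frames kept intact,
# older spans mean-pooled with growing kernels, shrinking the token count.
def patchify_history(frames, short=8, mid=16):
    recent = frames[-short:]
    middle = frames[-(short + mid):-short]
    old = frames[:-(short + mid)] if len(frames) > short + mid else frames[:0]

    def pool(x, k):  # average each group of k frames into one token
        x = x[: len(x) - len(x) % k]
        return x.reshape(-1, k, x.shape[-1]).mean(axis=1)

    return np.concatenate([pool(old, 4), pool(middle, 2), recent], axis=0)

frames = np.random.default_rng(0).standard_normal((64, 8))
compressed = patchify_history(frames)   # 40/4 + 16/2 + 8 = 26 tokens
```

64 history frames collapse to 26 tokens while the newest 8 frames survive untouched.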
C2) Pyramid Unified Predictor-Corrector (reduce noisy tokens and steps)
- What happens: Do early denoising at low resolution (cheap) focusing on structure; later, refine at higher resolution (quality). Use a UniPC-like corrector per stage and reset caches at stage switches to stay stable.
- Why: Early steps don't need full resolution; saving tokens early saves huge compute.
- Example: Start at 1/4 resolution for 1 step (layout/colors), then 1/2 resolution for 1 step (edges), then full resolution for 1 step (textures): 3 steps in total after distillation.
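A toy version of the coarse-to-fine loop; the down-factors, nearest-neighbor upsampling, and the stand-in `denoise` callable are assumptions, and the UniPC-style corrector math is omitted:

```python
import numpy as np

# Pyramid sampling sketch: one denoising step per stage, coarse to fine,
# matching the distilled 3-step schedule described above.
def pyramid_sample(denoise, shape=(16, 16), factors=(4, 2, 1), rng=None):
    rng = rng or np.random.default_rng(0)
    h, w = shape
    x = rng.standard_normal((h // factors[0], w // factors[0]))  # start coarse
    for i, f in enumerate(factors):
        x = denoise(x)                          # predictor (+corrector) step
        if i + 1 < len(factors):
            r = f // factors[i + 1]
            x = x.repeat(r, axis=0).repeat(r, axis=1)  # move to finer grid
    return x

out = pyramid_sample(lambda x: 0.5 * x)         # dummy denoiser stand-in
```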
Step D: Adversarial Hierarchical Distillation (drop from 50 steps to 3)
- What happens: Use a strong autoregressive teacher (Helios-Base) to supervise a fast student, with staged backward simulation and a small GAN head trained on real data to boost realism.
- Why: Few-step generation needs guidance; GAN post-training breaks through the teacher's ceiling and sharpens details.
- Example: The 3-step student learns to match the 50-step teacher's distribution and then improves naturalness using real video patches via a discriminator.
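The shape of such an objective might look like this; the loss forms, weighting, and the staged schedule are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

# Distillation-objective sketch: match the teacher's prediction, plus a
# non-saturating GAN term that rewards fooling the discriminator.
def distill_loss(student_out, teacher_out, disc_score_on_student, adv_weight=0.1):
    match = float(np.mean((student_out - teacher_out) ** 2))
    # softplus(-score): small when the discriminator rates student clips as real
    adv = float(np.mean(np.log1p(np.exp(-disc_score_on_student))))
    return match + adv_weight * adv

# A student that matches the teacher and already fools the discriminator
# gets a near-zero loss.
loss = distill_loss(np.zeros(4), np.zeros(4), np.full(4, 10.0))
```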
Step E: Infrastructure-level optimizations (training practicality)
- What happens: Token compression plus pyramids shrink activation sizes; memory tricks (like caching gradients for GAN, sharded EMA, async VRAM freeing) let multiple 14B components fit within 80 GB when needed; custom Triton kernels speed LayerNorm/RMSNorm and RoPE.
- Why: Large video models are memory-bound; these changes make training/inference feasible without parallelism/sharding for early stages, and manageable for later stages.
- Example: Flash Normalization and Flash RoPE shave seconds off per 50-step pass; sharded EMA avoids duplicating FP32 weights.
Putting it together in a toy walk-through:
- Input: "A yellow sports car drives along a mountain road at sunset." History initially zero (T2V).
- A: Unified History Injection concatenates history (zeros) + noisy future window; Representation Control knows this is T2V; Guidance Attention ensures text steers the noisy window while history stays clean.
- B: During training, Relative RoPE keeps time indices local; First-Frame Anchor holds global look; Frame-Aware Corrupt teaches resilience to noisy past.
- C: Multi-Term Memory Patchification summarizes earlier chunks; Pyramid sampling handles early structure at low-res, then details at high-res.
- D: Distillation shrinks from 50 steps to 3 while keeping quality; a tiny GAN head nudges realism.
- Output: A minute-scale, coherent, natural video at roughly real-time speed on a single H100.
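The whole walk-through boils down to an outer continuation loop like this sketch; the 64-frame history budget, window size, and identity stand-in for the model call are assumptions:

```python
import numpy as np

# Endless-continuation sketch: keep the first frame as an anchor plus
# the most recent frames as history, denoise a fresh window, append it,
# and repeat for as long as you like.
def generate(num_chunks, denoise_window, T_noisy=16, frame_dim=8, rng=None):
    rng = rng or np.random.default_rng(0)
    video = []
    for _ in range(num_chunks):
        history = video[:1] + video[1:][-63:]     # First-Frame Anchor + recent past
        window = rng.standard_normal((T_noisy, frame_dim))
        clean = denoise_window(history, window)   # the model call would go here
        video.extend(list(clean))
    return np.stack(video)

clip = generate(4, lambda hist, win: win)         # identity stand-in model
```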
Secret sauce (why it's clever):
- It reframes the problem from "generate forever with causal masks" to "always continue from a clean, compressed past," preserving bidirectional inference quality.
- It attacks drift at its roots with simple, robust training tricks instead of heavy rollout heuristics.
- It spends compute where it matters (nearby history and late refinements) and saves everywhere else (far history and early steps).
04 Experiments & Results
The test: The team built HeliosBench, a 240-prompt test set split across durations: 81 (very short), 240 (short), 720 (medium), and 1440 (long) frames. They measured speed (end-to-end FPS, including VAE and text encoder) and quality metrics mapped to a 10-point scale: Aesthetic, Dynamic (motion amount), Motion Smoothness, Semantic (text alignment), and Naturalness. They also tracked drifting across dimensions to see how stability held over time.
The competition: Helios was compared to a wide range of base and distilled models across sizes, including SANA Video, CogVideoX, Mochi, HV Video, Wan, LTX Video, Kandinsky, StepVideo, NOVA, Pyramid Flow, MAGI, InfinityStar, SkyReelsV2, CausVid, Self-Forcing, Rolling Forcing, LongLive, Infinite Forcing, Reward Forcing, Dummy Forcing, SANA Video Long, and Krea. Some baselines used heavy acceleration tricks; others relied on small backbones (ā1.3B) for speed.
The scoreboard with context:
- Speed: Helios-Distilled (14B) ran at about 19.5 FPS on a single H100, like finishing a mile in 4 minutes when many runners need 10. It's not just faster than similar-scale systems; it even beats some 1.3B distilled models by a wide margin.
- Short videos (81 frames): Helios reached an overall score around 6.00 on the 10-point scale, matching or surpassing most base models and outperforming several distilled ones. In plain terms: it gets an A- while many get Bs, even though it runs much faster.
- Long videos (up to 1440 frames): Helios achieved a total score around 7.08 with strong Naturalness and low drifting. Think of staying on the honor roll across the whole school year, not just the first month.
- Drift: Compared to strong forced-rollout baselines, Helios kept lower drift in Aesthetic, Semantic, and Naturalness. That means colors stayed consistent, text instructions stayed followed, and scenes remained believable as the minutes passed.
Surprising findings:
- Big and fast can coexist: The 14B Helios hits real-time speeds without KV-cache, sparse attention, or quantization. That's like a race car going top speed without using turbo boosters.
- No self-forcing needed: Helios avoided the usual train-as-infer rollouts yet matched or beat methods that depend on them. Relative RoPE, First-Frame Anchor, and Frame-Aware Corrupt covered much of what rollouts tried to fix.
- Few steps, high quality: Distilling from 50 to 3 steps usually hurts a lot, but with an autoregressive teacher and adversarial post-training, Helios preserved detail and natural motion.
Qualitative examples: Side-by-sides show Helios maintaining character identity and lighting over hundreds of frames, while some baselines become saturated, blur, or flicker. In action scenes, Helios balances motion and smoothness (lively but not jittery), matching how humans expect physics to behave.
Throughput comparison: SANA Video Long is smaller (2B) but slower than Helios; Krea-RealTime-14B is notably slower on H100 and drifts more. Versus Wan-14B base models, Helios is vastly faster and holds up better on long horizons after distillation.
Takeaway: Across both short and long benchmarks, Helios consistently delivers better or comparable quality at much higher speeds, while demonstrating strong anti-drift without the usual heavyweight tricks.
05 Discussion & Limitations
Limitations:
- Hardware still matters: Although Helios avoids many accelerators, hitting ~19.5 FPS used a single H100; smaller GPUs will run slower. Training stages used large GPU counts in later phases (e.g., 128 H100s), so reproducing training end-to-end requires serious compute.
- Model size: 14B parameters give rich motion and detail, but the size can be heavy for edge devices or mobile scenarios.
- Resolution bounds: Training capped at 384×640 and 109-frame snippets per training example; while the system generalizes well, pushing to much higher resolution or drastically longer horizons may need extra tuning.
- Metrics gap: Automated video metrics don't perfectly match human taste; while user studies help, there's room for better evaluators.
Required resources:
- For best performance: an H100-class GPU for real-time inference; large multi-GPU setups for full training (especially Stage 3 distillation).
- Memory-savvy code: benefits from custom kernels (Flash Norm, Flash RoPE), sharded EMA, and async offloading.
When NOT to use:
- Ultra-high-res film output (e.g., 4K at high FPS) on tight hardware budgets; the current design targets real time at moderate resolution.
- Situations needing strict causal masking behavior or exact replication of a bidirectional pretrain regime; Helios intentionally preserves bidirectional inference without masks.
- Edge devices with very limited VRAM or older GPUs without the bandwidth to benefit from the compression tricks.
Open questions:
- Scaling to higher resolutions: Can the pyramid and memory compression ideas stretch to 1080p or 4K at real time?
- Even fewer steps: Can we go from 3 steps to 1ā2 without losing naturalness?
- Better drift detectors: Could smarter, learned drift monitors improve adaptive sampling during inference?
- Universal world modeling: How well does Helios's continuation framing generalize to interactive 3D engines or multi-view video?
- Robustness to edits: Prompt interpolation helps; can we support complex mid-video edits (object insertion, camera path changes) with guaranteed smoothness?
06 Conclusion & Future Work
Three-sentence summary: Helios reframes long video generation as infinite continuation guided by a clean, compressed memory of the past and a small, denoised future window. By tackling drift at its roots and compressing both tokens and steps, a 14B model reaches ~19.5 FPS on one H100 while sustaining minute-scale quality and coherence. It does so without common anti-drift heuristics or standard accelerators, and it trains with practical memory strategies.
Main achievement: Unifying T2V/I2V/V2V in a single, autoregressive diffusion transformer that maintains high quality over long horizons while running in real time, achieved via Unified History Injection, Guidance Attention, Easy Anti-Drifting, and Deep Compression Flow, capped by a strong distillation pipeline.
Future directions: Push resolution higher while keeping real-time speed; reduce steps further; develop better, human-aligned video metrics; extend interactive controls (edits, camera paths); and explore multi-camera or 3D-aware versions.
Why remember this: Helios shows that you don't need to choose between big-and-beautiful and small-and-fast: careful design can give you both. Its continuation framing, simple anti-drift training, and smart compression could shape the next generation of real-time, long-horizon visual models for games, education, films, and live creativity.
Practical Applications
- Live prototyping for filmmakers: generate preview shots and transitions in real time during pitching or blocking.
- Interactive game scenes: synthesize environment details and NPC background motion on the fly as players move.
- Education videos: create minute-long science or history explainers with consistent visuals that update as teachers adjust prompts.
- Marketing and social content: turn a product photo into a polished, minute-long story ad in real time.
- Virtual events and streaming: generate dynamic backgrounds or theme changes live without pre-rendering.
- Storyboarding from text: quickly explore multiple narrative directions by continuing or editing scenes on demand.
- Photo-to-clip makeovers: extend a single image into a coherent video sequence for reels or shorts.
- Video-to-video restyling: continue a user's clip in a new style while keeping identity and colors consistent.
- Interactive art installations: audience prompts smoothly morph ongoing visuals without flicker or resets.
- Rapid simulation previews: create long, stable visualization sequences for robotics or planning demos.