Causal Motion Diffusion Models for Autoregressive Motion Generation
Key Summary
- The paper introduces CMDM, a new way to generate human motions that stay smooth over time and match the meaning of a text prompt.
- Older diffusion models looked at the whole motion clip at once, which broke the rule of time order and made live streaming impossible.
- Older autoregressive models generated frame by frame but often drifted off course over long clips, causing wobbles and mistakes.
- CMDM combines the best of both: diffusion's high quality with strict "past-to-future" causality for stability and real-time use.
- A special causal VAE (MAC-VAE) learns motion features that are aligned with language and depend only on the past, not the future.
- A Causal Diffusion Transformer then denoises frames in order, using a training trick called causal diffusion forcing (different noise per frame).
- During sampling, a frame-wise schedule with "causal uncertainty" lets each new frame use partially cleaned-up past frames, cutting latency substantially.
- On HumanML3D and SnapMoGen, CMDM achieves better text–motion matching and smoother motion than strong baselines, while being much faster.
- It supports text-to-motion, long-horizon generation, and streaming at interactive speeds (up to 125 fps with the new sampling schedule).
- This approach opens doors for live avatars, games, and animation tools that need fast, realistic, and controllable human motion.
Why This Research Matters
Real-time, text-controlled motion helps power live avatars, virtual assistants, and game characters that need to move naturally on command. Faster, smoother generation reduces the lag you feel when directing a character, making interactions feel immediate and believable. Strong text–motion alignment means your instructions, like which arm to raise or which way to turn, are followed precisely. Long-horizon stability allows entire scenes to be created from a sequence of captions, saving time for animators and small studios. Because CMDM is lighter and faster, it fits better on practical hardware, opening the door to more accessible creative tools. These gains also support education, telepresence, and fitness coaching, where accurate, responsive motion matters.
Detailed Explanation
01 Background & Problem Definition
You know how when you watch a dance, each move flows into the next, and it wouldn't make sense to plan a jump by using information from a step that hasn't happened yet? Computers that generate human motion face the same rule: time moves forward, so decisions should too. Before this paper, most of the best-looking motion came from diffusion models. They worked by cleaning noise from an entire motion sequence all at once, using information from both past and future frames. That helped quality, but it broke causality (the idea that the present should not peek at the future) and made real-time or streaming generation very hard.
At the same time, another family called autoregressive models played by the rules of time: they predicted the next frame only from what happened before. These were great for online or streaming uses (like a live avatar), but they often ran into trouble on long sequences. Tiny mistakes would snowball (a problem called exposure bias), causing feet to slide, poses to wobble, or the model to slowly drift away from the text description.
People tried patches. Some used vector quantization (turning motion into tokens) and predicted tokens one by one. This gave strong sequence modeling but added discretization errors and still suffered from drift. Others brought diffusion into an autoregressive wrapper, but large diffusion heads and teacher forcing during training made inference slow and sometimes unstable. Meanwhile, purely diffusion-based motion systems kept excelling in visual quality but couldn't be used for streaming because they needed to denoise the entire clip together with bidirectional attention.
The missing piece was a way to keep the strengths of diffusion (high fidelity, diversity, robustness) while respecting the arrow of time so that frames could be produced sequentially, quickly, and stably. Another gap was semantic grounding: making sure the motion truly follows the text at a fine-grained level (like which arm waves, or which direction someone walks). Without strong motion–language alignment, models might look smooth but ignore parts of the instruction.
This paper fills those gaps by building a pipeline that is causal end-to-end and semantically aligned. First, it learns a compact, meaningful, and causal motion representation (a latent space) that lines up with language. Then, inside this space, it trains a diffusion transformer that only looks backward in time (no peeking ahead). Finally, it speeds up generation with a frame-wise sampling schedule that lets each new frame be guided by partially cleaned-up past frames instead of waiting for fully finished ones.
Why should you care? Think of live-streaming VTubers, game NPCs that follow your typed commands instantly, or animation tools that let artists rough out a scene from text and refine it in real time. Causality means you can generate as you go. Diffusion means it looks good. Alignment means the motion actually does what the words say. The stakes are practical: lower latency, higher quality, and better control for everyday applications from games and film to telepresence and fitness coaching.
02 Core Idea
Aha! Moment in one sentence: Make diffusion work in a strictly past-to-future way inside a language-aligned motion latent space, and speed it up by letting future frames learn from partially cleaned-up past frames.
Multiple analogies:
- Choreographer's notebook: Instead of editing the whole dance at once, the choreographer writes moves page by page, only using what's already performed, and lightly revises upcoming moves as earlier ones get clearer.
- Polishing an assembly line: Each part (frame) gets polished in order. Later parts start getting a light polish based on the shine of earlier parts, then receive deeper polishing as more context arrives.
- Walkie-talkie with static: You hear messages over time with different amounts of static. You clean up what you heard so far and use that cleaned-up history to better understand and produce the next message.
Before vs. After:
- Before: Diffusion gave great quality but cheated by looking at the future; autoregression respected time but drifted and was slow for long runs.
- After: CMDM keeps diffusion quality while obeying causality, trains on per-frame noise so it's robust to different "clarity levels," and uses a smart sampling schedule that cuts latency for streaming.
Why it works (intuition, no equations):
- If every frame gets its own noise level during training, the model learns to handle a timeline where some parts are clear and some are fuzzy, just like real generation where recent frames are clean and future ones are not. This makes the model naturally good at stepping forward in time.
- Causal attention means no cheating: a frame can only read the past. This prevents information leaks and supports real-time rollout.
- Doing all this in a compact, language-aligned latent space makes the task simpler: fewer numbers to juggle and stronger guidance from text.
- The frame-wise sampling schedule lets you reuse partially cleaned frames as helpful hints, which reduces the number of full denoise passes you need.
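To make the "no cheating" point concrete, here is a minimal NumPy sketch of a causal attention step. It is illustrative only, not the paper's implementation; the shapes, scaling, and single-head setup are assumptions. The key line forces every attention weight on a future position to zero.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head attention where frame t may only attend to frames 0..t."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                       # (T, T) pairwise scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores[future] = -np.inf                            # no peeking at the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
out, w = causal_attention(q, k, v)
```

After the softmax, every row of `w` is zero above the diagonal, so the output for frame t is a mixture of frames 0..t only, which is exactly what makes real-time rollout possible.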
Building blocks (Sandwich explanations in dependency order):
Hook: Imagine telling a friend a story, one sentence at a time. You can't use sentences you haven't said yet to decide the next line. Causal Autoregression: it's a way to predict the next frame only from earlier frames.
- How it works: (1) Read past frames; (2) Predict the next one; (3) Append it and repeat.
- Why it matters: Without it, you can't stream or respond in real time, and you might accidentally use future info. Anchor: A live avatar moves frame by frame based on what it just did, not what it will do later.
Hook: Songs feel good when each note follows naturally from the last. Temporal Coherence: it's the smooth flow that keeps motions from jumping or jittering.
- How it works: (1) Keep consistent speed and pose changes; (2) Avoid sudden flips; (3) Maintain direction.
- Why it matters: Without coherence, animations look robotic or broken. Anchor: A "walk forward" should not suddenly flip to "walk backward" unless the text says so.
Hook: You know how translators turn a book into a movie script so the scenes match the words? Motion-Language-Aligned Causal VAE (MAC-VAE): it's a compressor that turns motion into a smaller code aligned with text, using only the past.
- How it works: (1) Encode frames causally; (2) Align codes with a motion–language model; (3) Decode back to motion.
- Why it matters: Without this, the text might say "wave left hand," but the motion might wave the right. Anchor: The phrase "sit down on a chair" lines up with a motion code that actually bends knees and lowers the hips.
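As a toy illustration of "encode frames causally," the sketch below shows a causal 1D convolution, the kind of building block the MAC-VAE encoder is described as using. The kernel, zero left-padding, and single-channel signal are hypothetical simplifications of the real multi-channel encoder:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1D convolution where output[t] depends only on x[:t+1].

    Left-padding with zeros (never right-padding) is what makes it causal:
    each output sees the current sample and the k-1 samples before it,
    and nothing from the future.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    # output[t] = sum_i kernel[i] * x[t - i]
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

x = np.arange(6, dtype=float)                       # toy 1-D "motion" signal
avg = np.array([0.5, 0.5])                          # average current + previous frame
y = causal_conv1d(x, avg)

# Changing a future sample never changes earlier outputs:
x_future = x.copy()
x_future[-1] = 100.0
y_future = causal_conv1d(x_future, avg)
```

Here `y[:-1]` and `y_future[:-1]` are identical even though the last input sample changed, which is the streaming guarantee the text describes.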
Hook: In class, you can only use notes from earlier lectures to answer today's quiz. Causal Diffusion Transformer (Causal-DiT): it's a transformer that cleans noisy motion codes frame by frame, reading only the past.
- How it works: (1) Mask out future frames; (2) Use text as guidance; (3) Clean each frameâs noise in order.
- Why it matters: Without causal masks, the model peeks ahead and can't run live. Anchor: To draw frame t, it reads frames 1..t, not t+1.
Hook: A coach gives different amounts of challenge to different students so all improve. Diffusion Forcing: it trains with a different noise level per frame so the model learns to denoise in time order.
- How it works: (1) Give each frame its own noise; (2) Force the model to clean it using only earlier frames; (3) Repeat across times.
- Why it matters: Without per-frame noise, the model isn't ready for the messy, uneven clarity of real-time generation. Anchor: Early frames might be mostly clean; future frames are fuzzier. The model learns to handle both.
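The per-frame-noise idea can be sketched in a few lines. This is an illustration of the forward (noising) process under diffusion forcing, written in standard DDPM notation; the schedule values and latent shapes are made up for the example, not taken from the paper:

```python
import numpy as np

def diffusion_forcing_noise(latents, alphas_bar, rng):
    """Noise each frame independently: the heart of diffusion forcing.

    latents:    (T, d) clean per-frame motion codes.
    alphas_bar: (S,) cumulative signal-retention schedule (near 1 -> almost
                clean, near 0 -> almost pure noise), as in DDPM notation.
    """
    T, d = latents.shape
    t = rng.integers(0, len(alphas_bar), size=T)   # a different timestep per frame
    a = alphas_bar[t][:, None]                     # (T, 1) per-frame signal fraction
    eps = rng.normal(size=(T, d))                  # the noise the model must predict
    noised = np.sqrt(a) * latents + np.sqrt(1.0 - a) * eps
    return noised, t, eps

rng = np.random.default_rng(0)
alphas_bar = np.linspace(0.999, 0.01, 100)         # toy schedule, an assumption
z = rng.normal(size=(8, 4))                        # 8 frames of 4-d latents
z_noisy, timesteps, eps = diffusion_forcing_noise(z, alphas_bar, rng)
```

Because each frame draws its own timestep, a single training sequence mixes nearly-clean and very-noisy frames, which is exactly the "uneven clarity" the model must handle at inference.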
Hook: Think of CMDM as a team where each player knows their role and follows the whistle of time. CMDM: it's the full pipeline that combines causal latent codes, a causal diffusion transformer, and frame-wise sampling.
- How it works: (1) MAC-VAE makes aligned causal codes; (2) Causal-DiT denoises in order; (3) FSS speeds up streaming.
- Why it matters: Without this combo, you get either great quality without streaming or streaming without stability. Anchor: Type "walk forward and wave," and your character does it smoothly, live.
Hook: When filming a dance, you don't wait for the whole dance to be perfect before deciding the next shot. Frame-Wise Sampling Schedule (FSS): it decides how clean each frame should be at each step, letting new frames start from partly cleaned history.
- How it works: (1) Keep past frames low-noise; (2) Keep future frames higher-noise; (3) Gradually reduce noise in order.
- Why it matters: Without FSS, you waste time fully cleaning every frame before moving on, slowing streaming. Anchor: While finishing frame t, you've already started gently shaping frame t+1 using the polished context of earlier frames.
03 Methodology
At a high level: Text → MAC-VAE (causal, language-aligned latent codes) → Causal-DiT with causal diffusion forcing (train) → frame-wise sampling schedule (inference) → motion decoder → final motion.
Step A: Build a causal, language-aligned latent space (MAC-VAE)
- What happens: Motion sequences are encoded with causal 1D conv/ResNet blocks so each latent depends only on x_≤t. The latents are aligned to a motion–language model (Part-TMR) using two losses: one pushes matching features closer (cosine), the other matches their internal geometry (distance matrix). The decoder is also causal, so reconstruction can stream.
- Why this step exists: Raw motion is big and noisy. A compact, meaningful, and causal code makes it easier to generate long, smooth motions that follow text. Without alignment, the motion may ignore fine-grained instructions.
- Example: For "a person sits on a sofa," the latent code gradually lowers the pelvis and bends knees over time; the alignment ensures this code matches the text concept of "sit."
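The two alignment losses described above might look roughly like this in code. The exact loss forms used with Part-TMR are not spelled out here, so treat both functions as plausible sketches of a "cosine" term and a "distance matrix" term rather than the paper's definitions:

```python
import numpy as np

def cosine_align_loss(motion_feats, text_feats):
    """Pull each motion feature toward its matching language feature (1 - cosine)."""
    m = motion_feats / np.linalg.norm(motion_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(m * t, axis=-1)))

def distance_matrix_loss(motion_feats, text_feats):
    """Match internal geometry: the two pairwise-distance matrices should agree."""
    dm = np.linalg.norm(motion_feats[:, None] - motion_feats[None, :], axis=-1)
    dt = np.linalg.norm(text_feats[:, None] - text_feats[None, :], axis=-1)
    return float(np.mean((dm - dt) ** 2))

rng = np.random.default_rng(1)
f = rng.normal(size=(4, 16))                 # 4 paired motion/text features
loss_cos = cosine_align_loss(f, f)           # identical features -> ~0
loss_geo = distance_matrix_loss(f, f)        # identical geometry -> 0
```

The first term aligns each motion code with its caption individually; the second keeps the *relationships* between codes consistent with the relationships between captions, which is what "matches their internal geometry" refers to.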
Step B: Teach the model to clean noise in time order (Causal Diffusion Forcing + Causal-DiT)
- What happens: During training, every frame gets its own noise level. The Causal-DiT uses a causal mask (only past frames are visible), text cross-attention (DistilBERT embeddings), AdaLN (to inject the per-frame timestep/noise info), and ROPE (stable positional encoding) to predict and remove the noise. The loss asks the model to correctly predict the added noise per frame.
- Why this step exists: Standard diffusion cleans the whole clip at the same noise level, which breaks causality. Per-frame noise teaches the model how to denoise while moving forward in time, so it's robust when earlier frames are clean and later ones are not.
- Example: In "walk, then wave," early walking frames might be less noisy, while the upcoming wave frames are noisier. The model learns to clean walking first and prepare to clean the wave later, using only past context.
Step C: Generate quickly and stably (Frame-Wise Sampling Schedule, FSS)
- What happens: At inference, you assign lower noise to past frames and higher noise to future ones. As you denoise frame t, you've already started denoising frame t+1 a bit, guided by the partially cleaned history. This cuts the total steps needed for each new frame.
- Why this step exists: Fully finishing one frame before touching the next wastes time and can accumulate exposure bias. Reusing partially cleaned past frames as context speeds up generation and keeps transitions smooth.
- Example with pretend numbers: Suppose each frame normally needs 50 steps. With FSS and uncertainty scale L=2, frame t+1 starts getting cleaned when frame t is at step 48. That way, by the time you finish t, t+1 is already partway done.
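Continuing the pretend numbers, a toy accounting of how overlapping frames reduces total denoising work might look like this. The arithmetic illustrates only the staggering idea; the paper's actual schedule also reduces how many steps each frame needs, which this sketch does not model:

```python
def fss_total_steps(num_frames, steps_per_frame, overlap):
    """Total denoising steps when each frame starts `overlap` steps before
    its predecessor finishes (overlap stands in for the uncertainty scale L).

    Sequential: finish each frame completely before starting the next.
    Staggered:  each new frame adds only (steps_per_frame - overlap) steps
                of wall-clock work beyond the previous one.
    """
    sequential = num_frames * steps_per_frame
    staggered = steps_per_frame + (num_frames - 1) * (steps_per_frame - overlap)
    return sequential, staggered

# Pretend numbers from the text: 50 steps per frame, L = 2.
seq, stag = fss_total_steps(num_frames=10, steps_per_frame=50, overlap=2)
```

With a deeper overlap the savings grow quickly: `fss_total_steps(10, 50, 40)` gives 500 sequential versus 140 staggered steps, which is the pipelining intuition behind the latency gains reported later.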
Secret sauce: Three pieces click together.
- A causal, language-aligned latent space keeps the problem compact and faithful to the text.
- A causal diffusion transformer learns to move forward in time without peeking, making it streamable and stable.
- A frame-wise sampling schedule reuses partial progress to slash latency while keeping motions smooth.
Concrete recipe summary:
- Train MAC-VAE: Encode motions causally, align with Part-TMR features, decode causally.
- Train Causal-DiT with diffusion forcing: Add different noise to each frame; predict and remove it using only past frames + text.
- Inference for streaming: Start from noise; denoise frames with FSS so each next frame benefits from the polished context of its predecessors; decode to get 3D motion.
What breaks without each step:
- Without MAC-VAE: Heavier data, weaker text grounding; more drift and mismatches.
- Without causal masks: Future leakage; can't stream; training and testing don't match.
- Without per-frame noise: Model unprepared for uneven clarity; less robust over long horizons.
- Without FSS: Higher latency and more cumulative errors in long sequences.
Mini walkthrough:
- Input: "The person walks forward, then waves with the left hand."
- Encode: MAC-VAE makes causal codes that align "walk" then "left-hand wave."
- Train: Causal-DiT sees walk frames with mild noise and wave frames with stronger noise, learning to denoise in order using text.
- Infer: FSS starts gently denoising the wave before walk is 100% finished, giving a smooth, timely transition from walking to waving at low latency.
04 Experiments & Results
The test: What did they measure and why?
- Text–motion alignment: R-Precision (how often the right text ranks on top) and CLIP-Score (text–motion similarity). We want higher.
- Realism: FID (lower is better) compares generated motions to real ones.
- Diversity: Multi-modality (higher is better) checks variation from the same prompt.
- Long-horizon smoothness: Transition metrics like Peak Jerk (PJ) and Area Under Jerk (AUJ) (lower is smoother).
- Speed: Frames per second (fps) and per-token latency (lower is faster) for practical use.
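Of these metrics, the jerk-based ones are easy to sketch. Below is one plausible way to compute Peak Jerk and Area Under Jerk from a joint-position trajectory with finite differences; the exact discretization and units in the paper's evaluation may differ:

```python
import numpy as np

def jerk_metrics(positions, dt=1.0):
    """Peak Jerk (PJ) and Area Under Jerk (AUJ) from a trajectory.

    positions: (T, J, 3) joint positions over T frames. Jerk is the third
    time derivative of position; lower values mean smoother motion.
    """
    jerk = np.diff(positions, n=3, axis=0) / dt**3   # (T-3, J, 3) third differences
    mag = np.linalg.norm(jerk, axis=-1)              # (T-3, J) jerk magnitude
    return float(mag.max()), float(mag.sum() * dt)   # PJ, AUJ

t = np.linspace(0.0, 1.0, 100)
smooth = np.sin(2 * np.pi * t)[:, None, None] * np.ones((1, 1, 3))
jittery = smooth + 0.05 * np.random.default_rng(0).normal(size=smooth.shape)
pj_s, auj_s = jerk_metrics(smooth)
pj_j, auj_j = jerk_metrics(jittery)
```

Adding even small per-frame noise inflates both metrics sharply, which is why PJ and AUJ are sensitive detectors of wobble at the transitions between sub-motions.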
The competition: Strong baselines from three camps.
- VQ-based (e.g., T2M-GPT, MoMask, MMM): Treat motion like tokens.
- Diffusion-based (e.g., MDM, MLD, MotionLCM, SALAD, StableMoFusion, EnergyMoGen): Great fidelity but typically non-causal.
- Autoregressive (e.g., MARDM, MotionStreamer): Causal but can drift and be slow.
The scoreboard with context:
- HumanML3D: CMDM with frame-wise sampling (FSS) gets R-Precision Top-1/2/3 = 0.588/0.778/0.860, FID = 0.068 (second best), CLIP-Score = 0.685 (best). Think of R-Top1 = 0.588 like beating other top students in matching the right caption to the motion; FID 0.068 is like getting a cleanliness score near the best in class; highest CLIP-Score means the motion follows the words most closely.
- SnapMoGen: CMDM w/ FSS is again best overall: R-Top1/2/3 = 0.831/0.926/0.958, FID = 14.451 (lowest among compared), CLIP-Score = 0.702. This shows strong generalization to more expressive, long activities.
- Long-horizon generation: Against FlowMDM (composition method) and MARDM (AR baseline), CMDM achieves better subsequence quality and smoother transitions on HumanML3D and SnapMoGen, with much lower AUJ/PJ on HumanML3D and competitive smoothness on SnapMoGen while avoiding static or frozen outputs reported for FlowMDM.
Speed and efficiency:
- Model size and fps: CMDM (about 114M params total) runs at 28 fps with standard AR and up to 125 fps with FSS, versus MARDM (310M, 20 fps) and MotionStreamer (318M, 11 fps). Faster and lighter makes it fit better for interactive systems.
- Latency per token (4 frames) on an A100: with the frame-wise schedule, CMDM needs only about 30 ms per token after the first, a large speedup for streaming. This is the practical impact of reusing partially denoised context.
Surprising findings:
- Word-level text encoders (DistilBERT) beat sentence-level ones in this causal setup; local token cues help guide frame-by-frame generation.
- The frame-wise schedule's uncertainty scale L trades speed against smoothness; a moderate L often gives the best transitions.
- Causal components like AdaLN and ROPE contribute noticeably to long-horizon stability; removing them hurts FID and smoothness.
Big picture: CMDM tops or ties leaders in alignment and smoothness while slashing latency, demonstrating that you donât have to choose between diffusion quality and causal, real-time generation.
05 Discussion & Limitations
Limitations:
- Depends on motion–language alignment quality: If the text is abstract or ambiguous, the learned alignment (e.g., from Part-TMR) might misguide motion details (like which arm or direction).
- Very long sequences: Even with FSS, tiny artifacts can accumulate over many minutes; occasional re-anchoring or feedback could help.
- Single-person focus: The current system doesn't yet handle multi-person interactions or complex scene constraints.
- Data/domain gaps: Unseen actions or rare styles might reduce realism without additional training data.
Required resources:
- Training uses GPU(s) like an NVIDIA A100 and standard deep-learning stacks (Transformers, VAE). MAC-VAE plus Causal-DiT totals about 114M parameters in the reported setup. Text encoders (DistilBERT) and alignment backbones (Part-TMR) are pretrained components.
When not to use:
- Multi-character choreography or close-contact interactions where spatial relationships are critical (not yet supported).
- Highly abstract prompts with unclear body semantics ("move like a dream"), where alignment may struggle.
- Ultra-high-precision biomechanical analysis (medical-grade kinematics) without extra constraints and validation.
Open questions:
- Can we add online feedback (e.g., physics or contact checks) to auto-correct drift during very long scenes?
- How to scale to multi-person, human–object, and environment-aware motion while keeping latency low?
- Can stronger or broader motion–language pretraining reduce ambiguity and improve rare actions?
- What are the best schedules (L, K) across devices to balance speed and smoothness in real products?
- How to integrate user-in-the-loop editing (e.g., draw a path or key poses) while preserving causal diffusion's stability?
06 Conclusion & Future Work
Three-sentence summary:
- CMDM is a causal motion diffusion framework that generates text-driven human motion one moment at a time, keeping diffusion's high quality while respecting the arrow of time.
- It learns a language-aligned causal latent space (MAC-VAE), denoises with a Causal-DiT trained by causal diffusion forcing, and speeds up streaming with a frame-wise sampling schedule.
- Experiments on HumanML3D and SnapMoGen show state-of-the-art alignment and smoothness, plus big latency gains for real-time use.
Main achievement:
- Proving that diffusion and strict causality can be unified to deliver high-fidelity, low-latency, long-horizon motion that follows text closely and runs fast enough for interactive applications.
Future directions:
- Add feedback controllers and re-anchoring to push stability over multi-minute scenes; expand to multi-person and object-aware motion; strengthen motion–language pretraining and editing tools for creators.
Why remember this:
- CMDM shows you don't have to pick between beauty (diffusion quality) and discipline (causality). By teaching diffusion to think in time, and by reusing partial progress cleverly, we get motions that look right, feel smooth, follow the words, and arrive quickly enough for the real world.
Practical Applications
- Live streaming avatars that act out your typed or spoken commands instantly.
- Game NPCs that follow designer prompts to perform context-aware actions in real time.
- Rapid animation prototyping from text for indie creators and small studios.
- VR fitness or dance coaches that demonstrate motions smoothly and accurately on the fly.
- Telepresence characters that mirror intended gestures with low latency.
- Interactive storytelling where characters perform multi-step instructions over long scenes.
- Robotics simulation of human motions that needs causal, frame-by-frame planning (offline).
- Pre-visualization for film: generate long, caption-driven blocking that remains coherent.
- Educational tools that illustrate physical actions from descriptions (e.g., sports drills).
- Motion editing pipelines that start from text and then refine specific frames or transitions.