MIBURI: Towards Expressive Interactive Gesture Synthesis

Intermediate
M. Hamza Mughal, Rishabh Dabral, Vera Demberg et al. · 3/3/2026
arXiv

Key Summary

  • MIBURI is a system that makes a talking digital character move its body and face expressively in real time while it speaks.
  • It listens to a live stream of speech-and-text tokens from a voice-capable language model (Moshi) and turns them directly into synchronized full-body gestures.
  • The motions are split into body parts (face, upper body with hands, lower body) and packed into tiny tokens so they can be generated fast.
  • Two cooperating transformers handle timing (when a gesture happens) and body details (what each part does) so the motions feel natural but stay low-latency.
  • Extra training goals encourage variety and emotion so the character does not freeze into one average pose.
  • The system is causal (uses only the past, not the future) and still keeps up with live conversation, achieving around 36 ms per frame on an RTX 3090.
  • In studies, people preferred MIBURI’s gestures over strong real-time and offline baselines for naturalness and appropriateness in many cases.
  • By tapping directly into Moshi’s internal tokens, MIBURI avoids slow steps used by older pipelines and responds more fluidly.
  • This work moves embodied conversational agents closer to truly human-like back-and-forth interaction with speech, face, and body working together.

Why This Research Matters

Real conversations are more than words—hands, face, and body carry meaning, emotion, and rhythm. MIBURI brings that missing channel to AI, so digital helpers, tutors, and companions feel more present and understandable. Because it runs causally and in real time, it can keep up with you as you speak, not after you finish. By plugging directly into a live speech-text model, it stays synced without heavy, slow pipelines. This makes embodied AI more practical in classrooms, customer support, healthcare coaching, and AR/VR. Over time, such systems can improve trust, engagement, and learning by communicating the way people naturally do.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how talking to a friend isn’t just about words? Their hands move, their face reacts, and their whole body helps the story land. Computers that talk today mostly skip that part.

🥬 The concept: Embodied Conversational Agents (ECAs) are computer characters that not only speak but also move their face and body like people do. They try to communicate with voice, gestures, and expressions together so conversations feel natural.

How it works (before this paper):

  1. Many chat systems used only text or voice. No full-body movement.
  2. Some older systems stitched together preset animations (like flipping through a small dance-card). Fast but stiff.
  3. Newer research learned gestures from lots of data and made beautiful motions. But they usually needed a peek into the future speech to work well, so they ran slowly and couldn’t respond live.

Why it matters: Without gestures, conversations feel robotic. Hands and face add meaning, rhythm, and emotion—helping listeners understand and stay engaged.

🍞 Anchor: Imagine a voice assistant that nods, smiles, and uses its hands while explaining a recipe. You’d track the steps more easily than with a flat voice alone.

The world before:

  • Text and voice AIs grew powerful at language, but their bodies were missing. People tried two paths:
    • Rule-based: If the sentence says “big,” raise hands wide. These ran in real time but looked repetitive.
    • Generative models: Learned rich, lifelike gestures from data (diffusion, masked transformers). These looked great offline but needed future audio and lots of compute—too slow for live chats.

The problem:

  • We want an agent that can talk and move at the same time, naturally, with minimal delay. But expressive models often depend on future context; fast models often look dull.

Failed attempts:

  • Heavy pipelines: LLM → text → TTS audio → external audio embedding → gesture model. This adds latency at each step.
  • Faster models that still peek at the future: They’re low-latency per step, but not truly live because they wait for future windows to decide motions.
  • Seed-gesture hacks: Start generation with a hand-picked snippet to stabilize motion. Works in demos, not robust for open conversation.

The gap:

  • A system that is at once causal (only past info), real-time (low-latency), and expressive (rich, varied gestures), all synchronized with live speech.

🍞 Hook: Imagine a chef cooking orders as they arrive—no menu planning hours ahead—and still plating dishes that look and taste amazing. That’s the challenge here.

🥬 The concept: A causal framework is a system that decides the next move using only what already happened, not what happens in the future. Real-time adds a strict ‘do it fast now’ constraint on top.

Why it matters: If the agent waits, conversations feel awkward. If it rushes with low-quality motions, it feels fake. We want both speed and expression.

🍞 Anchor: A live news anchor can’t stop mid-sentence to plan gestures for the next minute. They move as they speak, in the moment, but still look natural.

Real stakes in daily life:

  • Better tutoring: A virtual teacher pointing, nodding, and gesturing rhythm can keep students engaged.
  • Customer help: A friendly avatar that gestures can calm nerves and convey empathy.
  • Accessibility: Clear gestures support speech comprehension for people who rely on visual cues.
  • Social robotics and AR: Human-like motion builds trust and smoother cooperation.

This paper’s promise:

  • MIBURI plugs directly into a speech-text model’s internal stream (Moshi), skipping extra slow steps.
  • It breaks gestures into tiny, body-aware tokens and predicts them with two small, smart transformers—one for timing, one for part-level details.
  • Extra training goals keep the motions lively and avoid ‘average pose’ freezing.
  • Result: live, synchronized, expressive gestures that people prefer over several strong baselines.

02Core Idea

🍞 Hook: You know how musicians in a band lock into the same groove—drums keep the beat while guitar adds the flavor? Each does a different job at the same time so the song feels alive.

🥬 The concept (aha in one sentence): MIBURI makes gestures live and expressive by splitting the job into two fast steps—first decide when to move (timing), then fill in what each body part does (details)—all directly guided by a live stream of speech-and-text tokens from Moshi.

How it works (overview):

  1. Tap into Moshi’s live tokens that already mix sound and meaning.
  2. Turn body motion into tiny, part-specific tokens using body-part aware codecs (upper body/hands, lower body, face).
  3. Use one transformer to predict the next time-slice token (temporal), then another to fill in per-part details (kinematic).
  4. Add training helpers so the agent doesn’t collapse into static, bland poses.

Why it matters: If you don’t separate timing from details, you need a big, slow model with a long memory. Splitting makes it lean, quick, and expressive.

🍞 Anchor: Think of a choreographer (timing) and a coach for each dancer’s style (details). Together they produce a smooth show, even when the music changes live.

Three analogies for the same idea:

  • Orchestra: The conductor keeps time (temporal transformer), while each section (strings, brass, percussion) plays its parts with nuance (kinematic transformer).
  • Comics: First, you sketch the panels’ layout and pacing (temporal). Then you ink each character’s pose and facial expression (kinematic).
  • Cooking show: The host sets the pace of steps (temporal), and the assistants season each dish (kinematic), keeping flavor without slowing the show.

Before vs. after:

  • Before: Pipelines chained many steps (LLM → TTS → audio features → gesture model), adding delay. Expressive offline methods waited for future context; real-time ones looked stiff.
  • After: MIBURI plugs straight into Moshi’s internal tokens, skips extra encoding, and uses a two-step, two-transformer design to stay fast yet lively—no future peek needed.

Why it works (intuition, no equations):

  • Direct conditioning: Moshi already encodes what is said and how it sounds. Using those tokens directly gives precise timing and meaning without re-encoding audio.
  • Factorization: Timing and per-part detail are entangled but different tasks. Decoupling them lowers the attention window and speeds inference while preserving structure.
  • Hierarchy in tokens: Residual vector quantization stacks coarse-to-fine motion codes, so the model can add detail in stages without losing speed.
  • Expressiveness helpers: Contrastive signals and a ‘speaking vs. listening’ head push the model to vary motion appropriately and avoid freezing.

Building blocks (each with a mini sandwich):

  • Moshi token stream
    • Hook: Imagine getting live subtitles plus tone-of-voice cues for every moment.
    • Concept: Moshi emits synchronized speech-and-text embeddings as it talks and listens, in tiny time steps.
    • Why it matters: No extra detours; MIBURI rides this stream to stay in sync.
    • Anchor: Like following a karaoke line that also tells you when to nod.
  • Body-part aware gesture codecs
    • Hook: When you watch someone, your brain separately tracks their hands, feet, and face.
    • Concept: The system encodes motion for upper body/hands, lower body, and face into compact tokens, preserving each part’s style.
    • Why it matters: Different parts move at different scales and tempos; separating them keeps details crisp and fast.
    • Anchor: Upper body may punch out beats; feet stay subtle; face reacts instantly.
  • Residual vector quantization (RVQ)
    • Hook: Think of drawing: first a rough sketch, then layers of detail.
    • Concept: RVQ compresses motion into multiple stacked codebooks, each adding finer detail.
    • Why it matters: Keeps small, fast tokens while recovering subtle motion like finger and eyebrow twitches.
    • Anchor: Start with a basic arm swing, add wrist flick, then fingers.
  • Two-dimensional transformers (temporal + kinematic)
    • Hook: One friend keeps time; another adds style.
    • Concept: Temporal transformer picks the next time-slice token; kinematic transformer fills in per-part levels at that same time.
    • Why it matters: Limits context and speeds up, yet respects both timing and body structure.
    • Anchor: Decide to nod now (temporal), then choose head angle and eyebrow shape (kinematic).
  • Auxiliary objectives
    • Hook: Practice scales to play jazz solos well.
    • Concept: Contrastive loss (via Gumbel-Softmax) pushes variety; a voice-activity head learns when to gesture big or stay calm.
    • Why it matters: Prevents boring average poses and wrong-time motions.
    • Anchor: Subtle listening posture vs. animated speaking gestures.

03Methodology

At a high level: Live speech/text tokens → temporal transformer (when to move) → kinematic transformer (what each part does) → gesture tokens → decoder to full-body motion.

Step 0: Stream the right input (Moshi tokens)

  • What happens: As the agent and user talk, Moshi outputs tiny, time-aligned speech and text embeddings that capture meaning and prosody.
  • Why it exists: This skips building extra audio features (which would slow us down) and keeps everything synchronized.
  • Example: In 0.08 seconds, Moshi emits a new token; MIBURI uses it to plan the next 2 motion frames.

Step 1: Encode motion as body-part tokens (gesture codecs)

  • What happens: Past motion data is learned as discrete tokens using separate codecs for upper body/hands, lower body (plus contacts), and face. Each codec builds a stack of coarse-to-fine motion codes (RVQ) across tiny time windows (2 frames) to keep latency low.
  • Why it exists: Body parts move differently and carry different speech links. Splitting them preserves fine details while keeping tokens small and fast to predict.
  • Example: A wave is captured as a few upper-body tokens (arm sweep), plus face tokens (smile), with little change in lower body.
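The coarse-to-fine quantization in this step can be sketched with a toy residual VQ. Everything below (dimensions, codebook sizes, the random codebooks themselves) is an illustrative stand-in for the paper's learned codecs, not its actual configuration:

```python
# Toy residual vector quantization (RVQ): each level encodes what the
# previous levels left over, stacking coarse-to-fine detail.
import random

random.seed(0)
DIM, CODEBOOK_SIZE, NUM_LEVELS = 4, 8, 3  # illustrative sizes only

# Random codebooks stand in for learned ones; one codebook per level.
codebooks = [
    [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for _ in range(NUM_LEVELS)
]

def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])))

def rvq_encode(vec):
    """Quantize level by level; each level encodes the remaining residual."""
    residual, codes = list(vec), []
    for cb in codebooks:
        idx = nearest(residual, cb)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codes):
    """Sum the chosen entries across levels to reconstruct the vector."""
    out = [0.0] * DIM
    for cb, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

motion_feature = [0.3, -0.7, 0.5, 0.1]  # pretend 2-frame motion latent
codes = rvq_encode(motion_feature)
print(codes, rvq_decode(codes))
```

The key property for MIBURI is that each time slice becomes just a handful of small integers, which an autoregressive transformer can predict quickly.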

Step 2: Decide the next time-slice (temporal transformer)

  • What happens: The temporal transformer is causal: it looks only at previously predicted gesture tokens (summed across codebook levels), plus the current and recent Moshi speech/text tokens and the character identity embedding. It predicts the first-level token for the current time step.
  • Why it exists: Someone needs to keep the beat. This picks ‘move now’ vs. ‘wait,’ aligned with speech rhythm, without needing the future.
  • Example: On a stressed syllable, it chooses a token that starts a beat gesture (like a small downward chop) at this exact moment.

Step 3: Fill in per-part details (kinematic transformer)

  • What happens: Holding time fixed, the kinematic transformer autoregressively predicts the next codebook levels for that time step, conditioned on the temporal context (from Step 2), Moshi tokens at this instant, identity, and already-decoded levels.
  • Why it exists: This lets the system add complexity—arm angle, wrist bend, finger splay, eyebrow lift—without expanding the temporal context.
  • Example: It enriches the basic chop with a slight wrist turn and a matching head nod, while keeping feet stable.
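Steps 2 and 3 together amount to a nested loop: time slices on the outside, RVQ levels on the inside. A structural sketch of that loop follows; the two transformer calls are random stubs, and the function names (`temporal_step`, `kinematic_step`) are ours, not the paper's:

```python
# Structural sketch of the two-transformer factorization from Steps 2-3.
import random

random.seed(1)
NUM_LEVELS, VOCAB = 4, 16  # illustrative sizes

def temporal_step(past_tokens, moshi_token, identity):
    """Causal: conditions only on past gesture tokens + current speech token."""
    return random.randrange(VOCAB)  # stand-in for a transformer forward pass

def kinematic_step(level0, lower_levels, moshi_token, identity):
    """Autoregressive over RVQ levels at a fixed time step."""
    return random.randrange(VOCAB)  # stand-in for a transformer forward pass

def generate(moshi_stream, identity="agent"):
    history = []                      # gesture-token history fed to the temporal model
    for moshi_token in moshi_stream:  # one token per 0.08 s time slice
        level0 = temporal_step(history, moshi_token, identity)  # when to move
        levels = [level0]
        for _ in range(1, NUM_LEVELS):                          # what each part does
            levels.append(kinematic_step(level0, levels, moshi_token, identity))
        history.append(levels)
        yield levels                  # decoded to 2 motion frames downstream

tokens = list(generate(range(5)))
print(len(tokens), len(tokens[0]))
```

Because the outer loop never looks ahead in `moshi_stream`, the whole procedure stays causal by construction.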

Step 4: Decode tokens to motion frames

  • What happens: The predicted tokens are passed through the body-part decoders to reconstruct 3D joints (SMPL-X parameters) for body and face.
  • Why it exists: Tokens are compact; decoders turn them back into smooth, high-detail motion.
  • Example: Two new frames (at 25 FPS) appear every 0.08 seconds, in sync with speech.
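The timing numbers quoted in this section are internally consistent, which is worth a quick sanity check: at 25 FPS, a 2-frame token spans exactly Moshi's 0.08 s step, and the reported ~36 ms per frame fits inside the 40 ms per-frame budget that real-time playback allows.

```python
# Sanity-checking the timing stated in Step 4 and the Experiments section.
FPS = 25
FRAMES_PER_TOKEN = 2

token_period_s = FRAMES_PER_TOKEN / FPS   # 0.08 s, matching Moshi's token rate
frame_budget_ms = 1000 / FPS              # 40 ms available per rendered frame
reported_latency_ms = 36                  # per-frame figure reported on an RTX 3090

print(token_period_s, frame_budget_ms, reported_latency_ms < frame_budget_ms)
```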

Step 5: Keep it expressive (auxiliary objectives)

  • What happens: During training, a contrastive objective compares generated token latents with ground-truth latents over segments, encouraging variety and alignment. A small binary head classifies ‘speaking’ vs. ‘listening’ states to reduce phantom gestures while listening.
  • Why it exists: Autoregressive models can drift or collapse to average poses. These extras push diverse, appropriate motion.
  • Example: The agent becomes more animated when speaking passionately, and calmer when listening.
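The 'speaking vs. listening' head described above is, at heart, a small binary classifier. A minimal stand-in with a single logistic unit makes the idea concrete; the weight, bias, and pooled feature here are invented for illustration and bear no relation to the trained head:

```python
# Toy stand-in for a binary voice-activity head: maps a pooled feature
# (pretend it summarizes current speech energy) to P(speaking).
import math

def speaking_prob(pooled_feature, weight=3.0, bias=-1.0):
    # Sigmoid of a linear score; high -> gesture big, low -> stay calm.
    return 1.0 / (1.0 + math.exp(-(weight * pooled_feature + bias)))

print(round(speaking_prob(0.9), 3), round(speaking_prob(0.1), 3))
```

During training, a loss on this head teaches the shared backbone to separate animated speaking motion from subtle listening posture.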

Step 6: Make it real-time (engineering choices)

  • KV-cache: Stores attention keys/values from prior steps so transformers don’t recompute history.
  • Capped context: Self-attention sees a short past (e.g., 25 tokens), while cross-attention to speech/text uses a bit longer window (e.g., 50), balancing quality and speed.
  • Tiny time-chunk: Each token covers only 2 frames to keep latency tiny.
  • Sampling: Use nucleus (top-p) sampling and a gentle temperature to keep gestures varied but stable. Classifier-free guidance boosts alignment with Moshi features. Lower-body cross-attention can be masked to save time since feet relate less to speech.
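Nucleus (top-p) sampling, mentioned in the Sampling bullet, is a standard technique; one common form is sketched below. The `p` and `temperature` values are illustrative, not the paper's settings:

```python
# A common form of nucleus (top-p) sampling with temperature.
import math
import random

random.seed(2)

def top_p_sample(logits, p=0.9, temperature=0.8):
    # Temperature-scale, softmax, keep the smallest set of tokens whose
    # cumulative probability reaches p, then sample within that set.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in kept)
    r, acc = random.uniform(0, mass), 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

print(top_p_sample([2.0, 1.0, 0.5, -1.0]))
```

Truncating to the nucleus removes rare, jittery gesture tokens while the temperature keeps enough randomness to avoid robotic repetition.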

Why each piece matters (what breaks without it):

  • No Moshi tokens: You’d need extra audio encoders, adding latency and risking sync issues.
  • No part-wise codecs: Motions blur; fingers and face lose crisp detail.
  • One big transformer: Context window grows, slowing inference and hurting training stability.
  • No auxiliary losses: Motions flatten into average poses; listening vs. speaking distinctions get fuzzy.
  • No KV-cache or short windows: Latency spikes; the agent lags behind the voice.

Secret sauce:

  • Plugging directly into Moshi’s live, semantically rich token stream.
  • Two-dimensional generation (time first, details second) that matches how humans sense rhythm then refine motion.
  • Token design (RVQ, 2-frame windows) chosen for the real-time constraint rather than just offline scores.
  • Minimal but effective training extras to preserve expressiveness.

Mini sandwich recaps for key tools:

  • KV-cache
    • Hook: Like keeping notes so you don’t reread the whole book every time.
    • Concept: Stores past attention states to skip recomputation.
    • Why it matters: Cuts latency so gestures keep up with speech.
    • Anchor: The agent can nod right on the beat instead of half a second late.
  • Top-p sampling
    • Hook: Picking from the top of a menu of likely moves.
    • Concept: Samples from the most probable gestures without being too predictable.
    • Why it matters: Prevents robotic repetition while staying sensible.
    • Anchor: Sometimes a chop, sometimes a point—always fitting the moment.
  • Classifier-free guidance
    • Hook: A quiet hint that says ‘stay on theme.’
    • Concept: Nudges sampling toward better alignment with speech/text context.
    • Why it matters: Keeps gestures semantically on track.
    • Anchor: Emphasizes ‘huge’ with bigger hands instead of a shrug.
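Classifier-free guidance, as commonly applied to autoregressive token logits, is a one-line blend of conditional and unconditional predictions. This is a generic sketch of that standard recipe (the guidance weight `w` is illustrative), not the paper's exact formulation:

```python
# Classifier-free guidance on logits: push the conditional prediction
# further away from the unconditional one.
def cfg_logits(cond, uncond, w=2.0):
    # w = 1 recovers the conditional logits; w > 1 amplifies the
    # conditioning signal (here, the Moshi speech/text context).
    return [u + w * (c - u) for c, u in zip(cond, uncond)]

print(cfg_logits([1.0, 0.0], [0.5, 0.5], w=2.0))
```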

04Experiments & Results

The test: What did they measure and why?

  • Naturalness (user studies): Do people prefer how it looks and feels?
  • Speech–gesture alignment (BeatAlign): Are gestures timed with voice rhythm?
  • Distribution closeness (Fréchet Gesture Distance, FGD): Do generated motions look statistically like real ones?
  • Diversity (L1 divergence): Does motion vary, or collapse into a statue?
  • Face accuracy (Facial-MSE): Are expressions close to ground truth when predicted?
  • Latency: Is it fast enough for live conversation?
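To make the alignment metric concrete, here is a BeatAlign-style score in miniature: each audio beat is rewarded for having a gesture beat nearby, with a Gaussian falloff. This is our simplified sketch of the metric's flavor (the `sigma` value is illustrative), not the benchmark's exact implementation:

```python
# BeatAlign-style sketch: reward each audio beat whose nearest gesture
# beat is close in time (chamfer-style matching with a Gaussian kernel).
import math

def beat_align(gesture_beats, audio_beats, sigma=0.1):
    if not audio_beats or not gesture_beats:
        return 0.0
    score = 0.0
    for a in audio_beats:
        d = min(abs(a - g) for g in gesture_beats)  # nearest gesture beat
        score += math.exp(-(d ** 2) / (2 * sigma ** 2))
    return score / len(audio_beats)

print(beat_align([0.5, 1.0, 1.5], [0.5, 1.0, 1.5]))  # perfect alignment -> 1.0
```

A score near 1 means gestures land on the voice's rhythm; late or missing beats pull it toward 0.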

The competition: Who did they compare to?

  • Offline expressive baselines: EMAGE, RAG-Gesture, CaMN (often non-causal and/or heavy compute).
  • Real-time baselines: GestureLSM, MambaTalk (but typically non-causal or require seeds). They also built causal versions of some for fairness.

Scoreboard with context (highlights):

  • User study: People significantly preferred MIBURI over EMAGE and GestureLSM for both naturalness and appropriateness in many cases, though ground truth still wins overall (as expected).
  • Multi-speaker setting (23 speakers, BEAT2): MIBURI achieved strong FGD and BeatAlign scores—like getting an A when others are around B/C—while staying real-time and causal. Crucially, causalized versions of other methods dropped in quality, showing they relied on future context.
  • Single-speaker (Scott): MIBURI’s BeatAlign was competitive with top baselines, and adding face improved expressivity; some baselines that use seed sequences had advantages in FGD here, which is reasonable because seeding makes matching easier.
  • Embody3D (dyadic interactions): Finetuned MIBURI outperformed baselines across metrics, suggesting the causal token-based approach transfers well to more natural, two-person conversations.
  • Latency: Around 36 ms per frame in the online demo on an RTX 3090 (includes model + rendering). Compared head-to-head, MIBURI had lower or comparable wall-clock time than real-time baselines and clearly beat diffusion-style systems that need to process full windows before output.

Why these numbers matter (plain meaning):

  • Better BeatAlign ≈ gestures landing on syllable beats and intonation peaks, like a good storyteller’s hands.
  • Lower FGD ≈ motions statistically closer to real human gestures—less uncanny valley.
  • Higher diversity ≈ fewer frozen or average poses—more life.
  • Lower latency ≈ immediate, flowing conversation rather than awkward pauses.

Surprising findings:

  • ‘Bigger isn’t always better’: A leaner MIBURI matched or beat a larger variant under real-time constraints, showing the architecture choice (two-dimensional, RVQ tokens, direct Moshi conditioning) matters more than just scaling.
  • Simple swaps hurt: Replacing Moshi tokens with standard wav2vec audio features made metrics worse and added compute, confirming the value of tapping Moshi directly.
  • One-transformer design underperformed: Merging temporal and kinematic into a single model doubled step time and hurt alignment/diversity, validating the factorization design.

Bottom line: MIBURI consistently keeps pace with live speech, stays causal, and still delivers motions that users and metrics judge as more natural and better-timed than several strong baselines.

05Discussion & Limitations

Limitations (honest look):

  • No partner-body perception: The current model generates only the agent’s motion. It does not yet see and respond to a user’s gestures in a fully bi-directional, physical sense.
  • Causality trade-off: Because it cannot peek into the future, some nuanced, meaning-rich gestures that anticipate words are harder to capture than in offline models.
  • Seedless by design: Great for generality, but some baselines gain short-term smoothness from seed gestures; MIBURI must stabilize from scratch.
  • Lower-body subtlety: Since feet relate less to speech, the system often deprioritizes them for latency. Future versions could add environment-aware stance or stepping.

Required resources:

  • A GPU similar to an RTX 3090 for real-time performance.
  • Integration with a live speech-text model (Moshi or similar) to supply synchronized token streams.
  • Training on multi-speaker gesture datasets (e.g., BEAT2, Embody3D) to generalize well.

When not to use:

  • Long, pre-recorded productions where you can afford offline processing and want the absolute richest, future-aware gestures.
  • Settings that require precise choreography tied to known future beats (e.g., scripted performances) where offline diffusion can shine.
  • Scenarios where the agent must track and mirror a human partner’s full-body motion in detail (not yet supported).

Open questions:

  • Intent modeling: Can we first infer a shared communicative intent and then jointly generate speech and gestures in real time without breaking causality?
  • Dyadic grounding: How to add user-body perception so the agent’s gestures respond to the partner’s gaze, posture, and hand cues?
  • Semantic depth under causality: How far can we push fine-grained, content-rich gestures without future context—are better token designs or small lookahead buffers helpful?
  • Personalization: How to learn a user’s or character’s signature gesture style safely and quickly during a conversation?

06Conclusion & Future Work

Three-sentence summary:

  • MIBURI is a causal, real-time system that turns live speech-and-text tokens into synchronized, expressive full-body gestures and facial expressions.
  • It works by encoding motion into body-part tokens and generating them with two small transformers—one for timing, one for details—guided directly by Moshi’s internal stream.
  • With smart training objectives and efficient engineering, it keeps latency low while beating or matching strong baselines in user studies and metrics.

Main achievement:

  • Showing that you don’t have to choose between ‘fast’ and ‘expressive’: a carefully factored, token-based, Moshi-conditioned design can deliver both at once, live.

Future directions:

  • Add partner-body perception to support fully bidirectional, gesture-aware conversations.
  • Explore intent-first generation so gestures can anticipate speech naturally while remaining real-time.
  • Enrich lower-body and environment interactions (stance shifts, stepping, object-aware motions).

Why remember this:

  • It’s a blueprint for embodied AI that feels present—talking and moving with you, not after you. By splitting timing from details and plugging straight into a live speech-text stream, MIBURI brings avatars closer to the rhythm, meaning, and warmth of human conversation.

Practical Applications

  • Virtual tutors that gesture, point, and nod in sync to keep students engaged and clarify tricky ideas.
  • Customer-service avatars that use open hand gestures and warm expressions to de-escalate stressful situations.
  • Healthcare coaching assistants that model calming breaths and encouraging postures during sessions.
  • AR/VR guides that direct your attention with head turns and hand cues while explaining tasks.
  • Language-learning partners that highlight intonation with beat gestures, improving pronunciation and rhythm.
  • On-device talking characters for accessibility that reinforce speech with clear visual cues.
  • Storytelling companions that use facial expressions and hand movements to make narratives vivid.
  • Training simulations (e.g., retail, hospitality) where trainees interact with lifelike, responsive agents.
  • Interactive museum or kiosk hosts that naturally gesture toward exhibits as they speak.
  • Social robots that maintain fluid back-and-forth with human-like timing and motion.
#co-speech gesture synthesis#embodied conversational agents#causal generation#real-time avatars#speech-text tokens#residual vector quantization#two-dimensional transformers#autoregressive modeling#gesture codecs#KV-cache#top-p sampling#classifier-free guidance#BeatAlign#Fréchet Gesture Distance#Moshi