
SARAH: Spatially Aware Real-time Agentic Humans

Intermediate
Evonne Ng, Siwei Zhang, Zhang Chen et al. · 2/20/2026
arXiv

Key Summary

  • SARAH is a real-time system that makes virtual characters move their whole bodies naturally during a conversation while knowing where the user is.
  • It listens to both people’s audio and watches the user’s floor-projected head position to decide how the agent should gesture, turn, and look.
  • A causal transformer-based VAE compresses motion into fast-to-handle chunks so the system can stream smoothly without peeking into the future.
  • A flow matching model then generates the next bit of motion in that compact space, keeping movements smooth, expressive, and on time.
  • A gaze guidance control lets you choose how much eye contact the agent should make, from shy to super engaged, without retraining the model.
  • Using a simple, Euclidean motion representation (with 3D points) makes training stable and helps control feet, hands, and head precisely.
  • On the Embody 3D dataset, SARAH hits over 300 FPS and beats strong non-causal baselines on motion quality while matching their gaze alignment—yet runs 3× faster.
  • The method works live on a VR headset, supporting real-time, back-and-forth conversations that feel much more human.
  • Unlike retrieval methods, SARAH generates new motion that fits the current user and speech instead of copying an old clip.
  • You can deploy SARAH in VR companions, telepresence, customer support avatars, social robots, games, and more.

Why This Research Matters

Real conversations aren’t just words; they are full-body signals—turns, nods, steps, and glances. SARAH brings those signals to virtual agents so they feel present and respectful, reacting to where you stand and how you speak. Because it runs causally and very fast, it works on real devices like VR headsets, not just in offline demos. The gaze dial means different people and cultures can choose the eye contact that feels right, boosting comfort and trust. This unlocks better tutors, assistants, and companions who don’t just talk at you—they interact with you. It also raises the bar for social robots and telepresence, making remote interactions feel more human. In short, SARAH helps digital humans act more like real ones, which can make many experiences friendlier and more effective.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how talking to someone feels weird if they never turn toward you or make eye contact, even when you walk around them? It breaks the feeling that they’re really there with you.

🥬 Filling (The Actual Concept)

  • What it is: This paper builds SARAH, a system that makes virtual agents move in a people-smart way, turning, gesturing, and looking at you while you talk—live and in real time.
  • How it works (story of why we need it):
    1. The world before: Many gesture systems could align hand waves with speech sounds, but they ignored where the other person was standing. So agents stared straight ahead and felt like cardboard cutouts.
    2. The problem: Real conversations are spatial. People shift, step closer or farther, pivot their torsos, and adjust eye contact. Agents must react to the user’s movement while keeping gestures in sync with speech—and they must do it causally (no peeking at future frames) so it can run live on a headset.
    3. Failed attempts:
      • Audio-only gesture models forgot the partner’s position, so they couldn’t orient the body toward the user.
      • Some powerful generative models made nice motion but were too slow or needed future frames, so they couldn’t run live.
      • Dyadic datasets often showed people sitting still, so models didn’t learn how to react to walking or repositioning.
    4. The gap: We needed a streaming method that is both spatially aware and controllable, so users can pick how much eye contact they want—without retraining the whole model.
    5. The stakes: If your VR tutor never looks at you when you speak, or your assistant doesn’t turn as you move, the illusion of presence collapses. Natural nonverbal cues—turning, facing, and gaze—make digital humans feel real.
  • Why it matters: Without spatial awareness and control, agents feel robotic and awkward. With them, they become engaging partners you can comfortably talk to.

🍞 Bottom Bread (Anchor) Imagine a VR museum guide. As you walk to the left of a statue and ask a question, the guide smoothly turns, keeps a comfortable distance, gestures while explaining, and makes just the right amount of eye contact—because SARAH is reacting to your position and voice in real time.

🍞 Top Bread (Hook) Imagine drawing with dot-to-dot puzzles: simple points help you trace a clear picture. If the dots are in good places, the drawing is easy; if not, it’s a mess.

🥬 Filling (The Actual Concept: Euclidean Motion Representation)

  • What it is: A simple way to describe body motion using 3D points (and stable orientations) in normal XYZ space instead of tricky angle rotations.
  • How it works:
    1. Represent each joint with a small 3D shape so its position and facing direction are easy to recover.
    2. Add a lightweight mesh “shell” to capture surface geometry.
    3. Normalize motion so the agent starts at a consistent origin and facing.
  • Why it matters: Complex joint angles can wobble and drift; simple 3D positions make learning faster and end-effectors (hands, feet, head) easier to control.

🍞 Bottom Bread (Anchor) Think of placing LEGO bricks on exact studs (3D points) instead of guessing twisty angles—your model stays sturdy and looks right.
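To make the “normalize to a consistent origin and facing” step concrete, here is a minimal Python sketch of normalizing a clip of 3D joint positions. The joint layout, the z-up axis convention, and using joint 1 to estimate facing are illustrative assumptions, not the paper’s exact procedure.

```python
import numpy as np

def normalize_motion(joints):
    """Translate a motion clip so the root starts at the floor origin and
    rotate it so the initial facing direction aligns with +Y.

    joints: (T, J, 3) array of 3D joint positions; joint 0 is the root,
    z is height. An illustrative sketch, not SARAH's actual normalization.
    """
    # Translate so the first-frame root, projected to the floor, is at the origin.
    root0 = joints[0, 0].copy()
    root0[2] = 0.0  # keep heights; only center on the floor plane
    joints = joints - root0

    # Rotate about the vertical axis so the first-frame facing direction
    # (here approximated by the root->joint-1 floor vector) points along +Y.
    d = joints[0, 1, :2] - joints[0, 0, :2]
    angle = np.arctan2(d[0], d[1])          # angle away from +Y
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])       # CCW rotation by `angle`
    joints[..., :2] = joints[..., :2] @ rot.T
    return joints
```

With every clip starting at the same origin and facing, the model never has to relearn the same motion at every position and heading in the room.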

🍞 Top Bread (Hook) You know how a librarian stores big books in small, labeled boxes so it’s easy to pull out what you need fast?

🥬 Filling (The Actual Concept: Variational Autoencoder, VAE)

  • What it is: A VAE is a compressor–decompressor that learns a compact code (latents) for complex motion and can reconstruct it later.
  • How it works:
    1. Encoder squishes detailed motion into short latent tokens.
    2. Decoder unpacks those tokens back into full motion.
    3. It learns to keep what’s important and ignore noise, so generation becomes easier and faster.
  • Why it matters: If we try to predict full motion directly, it’s slow and unstable. In a neat latent space, generation is quicker and more consistent—perfect for streaming.

🍞 Bottom Bread (Anchor) Like zipping a huge video into a small file that still looks great when you uncompress it, SARAH zips motion into latents for speedy real-time use.
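Here is a toy numerical sketch of the compress–decompress idea, using linear maps and the standard VAE reparameterization trick. SARAH’s actual VAE is a causal transformer, so treat these shapes and weight matrices as stand-ins for the real networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(motion_chunk, W_mu, W_logvar):
    """Toy encoder: a linear map from a flattened motion chunk to the
    mean and log-variance of a small latent. Shapes are illustrative."""
    x = motion_chunk.reshape(-1)
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar):
    # The reparameterization trick: z = mu + sigma * eps. In a real
    # framework this form keeps the sampling step differentiable.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, W_dec, chunk_shape):
    """Toy decoder: a linear map from the latent back to a motion chunk."""
    return (W_dec @ z).reshape(chunk_shape)
```

The point of the exercise: a 24-number motion chunk travels through a 4-number latent, and generation only ever has to work in that small space.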

🍞 Top Bread (Hook) Picture writing a story one sentence at a time, only using what you’ve already written—no peeking at future pages.

🥬 Filling (The Actual Concept: Causal Transformer)

  • What it is: A transformer that only looks at the past, not the future, so it can run live.
  • How it works:
    1. It reads tokens (like motion or audio features) in time order.
    2. A causal mask blocks any lookahead.
    3. It updates predictions step by step, streaming as new inputs arrive.
  • Why it matters: If a model needs future frames, it can’t run on a headset in real time. Causality makes live conversations possible.

🍞 Bottom Bread (Anchor) Like a sports commentator who only describes plays as they happen, the causal transformer reacts now without spoilers from the future.
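The “causal mask blocks any lookahead” step fits in a few lines. Here is a minimal sketch of masked attention weights; the score matrix is a placeholder for real query–key products.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores (T x T), then softmax.
    Position t may only attend to positions <= t, so the model never
    sees future frames -- the property that makes live streaming possible."""
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    masked = np.where(mask, -np.inf, scores)           # block any lookahead
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

With uniform scores, frame 0 can only attend to itself, frame 1 splits attention over frames 0–1, and so on; every entry above the diagonal is exactly zero.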

🍞 Top Bread (Hook) Imagine a dance coach guiding a student from a random starting pose into a graceful routine, smoothly and on beat.

🥬 Filling (The Actual Concept: Flow Matching Model)

  • What it is: A generator that learns how to push a noisy guess toward a realistic motion, step by step, guided by context.
  • How it works:
    1. Start from a noisy latent.
    2. Learn a “velocity field” that nudges it toward true motion.
    3. Condition on what matters (user position, both audios, gaze score) to steer the motion appropriately.
  • Why it matters: This keeps motion smooth, expressive, and aligned with the conversation—fast enough for real-time.

🍞 Bottom Bread (Anchor) It’s like turning a scribble into a clean sketch by following arrows that show where to refine next until the picture looks just right.
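The “nudge a noisy guess toward realistic motion” loop is just numerical integration of a learned velocity field. In this toy sketch a hand-written straight-line field stands in for the trained network; in SARAH the field would also be conditioned on user position, both audio streams, and the gaze score.

```python
import numpy as np

def integrate_flow(x0, velocity_fn, n_steps=50):
    """Euler-integrate a velocity field from noise (t=0) to a sample (t=1).
    velocity_fn plays the role of the trained flow matching network."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)   # small push toward realistic motion
    return x

# Toy "trained" field: a straight-line flow toward a fixed target latent.
target = np.array([1.0, -2.0, 0.5])
toy_velocity = lambda x, t: (target - x) / (1.0 - t)
```

Each step applies a small correction, so any noisy starting point is carried smoothly onto the target by t = 1, which is exactly the behavior that keeps generated motion free of jumps.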

🍞 Top Bread (Hook) Think about chatting with a friend: you both listen and talk, and your bodies react to each other.

🥬 Filling (The Actual Concept: Dyadic Audio Conditioning)

  • What it is: Using features from both people’s audio so the agent can time gestures and reactions to who is speaking and how.
  • How it works:
    1. Extract robust speech features (like rhythm and emphasis) from the user and the agent.
    2. Feed both into the motion generator.
    3. The model learns when to gesture more (speaking) or adopt attentive posture (listening).
  • Why it matters: Audio-only from one speaker can’t reveal turn-taking or backchanneling; dyadic audio helps the agent behave like a real partner.

🍞 Bottom Bread (Anchor) When you get excited and speed up your words, your hands move more; when you pause, your friend nods. Dyadic audio helps the agent do that.

🍞 Top Bread (Hook) Have you ever adjusted how much eye contact you make—more with a best friend, less with a shy classmate?

🥬 Filling (The Actual Concept: Gaze Guidance Mechanism)

  • What it is: A simple dial that lets users choose how strongly the agent should look toward them.
  • How it works:
    1. Measure a gaze score based on head facing vs. user direction.
    2. During training, the model learns natural gaze behavior from data.
    3. At runtime, a guidance signal gently steers gaze intensity up or down—without retraining.
  • Why it matters: People have different comfort levels and cultural norms. Control makes the agent feel respectful and personable.

🍞 Bottom Bread (Anchor) In a classroom game, you might pick “medium eye contact” so the agent feels attentive but not intense. Slide the dial, get the feel you want.
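The runtime steering is described as a classifier-free-guidance-style blend. Here is a one-function sketch of that idea; the two velocities would come from running the flow model with and without the gaze condition.

```python
import numpy as np

def guided_velocity(v_uncond, v_gaze, guidance_weight):
    """Classifier-free-guidance-style blend of two flow velocities.
    weight 0 -> ignore the gaze signal, 1 -> fully gaze-conditioned,
    >1 -> exaggerate eye contact. A sketch of the idea, not SARAH's code."""
    return v_uncond + guidance_weight * (v_gaze - v_uncond)
```

Because the blend happens at inference time, the same trained model serves every user: turning the dial only changes how far the velocity is pushed toward the gaze-conditioned prediction.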

02Core Idea

🍞 Top Bread (Hook) Imagine a friendly robot actor who hears both of you, sees where you are standing, and performs the right moves on the fly—no script, no delay.

🥬 Filling (The Actual Concept)

  • The “Aha!” in one sentence: Learn natural, spatially-aware conversation motion from data, then steer eye contact at runtime—while keeping everything causal and fast by generating motion in a compact latent space.

  • Multiple analogies (three ways):

    1. Orchestra analogy: The audio from both people is the sheet music, the user’s position is the conductor’s baton, and the flow model is the ensemble that plays in time; the gaze guidance is the volume knob for the violins (eye contact) that you can turn up or down.
    2. GPS analogy: The VAE compresses the big, messy city map into clear GPS instructions; the flow model follows the best route in real time based on traffic (audio) and destination (user position); gaze guidance is choosing whether to take the scenic view (more eye contact) or a quieter side street (less eye contact).
    3. Puppetry analogy: The Euclidean representation is a sturdy puppet skeleton with easy-to-grab points; the causal transformer is the puppeteer who only reacts to what’s happened; the flow model is the choreography; the gaze dial is deciding how often the puppet looks at the audience.
  • Before vs After:
    • Before: Agents gestured to speech but ignored where you stood, or they needed to peek into the future to look smart, which meant they couldn’t run live. No control over eye contact.
    • After: Agents turn toward you, gesture naturally, and keep eye contact tuned to your preference—all in real time on a headset.

  • Why it works (intuition, no equations):
    • Compressing motion into a learned latent space makes the pattern of human movement more regular and easier to predict quickly.
    • Causal transformers prevent future-leaking, so the system can respond live.
    • Flow matching learns gentle pushes from noisy guesses toward realistic motion, keeping movement smooth and expressive.
    • Conditioning on user position plus both audios gives the model the social context (where is the partner, who is speaking, how strongly) that drives real conversational body language.
    • Gaze guidance separates learning (what looks natural) from control (what you want now), so users pick eye contact without breaking realism.

  • Building blocks (introduced once with Sandwich pattern):

    1. Euclidean Motion Representation (already introduced above)
    2. Variational Autoencoder (VAE) (already introduced)
    3. Causal Transformer (already introduced)
    4. Flow Matching Model (already introduced)
    5. Dyadic Audio Conditioning (already introduced)
    6. Spatially Aware Motion Generation 🍞 Hook: You know how you angle your body toward a friend to show you’re listening? 🥬 Concept: The model makes the agent’s full body react to the user’s place and speech in real time—turning, stepping, and gesturing appropriately.
      • How: It takes user position and both audios, then generates compact motion latents that decode into full-body motion, all causally.
      • Why: Without spatial awareness, agents feel stiff and disconnected. 🍞 Anchor: As you circle around the agent while asking a question, they pivot and keep facing you while gesturing in time with your words.
    7. Gaze Guidance Mechanism (already introduced)

🍞 Bottom Bread (Anchor) Think of SARAH as a live stage performer who can hear the audience, see where they’re seated, adapt their focus level, and still hit every mark right on time—night after night.

03Methodology

At a high level: Input (user floor-projected head position + both audios) → [Causal VAE encodes and decodes motion into compact latent tokens that support streaming] → [Flow Matching model generates the next motion latents conditioned on position + audio + optional gaze score] → Output (full-body 3D motion), all in real time.

Step-by-step recipe

  1. Inputs and preprocessing
  • What happens: The system reads two things: (a) the user’s 2D floor-projected head position over time, and (b) speech features from both user and agent (dyadic audio). Audio features come from a robust speech model that captures rhythm, pauses, and emphasis.
  • Why this step exists: Position tells the agent where to turn and how to orient; dyadic audio reveals who’s speaking and how strongly, guiding gesture timing and listening posture.
  • Example: The user moves two steps to the agent’s right while asking a question in a rising tone; the agent “hears” that rise and “sees” that new position.
  2. Euclidean Motion Representation (already introduced)
  • What happens: Motion is represented by 3D points and stable orientations for joints, plus a light mesh shell for surface geometry, normalized to a consistent start.
  • Why this step exists: It makes training stable and precise control of hands, feet, and head easier; avoids angle ambiguities.
  • Example: During a step, the foot point stays planted with minimal slide; the head point rotates smoothly toward the user.
  3. Causal Transformer-based VAE with interleaved latents
  • What happens: The VAE compresses and reconstructs motion using a causal transformer so it can stream. Latent tokens are interleaved at fixed time strides, and both encoder/decoder only attend to the past.
  • Why this step exists: Compression speeds up generation and keeps motion coherent over time; causality enables real-time use without future frames.
  • Example: For every small chunk of motion (like 4 frames), a new latent token summarizes those frames; the decoder can rebuild them on the fly.
  4. Flow Matching generator in latent space
  • What happens: Starting from a noisy guess of the next latent, the flow model learns how to nudge it toward realistic motion, conditioned on user position, both audios, and optional gaze score.
  • Why this step exists: Predicting directly in raw motion is hard and slow; guiding a compact latent with small, smart pushes is stable and fast.
  • Example: The agent is mid-turn. The model refines the next latent so the turn completes just as the user finishes a sentence, then settles into a listening pose.
  5. Strict causality and streaming autoregression
  • What happens: The model keeps a history of previously generated latent tokens and uses causal attention. It “inpaints” recent history each step to maintain temporal smoothness without explicitly feeding back decoded motion (which can cause collapse).
  • Why this step exists: Enforcing causality is essential for live systems; the history imputation keeps motion continuous as each new chunk is generated.
  • Example: Every 4 frames, the system adds a new latent, updates the smoothness of the last few latents, and renders motion without jumps.
  6. Gaze guidance control at inference
  • What happens: A per-frame gaze score (from -1 facing away to +1 facing the user) can be passed in to gently pull head orientation toward the desired amount of eye contact using a guidance trick similar to “classifier-free guidance.”
  • Why this step exists: People want different eye contact levels, and this lets you adjust without retraining or breaking realism.
  • Example: Set the gaze dial to 0.8 for warm engagement; the agent looks toward the user more consistently but still blinks and glances naturally.
  7. Decoding and rendering
  • What happens: The generated latents are decoded by the VAE decoder into full-body motion frames in the Euclidean representation; then a rendering system can turn those into a photorealistic avatar.
  • Why this step exists: The latent space is for speed; decoding gives the actual body poses that can be displayed.
  • Example: The final animation shows the agent stepping, gesturing, and facing the user’s new position in sync with the conversation.
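The recipe above can be sketched as a single streaming loop. The callback names (`sample_next_latent`, `decode_chunk`, `context`) are placeholders standing in for the flow sampler, the VAE decoder, and the live position/audio features.

```python
import numpy as np

CHUNK = 4  # frames summarized by one latent token (illustrative stride)

def stream_motion(n_frames, sample_next_latent, decode_chunk, context):
    """Causal streaming loop: every CHUNK frames, sample one new latent
    conditioned only on past latents and the current user position/audio,
    then decode it to motion frames. Callback names are placeholders."""
    history, frames = [], []
    for start in range(0, n_frames, CHUNK):
        z = sample_next_latent(history, context(start))  # past-only conditioning
        history.append(z)
        frames.extend(decode_chunk(history))             # decoder is also causal
    return np.array(frames[:n_frames])
```

Nothing in the loop ever touches a frame that hasn’t happened yet, which is the whole reason the pipeline can run live on a headset.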

What breaks without each step

  • No user position: The agent won’t know where to face.
  • No dyadic audio: The agent can’t time gestures to speech or switch into attentive listening.
  • No Euclidean representation: Feet may slide, hands may drift, and training may wobble.
  • No VAE compression: Generation slows and motion quality drops as the model struggles with raw high-dimensional data.
  • No flow matching: Movements become less smooth and expressive; timing suffers.
  • No strict causality: The system can’t run live on a headset.
  • No gaze guidance: You can’t adjust comfort levels of eye contact for different users or settings.

The secret sauce

  • Compressing motion into causal, interleaved latent tokens, then using a flow-matching generator conditioned on user position and both audios—plus a tiny, powerful gaze dial—lets SARAH react naturally and fast, without peeking into the future. It’s the combination that makes it both human-like and deployable.

04Experiments & Results

The test: What did they measure and why?

  • They measured five things to judge realism, smoothness, physical plausibility, expressiveness, and social alignment:
    1. Fréchet Gesture Distance (FGD): How close the distribution of generated poses is to real human motion—lower is better.
    2. FGD on acceleration: Captures motion dynamics and smoothness—lower is better.
    3. Foot Slide: How often feet slip while on the ground—lower is better.
    4. Wrist Variance: How lively and expressive the hand movements are—higher is better.
    5. Head Angle alignment: How much the agent faces the user—higher is better.
  • They also looked at speaking vs. non-speaking clips separately to ensure the agent behaves differently when talking or listening.
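As a concrete example of one metric, here is a common way to compute a foot-slide score: average horizontal foot displacement during ground-contact frames. The contact threshold and exact formula are assumptions; the paper’s precise definition may differ.

```python
import numpy as np

def foot_slide(foot_pos, contact_height=0.05):
    """Approximate foot-slide metric: mean horizontal displacement of a
    foot between consecutive frames where it stays in ground contact
    (height below a threshold). foot_pos: (T, 3) array, z is height.
    A generic sketch of this metric family, not the paper's exact code."""
    in_contact = (foot_pos[:-1, 2] < contact_height) & (foot_pos[1:, 2] < contact_height)
    disp = np.linalg.norm(foot_pos[1:, :2] - foot_pos[:-1, :2], axis=-1)
    return disp[in_contact].mean() if in_contact.any() else 0.0
```

A planted foot scores 0; a foot that skates along the floor racks up displacement, which is exactly the “ice-skating” artifact this metric penalizes.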

The competition: Who was compared?

  • Retrieval baselines: Random clip and Nearest Neighbor (NN) from the dataset.
  • Generative baselines: MDM (diffusion), A2P (VQ + diffusion), SHOW (VQ autoregressive, audio-only upper-body originally).
  • Ablations: The same SARAH pipeline but (a) using joint angles with IK, and (b) removing the VAE.

The scoreboard (with context):

  • Speed: SARAH runs at over 300 FPS, about 3× faster than strong non-causal baselines (around 90 FPS). SHOW runs at 230 FPS but with much worse physical and social quality.
  • FGD: SARAH gets 1.28. That’s like earning an A- while MDM gets 3.48 (a D) and A2P gets 2.01 (a C). Retrieval NN scores look good (0.90) because it copies real data—but it can’t respond to new user paths.
  • Foot Slide: SARAH hits 0.01—basically clean shoes—while SHOW slides 27× more (0.27). This shows the Euclidean representation and learned prior help keep feet planted.
  • Wrist Variance: Ground truth is ~138. SARAH at ~105 is lively and natural. MDM at ~61 is under-expressive (muted gestures), A2P ~69 also damped, SHOW ~65 struggles with fine dynamics.
  • Head Angle (gaze alignment): SARAH at 0.83 matches or edges out the best non-causal method (MDM ~0.81) even without future frames. NN reaches 0.59 and Random 0.28, showing the value of true generation over retrieval.

Surprising findings:

  • Causal can match non-causal gaze: Even without future user positions, SARAH learns to orient responsively, challenging the belief that you must peek ahead for good alignment.
  • Moderate gaze guidance can improve overall quality: Setting the gaze dial around 0.8 not only boosts alignment but can also slightly improve motion metrics, likely by providing extra spatial grounding.
  • VAE matters for distributional quality and speed: Removing it raises FGD and halves FPS, confirming that latent compression is key for fast, realistic generation.

Case-by-case insights:

  • Retrieval looks good on some metrics but fails on social reactivity: It can’t match the user’s unique path or timing.
  • SHOW (audio-only) lacks spatial awareness by design: Without the user’s position, it can’t face properly, and its modular VQ design struggles with full-body ground contact.
  • Joint-angle ablation hurts precision: Angle ambiguity reduces gaze accuracy and increases foot sliding, supporting the choice of Euclidean representation.

Bottom line: SARAH sets a new bar for real-time, spatially-aware conversational motion—quality on par with (or better than) slower, non-causal systems, plus user-controllable gaze, all streaming in real time.

05Discussion & Limitations

Limitations (be specific):

  • Data bias: If certain distances, walking patterns, or gaze habits are rare in training, the model may underperform there.
  • Control is limited mostly to gaze: Gesture style, locomotion speed, and personal mannerisms aren’t yet exposed as easy dials.
  • Two-person focus: Multi-party conversations need architectural updates (e.g., multiple user positions, social attention across partners).
  • Audio features: While streaming is enforced at inference, training may rely on features that originally used broad context; careful streaming setup is required to remain fully causal live.

Required resources:

  • A VR/AR device (or camera) to track the user’s head position in real time.
  • Two audio streams (user and agent) and a speech feature extractor.
  • A GPU for training; for inference, a modern device can run the real-time pipeline.

When NOT to use it:

  • Crowds or multi-speaker group chats where the agent must track several people at once.
  • Action-heavy tasks (e.g., dancing, sports) where footwork and timing differ from conversational norms.
  • Situations needing strong style imitation or cultural gestures beyond what the dataset covers.

Open questions:

  • How to generalize beyond dyads to triads or groups while maintaining speed and natural social behavior?
  • How to add more controls (gesture energy, posture openness, personal style) without hurting realism?
  • How to adapt on the fly to a user’s unique comfort with gaze and distance, learning preferences during the session?
  • How to combine language understanding with motion to reflect sarcasm, humor, or subtle social cues more deeply?

06Conclusion & Future Work

Three-sentence summary: SARAH is a real-time system that generates full-body, spatially-aware conversational motion by listening to both speakers and tracking the user’s position. It combines a causal transformer-based VAE (for fast, coherent latents) with a flow matching model (for smooth, expressive motion) and adds a user-controlled gaze dial to adjust eye contact on demand. On the Embody 3D dataset, it achieves state-of-the-art quality while running over 300 FPS and deploys live on VR headsets.

Main achievement: Decoupling learning from control—training on natural gaze behavior while allowing simple, real-time gaze steering—inside a fully causal, streaming motion pipeline that stays fast and realistic.

Future directions: Extend to multi-party conversations, add controls for gesture style and locomotion, adapt to individual users’ comfort over time, and integrate deeper semantic understanding for richer nonverbal communication.

Why remember this: SARAH shows that you don’t need to peek into the future to look socially smart—if you design the right compact space, real-time generator, and gentle control. It turns avatars from stiff talkers into responsive partners who move and look like they’re truly there with you.

Practical Applications

  • VR tour guides that turn toward visitors, gesture clearly, and keep comfortable eye contact levels during museum or campus tours.
  • Telepresence avatars for remote meetings that face the right person and show attentive listening and speaking gestures in real time.
  • Language learning tutors that use natural body language and gaze to reinforce pronunciation and engagement.
  • Customer service kiosks with avatars that orient to the customer’s position and adjust gaze to reduce social discomfort.
  • Therapy and coaching simulations where controlled gaze and posture help practice social skills safely.
  • NPCs in games that react to the player’s position and speech tone, making encounters feel alive.
  • Classroom assistants in AR that move and look toward students who ask questions, encouraging participation.
  • Training simulations (e.g., retail, hospitality) where role-play partners respond spatially and socially like real customers.
  • Fitness or dance instructors that face you, demo moves with minimal foot sliding, and adjust attention based on your cues.
  • Interactive exhibits and theme-park characters that keep the magic by staying responsive as guests move around.
Tags: spatially aware motion · real-time avatars · causal transformer · variational autoencoder · flow matching · dyadic audio · gaze guidance · Euclidean motion representation · proxemics · oculesics · gesture generation · VR telepresence · classifier-free guidance · human–AI interaction · 3D character animation