EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Key Summary
- EmbodMocap is a low-cost, portable way to capture people moving inside real places using just two iPhones, so computers and robots can learn from real life instead of studios.
- The key trick is making both phones "agree" on the same real-world ruler and map, so humans and the scene are reconstructed at true size and in the same world coordinates.
- Two moving views beat one: dual views remove depth confusion (how far things are) and fix occlusions, giving much better 3D alignment than single-view or monocular models.
- They first build a true-scale 3D scene from one RGB-D scan, then record human motion with two synchronized iPhones, then align the phone paths to the scene, and finally refine body motion in the world frame.
- Compared to studio-grade optical motion capture, the dual-view iPhone setup cut motion errors substantially (e.g., W-MPJPE ~73 mm vs. ~123 mm for a monocular model) and localized people to within about 5 cm in the scene.
- Their data improves feedforward monocular reconstruction models, powers physics-based character skills (like sit, climb, lie, support), and trains real humanoid robots via sim-to-real reinforcement learning.
- The system avoids suits, markers, or fixed rigs, so it works indoors and outdoors, keeping people's natural look in videos.
- Limits include iPhone depth beyond ~5 m, scenes with lots of moving stuff (hurts SLAM), and very bright light (hurts COLMAP), but they suggest fixes like better localization tools.
- This work lowers the barrier for embodied AI, letting more teams collect high-quality, scene-aware motion data in the wild.
- Bottom line: two smartly calibrated phones can unlock rich, metric 4D human-scene data for perception, animation, and real robot control.
Why This Research Matters
Robots, AR/VR avatars, and game characters need real, true-to-size motion and scene data to act naturally. EmbodMocap makes that data cheap and portable by using two iPhones, removing the need for studios, suits, or markers. Dual views fix depth ambiguity and occlusions, so hands really land on tables and feet don't slide through floors. Because everything is in one metric world frame, the same data works for perception, animation, and robot control. This opens the door to much larger, more diverse datasets collected in everyday places, not just labs. With better data, embodied AI can learn faster and perform safer, more reliable actions in our homes and cities.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you want to teach a robot to move around your living room without bumping into your furniture. You'd need to show it real people in real rooms, not just cartoons on a green screen.
The Concept: Embodied AI is about building agents (like robots or virtual characters) that can see, understand, and act in the real world.
- How it works (big picture):
- Show the agent examples of people moving in real spaces.
- Reconstruct those people and the rooms in 3D so sizes and distances are correct.
- Train the agent to copy and adapt those motions with physics so it doesn't slip or fall.
- Why it matters: Without real, true-to-size data, agents learn bad habits, like thinking a sofa is tiny or that a foot can pass through a table.
Anchor: Think of a robot learning to sit on a real couch. If the couch is the wrong size in its data, it will miss the seat.
Hook: You know how 3D movies look real because both eyes see from slightly different places? Now imagine trying to capture that 3D feel for a whole room and a moving person using normal phones.
The Concept: 4D human-scene data means 3D people and 3D rooms over time (the 4th dimension is time).
- How it works:
- Capture images and depths (how far things are) as someone moves.
- Rebuild the room and the personās body in 3D for every frame.
- Keep everything in the same measurement system (meters) and the same world map so they match perfectly.
- Why it matters: If people and rooms don't align in the same world frame, hands won't touch tables, and feet will slide through floors.
Anchor: Picture a video where a person sits on a chair: correct 4D capture makes the hips meet the seat, not float above or clip through.
Hook: Have you ever tried to guess how far away a mountain is in a photo? It's hard because a single picture hides depth.
The Concept: Depth ambiguity means, from one view, you can't tell exactly how far something is.
- How it works:
- With just one camera, the same 2D picture could come from many 3D positions.
- Two cameras from different angles let you triangulate and solve the distance.
- Align both cameras to the same world map to get real measurements.
- Why it matters: If you get depth wrong, people look like they "slide" toward or away from the camera and don't match the scene.
Anchor: Two kids on opposite sides of a room can point to the same lamp and agree where it is in 3D; one kid alone might guess wrong.
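To make the triangulation idea concrete, here is a minimal sketch (not the paper's code) that recovers a 3D point from two viewing rays by finding where the rays come closest; the camera positions and directions are illustrative values.

```python
# A minimal sketch: recover a 3D point from two viewing rays by finding where
# the rays come closest. Camera centers and directions below are illustrative.
import numpy as np

def intersect_rays(c1, d1, c2, d2):
    """Midpoint of the closest points on two rays (centers c, directions d)."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    # Solve t1, t2 minimizing |(c1 + t1*d1) - (c2 + t2*d2)| in the least-squares sense.
    A = np.stack([d1, -d2], axis=1)                      # 3x2 system
    t1, t2 = np.linalg.lstsq(A, c2 - c1, rcond=None)[0]
    return 0.5 * ((c1 + t1 * d1) + (c2 + t2 * d2))

# Two viewers a couple of meters apart both look toward the same lamp.
lamp = intersect_rays(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 1.0]),
                      np.array([2.0, 0.0, 0.0]), np.array([-1.0, 0.0, 1.0]))
print(lamp)  # ~[1, 0, 1]: one view alone could not pin down this distance
```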
The world before: Capturing high-quality 4D human-scene data usually needed expensive mocap studios, fixed multi-camera rigs, marker suits, or LiDAR scanners. Internet videos weren't enough because one view hides depth and occlusions (when one body part blocks another). Wearable sensors change how people look in RGB video and require careful syncing with 3D scenes.
The problem: We needed a cheap, portable way to collect accurate, true-to-size 4D data in everyday places (living rooms, gardens, stairs) without suits, markers, or fixed rigs.
Failed attempts: Monocular methods (one camera) estimate people pretty well but struggle with real-world scale and long sequences with moving cameras. Studio methods are precise but expensive and stuck indoors. Wearables help motion but complicate scene alignment and natural appearance.
The gap: No simple, scalable pipeline could reliably tie together a moving human and a moving camera into one true-scale world with both human motion and scene geometry captured together, outside a studio.
Real stakes: This affects how fast we can teach robots to help at home, make AR/VR avatars feel real, improve sports analysis, and power games and films that respect physics. If data is cheap and easy to collect anywhere, more people can build smarter embodied agents.
Anchor: Think about teaching a home robot to carry a box down real stairs. With real, accurate data of people on stairs, the robot learns where each step is, how high to lift its feet, and how to keep balance, no fancy studio required.
02 Core Idea
Hook: You know how two friends can lift a couch better than one? Two phones can "lift" a 3D scene better than one, too.
The Concept: The aha! moment is to use two moving iPhones and make them agree on one true-to-size world map so both the person and the place are reconstructed together, accurately.
- How it works:
- Scan the static scene once with one iPhone to set the worldās meter scale and āupā direction.
- Record the human with two synchronized iPhones from different angles (dual RGB-D).
- Align both phone paths to the sceneās world using feature matching and optimization.
- Triangulate body keypoints and refine the body motion so it sits correctly in the scene.
- Why it matters: This dual-view setup crushes depth ambiguity, fixes occlusions, and ties motion to the real scene, all with cheap phones.
Anchor: It's like building a Lego city (the scene) with a measured ruler first, then adding a Lego minifigure (the person) that fits exactly on the bench at the right spot.
Three analogies:
- Binocular vision: Like your two eyes, two iPhones see from different angles, so the brain (algorithm) can judge distance and true size.
- Treasure map + compass: First make the map (scene) with north-up and a scale bar, then mark where the traveler (person) walks so everything lines up.
- Team photography: Two photographers follow the action, then a scrapbook editor matches both albums to the same trip timeline and city map.
Before vs. After:
- Before: One camera guesses depth; studio rigs are pricey; wearables change appearances; stitched results can drift.
- After: Two handheld phones align humans and scenes metrically, outside studios, with good accuracy and low cost.
Why it works (intuition):
- Two views break depth ties: the same pixel seen from two positions pins down a 3D point.
- World anchoring: The early scene scan gives a "meter stick" and gravity-up, so later recordings can be snapped into place.
- Joint optimization: Combining point tracks, scene alignment, and reprojection checks keeps everything self-consistent.
Building blocks (each with a simple sandwich):
Hook: Imagine measuring a room once so every later photo knows the real size. The Concept: Metric-scale world frame is a shared 3D coordinate system in meters with a fixed "up" direction.
- Steps: (1) Scan room with RGB-D; (2) estimate camera path with SLAM; (3) fuse depths into a mesh at real scale.
- Why it matters: Without this, motion and scene won't match size or gravity. Anchor: If the table is 0.75 m tall in the map, the person's hand lands right on it.
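For intuition, here is a minimal sketch of the depth-fusion step using Open3D's TSDF volume, assuming the scan's posed RGB-D frames are already saved to disk; the file names, intrinsics, and frame list are placeholders, not the paper's exact pipeline (which takes poses from the SpectacularAI SLAM output).

```python
# A minimal sketch: fuse posed depth maps into one metric mesh with Open3D.
import numpy as np
import open3d as o3d

intrinsic = o3d.camera.PinholeCameraIntrinsic(1920, 1440, 1456.0, 1456.0, 960.0, 720.0)

volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.02,   # 2 cm voxels: fine enough for furniture-scale geometry
    sdf_trunc=0.08,
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

# Placeholder frame list: (color image, depth image, world-to-camera 4x4 pose).
frames = [("color_0000.png", "depth_0000.png", np.eye(4))]

for color_path, depth_path, extrinsic in frames:
    color = o3d.io.read_image(color_path)
    depth = o3d.io.read_image(depth_path)
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1000.0, depth_trunc=5.0,  # iPhone depth fades past ~5 m
        convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, extrinsic)

mesh = volume.extract_triangle_mesh()
mesh.compute_vertex_normals()
o3d.io.write_triangle_mesh("scene_metric.ply", mesh)  # the gravity-aligned, meter-scale "ruler"
```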
Hook: Two friends filming a soccer play from different sides make it easier to see where the ball really is. The Concept: Dual RGB-D calibration is making both phones agree on the same map and clock.
- Steps: (1) Sync videos; (2) find shared scene features; (3) solve for the rigid transform into the scene's world; (4) refine with multi-view constraints.
- Why it matters: It removes depth confusion and drift. Anchor: Both phones' paths line up on the same field so the player's run looks smooth and true to size.
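A minimal sketch of step (3), assuming matched 3D points between a phone's SLAM trajectory and the scene's world frame are available; this is a standard Umeyama-style closed form for scale, rotation, and translation, not the paper's exact solver.

```python
# A minimal sketch: estimate scale, rotation, and translation mapping one path
# onto matched positions in the scene's world frame (Umeyama-style).
import numpy as np

def similarity_transform(src, dst):
    """Return s, R, t so that dst ~= s * R @ src + t (src, dst are Nx3 matched points)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(dst_c.T @ src_c)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))            # guard against a reflection solution
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (src_c ** 2).sum()   # least-squares scale
    t = mu_d - s * R @ mu_s
    return s, R, t

# Toy example: the SLAM path is half scale and shifted relative to the scene world.
slam_path = np.array([[0, 0, 0], [0.5, 0, 0], [0.5, 0.5, 0], [0, 0.5, 0.25]], dtype=float)
scene_pts = 2.0 * slam_path + np.array([1.0, 2.0, 0.0])
s, R, t = similarity_transform(slam_path, scene_pts)
print(round(s, 3), t)  # ~2.0 and ~[1, 2, 0]: the recovered meter scale and offset
```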
Hook: Dot-to-dot puzzles become 3D if you look from two sides. The Concept: 3D keypoint triangulation estimates joint positions in 3D from their 2D spots in two views.
- Steps: (1) detect 2D joints; (2) use known cameras; (3) solve the 3D point that reprojects best in both views.
- Why it matters: It gives reliable skeletons tied to the world. Anchor: The left wrist appears at pixel A in phone 1 and pixel B in phone 2; together they reveal one 3D wrist point.
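Here is a minimal sketch of that wrist example using OpenCV's linear triangulation; the camera matrices and pixel detections are made-up values chosen so the answer is easy to check by hand.

```python
# A minimal sketch: triangulate one joint from its 2D detections in two views.
import numpy as np
import cv2

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                  # camera 1 at the origin
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # camera 2 one meter to the right

pix1 = np.array([[840.0], [360.0]])  # left wrist detected in view 1
pix2 = np.array([[640.0], [360.0]])  # the same wrist detected in view 2

X_h = cv2.triangulatePoints(P1, P2, pix1, pix2)   # homogeneous 4x1 result
wrist = (X_h[:3] / X_h[3]).ravel()
print(wrist)  # ~[1, 0, 5]: one 3D wrist point, 5 m in front of camera 1
```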
Hook: Think of a bendy action figure whose pose you adjust until its shadow matches the photo. The Concept: SMPL body fitting refines a realistic human mesh to match keypoints, masks, and depth.
- Steps: (1) start from an initial pose; (2) minimize errors to 3D joints and silhouettes; (3) keep motion smooth.
- Why it matters: It turns noisy estimates into a clean, realistic body moving in the real scene. Anchor: The knees line up with the pant shapes and the feet press the floor without slipping.
One gentle formula to anchor intuition:
- We align a path with a simple rule: p' = s · R · p + t (scale it, rotate it, then shift it). For example, if s = 1, R is the identity, t = (1, 0, 0), and p = (2, 3, 0), then p' = (3, 3, 0).
This core idea lets cheap devices capture rich, true-to-life 4D data for perception, animation, and robots.
03 Methodology
At a high level: Input (two synchronized iPhone RGB-D videos + one scene scan) → Stage I (build true-scale scene) → Stage II (per-view cameras and human priors) → Stage III (align both phone paths to the scene) → Stage IV (triangulate and refine human motion in world coordinates) → Output (metric 4D human + scene).
Stage I – Scene Reconstruction (set the world's ruler)
Hook: You know how builders set a level and a tape measure before they start? We do that for the room first. The Concept: Metric-scale scene reconstruction creates a real-sized 3D mesh and a shared world frame.
- What happens:
- Record one RGB-D scan of the empty scene with an iPhone.
- Use a SLAM tool (SpectacularAI) to recover the phoneās path and camera parameters in a gravity-aligned, meter-based world.
- Clean and fuse depth maps (e.g., with TSDF fusion) into a dense scene mesh.
- Extract features and build a sparse COLMAP database to help align future videos.
- Why this step exists: It locks in meters and "up," so later people and cameras can snap into the same world. Anchor: After this step, "the floor is here, 0 m high" and "the table is 0.75 m tall," ready for motion to be added.
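For the sparse-model bullet above, here is a minimal sketch of building the COLMAP database and sparse reconstruction with the pycolmap bindings; the paths are placeholders and the default options used here are an assumption rather than the paper's exact settings.

```python
# A minimal sketch: build the sparse COLMAP model used to register later videos.
import pycolmap

database = "scene/colmap.db"
images = "scene/scan_frames"     # frames exported from the scene scan
sparse_out = "scene/sparse"

pycolmap.extract_features(database_path=database, image_path=images)  # per-image local features
pycolmap.match_exhaustive(database_path=database)                     # pairwise feature matching
maps = pycolmap.incremental_mapping(database_path=database,
                                    image_path=images,
                                    output_path=sparse_out)           # sparse model + camera poses
print("registered scan images:", len(maps[0].images))
```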
Stage II – Sequence Processing (prep both camera streams)
Hook: Think of two camcorders filming the same play and a clapperboard to sync them. The Concept: Dual RGB-D sequence processing calibrates each phone per frame and extracts human hints.
- What happens:
- For each view: get intrinsics/extrinsics per frame from SLAM.
- Detect the person (YOLO), 2D keypoints (ViTPose), and the mask (SAM2), and refine depth (PromptDA).
- Sync the two videos using a laser pointer cue (find the same event frame).
- Why this step exists: Clean per-frame camera geometry and human cues are needed for accurate alignment later. Anchor: Now each frame knows where the camera is, where the person is, and which pixels belong to the person.
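A minimal sketch of the laser-pointer synchronization idea: find the first frame in each video where a saturated red blob appears and use the index difference as the frame offset. The video paths, threshold, and blob size are illustrative assumptions, not the paper's detector.

```python
# A minimal sketch: sync two videos by locating the laser-flash frame in each.
import cv2
import numpy as np

def flash_frame(video_path, red_thresh=250, min_pixels=20):
    """Index of the first frame containing enough near-saturated red pixels."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        red = frame[:, :, 2].astype(np.float32)   # OpenCV frames are BGR
        if (red > red_thresh).sum() > min_pixels:
            cap.release()
            return idx
        idx += 1
    cap.release()
    return None

offset = flash_frame("view1.mp4") - flash_frame("view2.mp4")
print(f"shift view2 by {offset} frames to line it up with view1")
```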
Stage III – Sequence Calibration (make both phones agree with the scene)
Hook: When merging two photo albums from the same trip, you match landmarks so the maps align. The Concept: COLMAP registration and multi-view optimization align both phone trajectories into the scene's world.
- What happens:
- Initial alignment: register each phone's images to the scene's sparse COLMAP model to get a rough world pose.
- Solve a rigid transform from each SLAM path to the COLMAP world using the same simple formula p' = s · R · p + t (scale, rotation about z, translation). For a concrete example, suppose s = 2, R rotates 45° about z, t = (1, 0, 0), and p = (1, 0, 0). Rotating p by 45° about z gives approximately (0.71, 0.71, 0). Then p' = 2 · (0.71, 0.71, 0) + (1, 0, 0) ≈ (2.41, 1.41, 0).
- Refine with three kinds of constraints:
- Point tracking: back-project tracked pixels with depths from both views and make their 3D points agree (sketched in code just below).
- Chamfer distance: match each viewās local scene point cloud to the global mesh so backgrounds align.
- Reprojection / bundle adjustment: points matched by COLMAP should reproject correctly in each aligned camera.
- Why this step exists: Without it, small misalignments turn into big errors in depth and position over time. Anchor: After optimization, a person touching a table in view 1 also touches the same table spot in view 2 and in the world mesh.
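As referenced above, here is a minimal sketch of the point-tracking check: lift the same tracked pixel from each view into world coordinates using its depth and the view's aligned pose, then compare the two 3D points. The intrinsics, poses, and pixel/depth values are illustrative.

```python
# A minimal sketch: back-project one tracked pixel from each view and compare.
import numpy as np

def backproject(pixel, depth, K, cam_to_world):
    """Lift a pixel plus metric depth into world coordinates."""
    ray = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    cam_point = ray * depth                               # point in the camera frame
    return (cam_to_world @ np.append(cam_point, 1.0))[:3]

K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
T1 = np.eye(4)                   # view 1 already expressed in the world frame
T2 = np.eye(4); T2[0, 3] = 1.0   # view 2 sits one meter to the right

p1 = backproject((840, 360), 5.0, K, T1)   # tracked pixel + depth seen from view 1
p2 = backproject((640, 360), 5.0, K, T2)   # the same physical point seen from view 2
print(np.linalg.norm(p1 - p2))  # ~0 when the two trajectories are well aligned
```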
A friendly distance formula we use:
- Chamfer distance measures how close two point sets are: for sets A and B, it adds up the nearest-neighbor distances in both directions (from each point of A to its closest point in B, and from each point of B to its closest point in A). Example: if A = {0, 2} and B = {1, 5} on a number line, the nearest distances are 1 and 1 from A to B, and 1 and 3 from B to A, so the Chamfer distance is 1 + 1 + 1 + 3 = 6.
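The same toy example in code, as a minimal sketch using a k-d tree for the nearest-neighbor lookups (the paper's implementation may differ):

```python
# A minimal sketch of the summed Chamfer distance between two point sets.
import numpy as np
from scipy.spatial import cKDTree

def chamfer(A, B):
    """Sum of nearest-neighbor distances from A to B and from B to A."""
    d_ab, _ = cKDTree(B).query(A)   # closest point in B for every point of A
    d_ba, _ = cKDTree(A).query(B)   # closest point in A for every point of B
    return d_ab.sum() + d_ba.sum()

A = np.array([[0.0], [2.0]])   # points on a number line
B = np.array([[1.0], [5.0]])
print(chamfer(A, B))  # 1 + 1 (A to B) plus 1 + 3 (B to A) = 6
```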
Stage IV – Motion Optimization (perfect the human in the world)
Hook: Like adjusting a puppet so its joints match the shadows from two spotlights. The Concept: 3D keypoint triangulation and SMPL fitting refine the person's full pose and position.
- What happens:
- Triangulate 3D joints from both views using their known cameras, choosing 3D points that reproject best to both 2D detections.
- Fit the SMPL body model to these 3D joints, the silhouette, and the depth, with smoothness and motion priors.
- Why this step exists: It removes jitters, fixes foot skating, and locks contacts to surfaces. Anchor: Feet stay planted on the floor when standing still; hands land on the tabletop, not inside it.
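A minimal sketch of the fitting loop with PyTorch and the smplx package, assuming the triangulated world-frame joints are already available; the model path, joint indexing, loss weights, and the use of only a 3D-joint term plus a smoothness term are simplifying assumptions, not the paper's full objective.

```python
# A minimal sketch: fit SMPL pose and translation to triangulated 3D joints.
import torch
import smplx

T = 120                                        # number of frames in the clip
target_joints = torch.zeros(T, 24, 3)          # placeholder: triangulated joints per frame

model = smplx.create("models", model_type="smpl", batch_size=T)  # needs SMPL model files on disk
body_pose = torch.zeros(T, 69, requires_grad=True)
global_orient = torch.zeros(T, 3, requires_grad=True)
transl = torch.zeros(T, 3, requires_grad=True)
optim = torch.optim.Adam([body_pose, global_orient, transl], lr=0.02)

for _ in range(300):
    out = model(body_pose=body_pose, global_orient=global_orient, transl=transl)
    joints = out.joints[:, :24]                               # first 24 SMPL skeleton joints
    loss_3d = ((joints - target_joints) ** 2).sum(-1).mean()  # match triangulated 3D joints
    loss_smooth = ((joints[1:] - joints[:-1]) ** 2).mean()    # discourage jitter / foot skating
    loss = loss_3d + 0.1 * loss_smooth
    optim.zero_grad()
    loss.backward()
    optim.step()
```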
The "secret sauce"
- Two moving views crush depth ambiguity and occlusion.
- A prebuilt metric scene gives a stable ruler and gravity, avoiding drift.
- Multi-constraint optimization (tracks + Chamfer + reprojection) cross-checks everything, so errors can't easily sneak through.
A simple combined-loss idea:
- Suppose the total calibration loss is L_total = L_track + L_chamfer + L_reproj. If L_track = 0.2, L_chamfer = 0.5, and L_reproj = 0.3, then L_total = 1.0.
Monocular baseline finetuning (using the captured pairs)
Hook: If you give a student the answer key with the steps, they learn faster. The Concept: Monocular human-scene reconstruction model finetuning means teaching single-view models with our paired RGB-D, cameras, and SMPL so they predict in the right world frame and scale.
- What happens: Align chunked camera trajectories (e.g., with Procrustes), estimate a metric scale by comparing depths (e.g., s = median of the ratios d_metric / d_predicted; for example, if the metric depths are (2, 4, 6) and the predicted depths are (1, 2, 2), the ratios are (2, 2, 3) and the median scale is 2), then supervise predictions with our ground-truth pairs.
- Why this matters: Single-view models become more stable under camera motion and predict true sizes. Anchor: One camera video can now place the person correctly on the floor, not floating or sinking, even as the camera walks around.
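The same scale estimate as a minimal sketch (toy values matching the example above):

```python
# A minimal sketch: median depth-ratio scale between metric and predicted depth.
import numpy as np

metric_depth = np.array([2.0, 4.0, 6.0])     # depths from the calibrated capture
predicted_depth = np.array([1.0, 2.0, 2.0])  # depths from the monocular model (unknown scale)

scale = np.median(metric_depth / predicted_depth)
print(scale)  # ratios are (2, 2, 3), so the median scale is 2
```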
04 Experiments & Results
The test: What did they measure and why?
- Geometric accuracy: How close are reconstructed people to ground truth (studio optical mocap fitted to SMPL)? This checks if dual-view really beats single-view and monocular.
- Alignment to scene: Are people actually where the scene says they should be (e.g., touch a table where the table is)? This tests world-frame consistency.
- Monocular finetuning: Do feedforward models get better on long, real sequences when trained with our paired data?
- Physics-based skills: Can policies trained on our motions succeed at interaction tasks (follow, climb, sit, lie, prone, support) with realism and diversity?
- Sim-to-real robot control: Can a humanoid robot copy motions (including hand-ground contacts) learned from our reconstructions?
The competition: Who/what did they compare against?
- Monocular model GVHMR for motion recovery.
- Single-view optimization variants (use only one of the two iPhones).
- Optical motion capture ground truth in a studio (gold standard).
The scoreboard (with context):
- Optical studio comparison over 9,420 frames: dual-view iPhone method achieved much lower motion error (e.g., W-MPJPE around 72.86 mm) than the monocular model (about 123.44 mm). Think of it as going from a solid B- to an A grade on precise joint placement.
- Root translation errors grew slowly with longer clips for dual-view but much faster for single-view/monocular. Over longer chunks (500–1000 frames), dual-view pulled even further ahead, showing it's more stable when videos get long.
- Scene alignment: dual-view localized people to within about 5 cm in the scene (e.g., hand-to-table contact matched), while single-view could be off by over 30 cm, like missing the edge of a seat.
Ablations that make numbers meaningful:
- Tracking loss (tie both views together) was critical: removing it caused big drops in mask overlap and higher keypoint errors.
- 3D keypoint loss (triangulated joints) helped crush depth ambiguity better than 2D-only reprojection.
- Chamfer distance kept local reconstructions glued to the global mesh; without it, backgrounds drifted.
- Smoothness reduced jitter (foot skating), important for physical plausibility.
Monocular finetuning results:
- Using our paired RGB-D + cameras + SMPL annotations to finetune models led to improved errors on EMDB long sequences (e.g., WA-MPJPE and W-MPJPE both decreased; root trajectory error improved). Translation: single-view models placed people better in the real world after training on our data.
Physics-based character animation:
- Common skills (follow, climb, sit) reached near-100% success across data sources, but our data provided strong diversity (APD), which is helpful for robust policies.
- Harder skills (lie, prone, support) showed clear gains for our data over monocular estimates, especially "support," where hands must bear weight while feet remain close: monocular-trained policies succeeded far less often (e.g., ~20%), while ours were much higher (e.g., ~66%). That's like being able to do a stable push on a table vs. wobbling and failing.
Scene-aware motion tracking:
- Policies trained per scene on our captured clips achieved high success rates when tracking long sequences with realistic contact and fewer artifacts (less interpenetration, less floating). This suggests our data is "simulation-ready."
Real-world robot control:
- A compact humanoid robot (21 DoF, 80 cm tall) could imitate motions, including contact-rich moves like cartwheels, after sim-to-real training on our reconstructions (BeyondMimic). That's a big sign the data preserves the contact cues robots need.
Surprising findings:
- Two moving phones, when carefully calibrated to a prebuilt metric scene, nearly close the gap to much pricier capture setups for many tasks.
- The dual-view approach not only fixes occlusions but also sharply reduces depth error in the camera's forward direction, which is single-view's Achilles' heel.
- A small amount of optional human input (e.g., marker contacts at start/end frames) can squeeze out the last centimeters of alignment when needed, but the pipeline already performs well without heavy manual work.
05 Discussion & Limitations
Limitations (honest and specific):
- iPhone LiDAR range: Depth becomes unreliable beyond about 5 meters, so very large rooms or outdoor fields at a distance are hard to capture.
- Moving-scene challenge: If lots of objects move (crowded areas, traffic), SLAM and matching can fail or wobble, hurting alignment.
- Harsh lighting: Extremely bright or low-texture surfaces can break COLMAP registration, making initial alignment tricky.
- Dual-operator logistics: Filming with two people at a good viewing angle (ideally 60–120° between views) is easy but still needs coordination.
Required resources:
- Two iPhones with RGB-D (LiDAR-capable preferred), a simple laser pointer for sync, and a laptop/desktop with standard vision tools (SLAM SDK, COLMAP, depth refinement, keypoint/mask detectors).
- Some compute time for optimization (multi-view tracks, Chamfer alignment, SMPL fitting). A modern GPU helps but isn't strictly required for every step.
When NOT to use this:
- Very large open spaces where subjects stay >5 m from both phones most of the time; depth will be too sparse.
- Scenes full of fast-moving crowds or flickering lights; feature tracking and SLAM can become unstable.
- Situations demanding millimeter-level ground-truth absolute accuracy (e.g., medical-grade motion capture); use optical mocap instead.
Open questions:
- Can we further automate synchronization and reduce human input (e.g., phone-to-phone timecode or audio clap syncing)?
- How robust can we make alignment under extreme lighting or textureless environments, and would integrating H-Loc or learned features push reliability higher?
- Could a third casual camera (still handheld) offer big gains, or do returns diminish after two views?
- How well does this scale to groups of people interacting? What about dynamic scenes where large objects move (e.g., doors, carts)?
- Can we learn better priors from the captured data itself to speed up optimization and reduce jitter even more?
06 Conclusion & Future Work
Three-sentence summary:
- EmbodMocap shows that two moving iPhones, smartly calibrated to a metric scene, can capture accurate 4D humans and scenes in everyday places without suits, markers, or fixed rigs.
- Its dual-view design removes depth ambiguity and ties human motion to real geometry, boosting monocular reconstruction, physics-based skills, and even real robot control.
- This portable, low-cost pipeline lowers the barrier to embodied AI, enabling large-scale, real-world data collection that respects size, gravity, and contact.
Main achievement:
- Seamlessly unifying dual RGB-D videos and a prebuilt metric scene into one world frame, then refining human motion within it, delivers studio-like alignment from consumer devices.
Future directions:
- Add more robust localization (e.g., H-Loc) for tough lighting; integrate auto-sync apps; explore tri-camera capture; extend to multi-person; and learn stronger motion/scene priors from the growing dataset.
Why remember this:
- It flips the script from "studios or bust" to "bring two phones," unlocking scalable, authentic, metric 4D data for perception, animation, and robots. With EmbodMocap, the real world becomes your motion lab.
Practical Applications
- Training home robots to navigate, sit, pick, place, and support their weight on furniture safely.
- Creating realistic AR avatars that respect room geometry and contact surfaces in real time.
- Improving sports coaching by capturing true-scale athlete motions on fields without studio gear.
- Building lifelike game characters that move and interact physically with game environments.
- Pre-visualizing film scenes with accurate human-scene contact from quick location scans.
- Rehabilitation monitoring with precise, context-aware motion capture in clinics or at home.
- Workplace safety analysis by studying how people move around machines and obstacles on-site.
- Cultural heritage documentation of human interactions within historic spaces (e.g., rituals, crafts).
- Education labs where students collect and analyze motion-and-scene data using just phones.
- Smart home testing: evaluate how users interact with new furniture or layouts before deployment.