ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors
Key Summary
- ArtHOI is a new zero-shot method that makes people and everyday articulated objects (like doors, drawers, and fridges) move together realistically using only a single generated video as guidance.
- Instead of guessing motion all at once, ArtHOI first rebuilds how the object's parts move (the hinge or slide) and then adjusts the human's motion to match that object motion.
- It uses optical flow (how pixels move between frames) to split the object into moving parts (like a door) and still parts (like a frame).
- By treating the video as supervision for a 4D reconstruction, the system creates time-coherent 3D scenes that respect contacts and avoid hand-object penetrations.
- A special two-stage pipeline avoids the confusion of monocular video by first locking in the object's articulation, then refining the human's movement around it.
- Across tasks like opening fridges, cabinets, or microwaves, ArtHOI shows higher contact consistency, fewer penetrations, and more accurate articulation than prior methods.
- It generalizes without needing 3D training data, making it far more practical for VR/AR content creation and robotics training.
- User studies strongly prefer ArtHOI's realism and contact quality over leading baselines.
- It also works well for rigid objects, not just articulated ones, showing broad applicability.
- Overall, ArtHOI bridges video generation and geometry-aware reconstruction to produce physically grounded human-object interactions.
Why This Research Matters
Robots need to open and close everyday articulated objects safely, and training them without expensive 3D capture is a huge win. Game and VR creators want realistic, contact-rich interactions without hand-animating every scene, and ArtHOI gives them that from just a text prompt. In education and simulation, physically plausible interactions reduce bad habits like hand-through-door ghosting. Because ArtHOI is zero-shot, it scales easily to many objects and scenes without custom datasets. This can speed up prototyping and content creation while improving realism and safety. It also helps research on action understanding by producing clean, labeled interactions. In short, better, cheaper, more realistic motion with fewer headaches.
Detailed Explanation
01 Background & Problem Definition
You know how when you watch a cooking video, you can tell which parts of the fridge door are fixed and which part swings open? Your eyes use motion to figure out structure. Computers want to do that too, especially when making people and objects move together in 3D for games, VR, and robots.
Before this research: AI could make people move, and it could place 3D objects, but making a person correctly open an articulated object (like a door on hinges) was hard without special 3D recordings. Many systems handled only rigid objects as if the entire thing moved together. That means they could push a box but not open a cabinet door realistically. Methods that did look good usually needed expensive motion-capture and multi-camera setups, which don't scale.
The problem: If you only have a single video (monocular) to learn from, it's tricky to tell what's moving: the person, the object's door, the camera, or all three. If you try to optimize both the human motion and the object's articulation at the same time from that 2D video, the learning signals fight each other. This causes unstable training and messy results (like hands sliding through doors or doors stretching apart).
People tried: 1) End-to-end generation from video diffusion models (VDMs): good at making pretty videos, but they often treat complex objects like simple rigid pieces and don't reconstruct proper 4D geometry, so contacts and hinges aren't reliable. 2) 4D reconstructions from a single video: great for rigid scenes, but they weren't designed for part-wise motion (hinges, sliders). 3) Articulated object methods: often need known templates, categories, or multiple views; and they usually ignore how humans push or pull parts, missing strong clues from contact.
The gap: We need a way to turn a single video (even a generated one) into a physically meaningful 4D scene where articulated parts are identified and moved correctly, and the human motion coordinates with those parts, without any 3D ground truth.
The stakes:
- For robots, learning to open a fridge or a cabinet safely and efficiently is a big deal (think kitchens, warehouses, elder care).
- For VR/AR and games, creators want believable interactions without hand-animating every frame.
- For safety and training simulations, you want hands to actually grasp handles and not pass through them.
Doing all of this with zero-shot (no 3D supervision) means it's far cheaper and more flexible.
Here's the twist this paper proposes: Treat the 2D video (from a text prompt via a VDM) as supervision for a 4D reconstruction process. First, reconstruct how the object articulates (what moves, what stays) using motion cues from optical flow; then, fit the human's movement to that reconstructed object so hands meet handles and feet don't slide unrealistically. This decoupling mirrors how we humans reason: we first figure out how the door works, then we move our arm to match.
02 Core Idea
🍞 Top Bread (Hook): Imagine building a Lego door first (getting the hinge right), and only after it swings correctly do you place a minifigure's arm to grab and open it smoothly.
🥬 The Concept: The key idea is to turn video-based interaction synthesis into a two-step 4D reconstruction: first reconstruct the object's articulation from motion, then refine the human motion to match that object.
- What it is (one sentence): ArtHOI reconstructs a time-varying 3D scene (4D) from a single video by first solving the object's part motions and then aligning the human to those motions.
- How it works (recipe): 1) Generate a short video from a text prompt with a video diffusion model. 2) Use optical flow to separate moving object parts (like a door) from static parts (like the frame). 3) Recover the object's articulation (hinge/slide) with kinematic constraints. 4) With the object locked in, adjust the human's 3D motion so hands meet handles and avoid collisions. 5) Render the final 4D scene.
- Why it matters: Without separating object articulation first, the system confuses what's moving (hand vs. door), producing unstable, unrealistic interactions.
🍞 Bottom Bread (Anchor): Think: first confirm the cabinet door truly swings on its hinges; then have the person's hand reach, grasp, and pull it open, with no ghost hands or wobbly doors.
Explained three ways:
- Engineering analogy: First calibrate the machine's moving parts, then program the operator's motions to match them, rather than guessing both at once.
- Detective analogy: First figure out which clues (pixels) belong to the moving door vs. the fixed frame using motion patterns, then reconstruct the human's actions to fit the facts.
- Sports analogy: First set the hoop's position and angle; then practice your shooting form to that fixed target.
Before vs. after:
- Before: End-to-end generation mixed human and object motion guesses, often causing hand-object mismatch and broken hinges.
- After: Decoupled reconstruction yields solid object articulation first, then human motion that makes contact at the right time and place.
Why it works (intuition, no equations): Motion (optical flow) is a strong signal for what part belongs to a moving segment; hinges create predictable motion near boundaries (quasi-static points). Locking down that structure gives the human optimizer reliable 3D contact targets, removing monocular confusion.
Building blocks (each introduced with a mini sandwich):
🍞 You know how leaves that move in the wind show you which branches they're attached to? 🥬 Optical Flow: It measures how image pixels move between frames. How: track points and compute their displacement across frames; larger motion often means the part moves. Why it matters: It reveals which parts are dynamic (door) versus static (frame). 🍞 Example: A door panel's pixels shift a lot; the frame's pixels hardly move.
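To make this concrete, here is a minimal sketch (with invented point names and a 5 px threshold chosen purely for illustration) of labeling tracked points by how far their flow carries them:

```python
import math

def classify_by_flow(tracks, threshold_px=5.0):
    """Label each tracked point as dynamic or static by its net 2D motion.

    tracks maps a point name to its (x, y) pixel position per frame;
    the 5 px threshold is an illustrative choice, not the paper's value.
    """
    labels = {}
    for name, pts in tracks.items():
        (x0, y0), (x1, y1) = pts[0], pts[-1]
        disp = math.hypot(x1 - x0, y1 - y0)  # net displacement, first to last frame
        labels[name] = "dynamic" if disp > threshold_px else "static"
    return labels

# A door-panel point sweeps ~12 px while a frame point barely moves.
tracks = {
    "panel_corner": [(100, 80), (106, 80), (112, 81)],
    "frame_edge": [(40, 80), (40, 80), (41, 80)],
}
labels = classify_by_flow(tracks)
```

The same idea scales to thousands of tracked points, which is what makes flow such a cheap structural cue.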
🍞 Picture reading a comic strip: you see each panel and remember what changed. 🥬 Monocular Video Priors: Use a single video as a guide for 3D reconstruction. How: treat frames as supervision; no extra 3D labels needed. Why: It unlocks zero-shot learning from just a generated (or real) video. 🍞 Example: A 5-second clip of "open the fridge" guides the whole 3D motion.
🍞 Think of a door hinge that can't stretch or detach. 🥬 Kinematic Constraints: Rules that enforce how parts can move (e.g., rigid parts, rotation around a hinge). How: penalize breaking distances near the hinge and sudden jerks. Why: Without them, parts drift or distort. 🍞 Example: The door edge stays the same distance from the hinge pins as it swings.
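A sketch of what such a distance-preservation penalty could look like; the paper's exact loss may be weighted or normalized differently:

```python
import math

def articulation_penalty(moving_pts, static_pts, rest_dists):
    """Penalize any change in the distance between bound point pairs.

    moving_pts[i] on the door is bound to static_pts[i] on the frame with
    rest distance rest_dists[i]; squared deviation is a common choice,
    though the actual weighting in the paper may differ.
    """
    total = 0.0
    for m, s, d0 in zip(moving_pts, static_pts, rest_dists):
        total += (math.dist(m, s) - d0) ** 2
    return total

# A door-edge point bound 0.02 m from a frame point: near-zero penalty while
# the gap holds, a positive one if the panel drifts away from the hinge.
frame_pts = [(0.48, 0.00, 0.90)]
ok = articulation_penalty([(0.50, 0.00, 0.90)], frame_pts, [0.02])
drifted = articulation_penalty([(0.55, 0.00, 0.90)], frame_pts, [0.02])
```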
🍞 Like a flipbook that is 3D and changes over time. 🥬 4D Reconstruction: Build a 3D scene for every moment in time. How: optimize shapes and poses per frame so renders match the video. Why: You get consistent geometry and motion, not just pretty pixels. 🍞 Example: A 3D fridge and a 3D human moving correctly over 60 frames.
🍞 Do the puzzle border first, then the inside. 🥬 Decoupled Reconstruction Pipeline: Two stages: first object articulation, then human motion. How: solve the object with motion cues and constraints; freeze it; refine the human to match. Why: It avoids the tug-of-war of joint optimization from monocular video. 🍞 Example: Nail the cabinet's hinge motion; then place the hand's reach and pull.
🍞 If a crowd moves and one person stands still, you can spot groups. 🥬 Flow-based Part Segmentation: Split the object into moving vs. static regions using flow and SAM masks, then map to 3D. How: track points, classify by motion, prompt SAM for dense masks, back-project to 3D Gaussians. Why: It reveals the door panel vs. frame. 🍞 Example: The panel cluster shows big flow; the frame cluster shows tiny flow.
🍞 When sculpting with many soft clay blobs. 🥬 3D Gaussian Splatting: Represent shapes as many fuzzy 3D blobs to render and optimize quickly. How: each blob has position, size, color; projecting them renders the scene. Why: It's fast and differentiable for learning from video. 🍞 Example: Hundreds of blobs form a door; moving some blobs swings the door.
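The geometric core of that projection step can be sketched as below; a real splatting renderer also projects each blob's covariance footprint and alpha-blends colors, and the camera intrinsics here are made up:

```python
def project_centers(points, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of Gaussian centers to pixel coordinates.

    Only the geometric core of splatting; focal length and principal
    point (cx, cy) are illustrative values.
    """
    return [(focal * x / z + cx, focal * y / z + cy) for x, y, z in points]

# A blob on a door 1 m away: shifting it 0.1 m right in 3D moves its
# projected center 50 px right at this focal length.
before = project_centers([(0.0, 0.0, 1.0)])[0]
after = project_centers([(0.1, 0.0, 1.0)])[0]
```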
🍞 Like tracing a shadow to find where a hand touches a surface. 🥬 Inverse Rendering: Adjust 3D scene so its render matches the video. How: compare renders to frames; tweak parameters to reduce the difference. Why: It turns 2D evidence into 3D motion. 🍞 Example: If the render shows the hand behind the door but video shows it in front, move the hand forward.
🍞 A posable action figure with defined joints. 🥬 SMPL-X Body Model: A standard way to represent a 3D human with joints and shape. How: change pose parameters to move hands/feet realistically. Why: It keeps human motion natural while fitting contact. 🍞 Example: Bend the elbow and open the fingers to reach a handle.
🍞 A toolbox that includes rotate-and-translate moves. 🥬 SE(3) Transform: The math for 3D rotations and translations. How: apply a rotation and a shift to move a rigid part each frame. Why: It moves doors and drawers like real-world parts. 🍞 Example: Rotate the door 30° around its hinge and shift 0.01 m if needed.
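A minimal worked example of such a move: rotate a handle point about a vertical hinge axis and check that the rigid motion preserves its distance to the hinge (the translation part is zero here):

```python
import math

def hinge_rotate(point, hinge_xz, angle_deg):
    """Apply a rigid SE(3) move: rotate a 3D point about a vertical hinge
    line through (hinge_x, *, hinge_z); the translation here is zero."""
    a = math.radians(angle_deg)
    x = point[0] - hinge_xz[0]
    z = point[2] - hinge_xz[1]
    xr = math.cos(a) * x + math.sin(a) * z
    zr = -math.sin(a) * x + math.cos(a) * z
    return (xr + hinge_xz[0], point[1], zr + hinge_xz[1])

# Swing a handle point 0.8 m from the hinge by 30 degrees: it moves, but
# its distance to the hinge axis is preserved (rigid rotation).
handle = (0.8, 1.0, 0.0)
swung = hinge_rotate(handle, (0.0, 0.0), 30.0)
radius = math.hypot(swung[0], swung[2])
```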
🍞 A loose hinge screw needs tightening to keep the door aligned. 🥬 Quasi-static Binding: Pair almost-still points near the hinge to keep moving and static parts tightly linked. How: detect low-motion points on the moving part and bind them to nearby static points. Why: Prevents gaps and drifting at the joint. 🍞 Example: The door edge stays snug to the frame line while swinging.
🍞 Tag a friend's hand on a photo and find the closest door pixel to mark contact. 🥬 Contact Loss: Pull hand joints toward nearby object surface points when contact is likely, and penalize penetrations. How: detect overlap regions in 2D, lift to 3D using object depth, then guide the hand there. Why: Ensures grasp and push look real. 🍞 Example: The fingertip is steered to the handle's 3D spot as the door opens.
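A toy version of the two ingredients, attraction to a lifted 3D target plus a penetration check against a door plane (the plane position, loss forms, and coordinates are all illustrative):

```python
def contact_terms(joint, target, door_x=0.60):
    """Toy contact loss for one hand joint: squared distance to the lifted
    3D target, plus the depth by which the joint crosses a door plane at
    x = door_x. Plane and constants are illustrative assumptions."""
    attract = sum((a - b) ** 2 for a, b in zip(joint, target))
    penetrate = max(0.0, joint[0] - door_x)
    return attract, penetrate

target = (0.58, 0.98, 0.85)  # handle point lifted to 3D
reaching = contact_terms((0.54, 0.98, 0.85), target)  # short of the handle
ghosting = contact_terms((0.70, 0.98, 0.85), target)  # hand inside the door
```

The reaching hand pays only a small attraction cost, while the ghosting hand pays both a larger attraction cost and a penetration cost, which is exactly the asymmetry the optimizer exploits.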
03 Methodology
At a high level: Text prompt → Video diffusion model makes a short video → Stage I: reconstruct the object's articulation (what moves, what doesn't) → Stage II: refine human pose to touch the right spots at the right times → Output a 4D, physically plausible interaction.
Key math we'll reference (with simple numbers after each):
- Video as frames: V = {I_1, ..., I_T}. Example: if T = 3, then I_1, I_2, I_3 are 3 frames.
- Object transform per frame: T_t = (R_t, τ_t), an SE(3) rotation plus translation. Example: rotate by 30° around the hinge axis and translate 0.01 meters.
- Gaussian position update: μ_t = R_t · μ_0 + τ_t. Example: a door Gaussian at μ_0, rotated by R_t and shifted by τ_t, lands at μ_t.
- Stage I loss (object): L_obj = L_rec + L_track + L_art + L_smooth, with each term optionally weighted. Example: with equal weights of 1 and component values 0.20, 0.10, 0.05, and 0.05, L_obj = 0.40.
- Contact penalty: L_pen = Σ_i max(0, τ − d_i), with margin τ and d_i the distance from hand point i to the object surface. Example: with τ = 0.01 m and distances 0.004 m and 0.008 m, the terms are 0.006 and 0.002, sum 0.008.
- Human-to-contact keypoints: L_contact = Σ_j ||J_j − c_j||², pulling each joint J_j to its 3D target c_j. Example: if one fingertip is 0.05 m from its target, its contribution is 0.0025.
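The penalty arithmetic can be checked in a few lines; the closed forms used here are plausible reconstructions for illustration, not necessarily the paper's exact definitions:

```python
def contact_penalty(dists, tau=0.01):
    """Sum of max(0, tau - d) over hand points: points closer to the
    surface than the margin tau are penalized. Form and margin are
    illustrative reconstructions."""
    return sum(max(0.0, tau - d) for d in dists)

def keypoint_loss(dists):
    """Sum of squared joint-to-target distances."""
    return sum(d ** 2 for d in dists)

pen = contact_penalty([0.004, 0.008])  # terms 0.006 and 0.002
kp = keypoint_loss([0.05])             # one fingertip 5 cm off target
```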
Stage I: Object articulation from video motion
- Flow-based part discovery (Flow + SAM + back-projection)
- What happens: Track many 2D points across frames (optical flow). Classify points with big motion as dynamic (panel) and tiny motion as static (frame). Use those points to prompt a segmentation model (SAM) for accurate masks. Map masks to 3D Gaussians by assigning pixels to nearby projected Gaussians.
- Why this step: Without clean part labels, the system can't learn where the hinge is or which blobs should move.
- Example: Door panel pixels move ~12 px; frame pixels move ~1 px. Points moving more than 5 px → panel (dynamic); 5 px or less → frame (static).
- Quasi-static binding near the hinge
- What happens: Among dynamic points, find "almost-still" ones near the hinge (low motion percentile). Pair each such moving-side Gaussian with a nearby static Gaussian to preserve distances.
- Why: Prevents the door panel from drifting away from the frame at the joint.
- Example: A dynamic Gaussian edge point at (0.50, 0.00, 0.90) m binds to a static point at (0.48, 0.00, 0.90) m; their distance stays ~0.02 m across frames.
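The binding step above can be sketched as follows, with an illustrative 25% motion-percentile cutoff:

```python
import math

def bind_quasi_static(dynamic_pts, flow_px, static_pts, frac=0.25):
    """Bind the least-moving fraction of dynamic points to their nearest
    static neighbors; the 25% cutoff is an illustrative percentile.

    Returns (dynamic index, static index, rest distance) triples whose
    distances a later articulation loss would hold constant.
    """
    order = sorted(range(len(dynamic_pts)), key=lambda i: flow_px[i])
    keep = order[: max(1, int(len(order) * frac))]
    bindings = []
    for i in keep:
        j = min(range(len(static_pts)),
                key=lambda k: math.dist(dynamic_pts[i], static_pts[k]))
        bindings.append((i, j, math.dist(dynamic_pts[i], static_pts[j])))
    return bindings

# The near-hinge door point (tiny flow) binds to the closest frame point.
door = [(0.50, 0.00, 0.90), (0.90, 0.00, 0.90)]  # hinge edge, handle edge
flow = [0.8, 12.0]                                # observed flow magnitudes, px
frame = [(0.48, 0.00, 0.90), (0.10, 0.00, 0.90)]
bindings = bind_quasi_static(door, flow, frame)
```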
- Articulation optimization with kinematic constraints
- What happens: Optimize the per-frame object transforms so renders match the video (reconstruction), dynamic Gaussians follow tracker positions (tracking), distances between bound pairs stay constant (articulation), and motion is smooth (smoothness).
- Why: Each loss prevents a specific failure: the reconstruction loss avoids visual mismatch, the tracking loss ties motion to observed flow, the articulation loss keeps the hinge tight, and the smoothness loss stops jitter.
- Example: If tracked 2D corner moves right by 15 px, the optimizer rotates the door so its projection also shifts ~15 px.
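The toy below mimics that behavior with a one-parameter search: find the hinge angle whose projected corner shift matches the tracked 15 px. An orthographic camera and bisection stand in for the real differentiable renderer and gradient descent, and the radius and pixel scale are invented:

```python
import math

def corner_shift_px(angle_deg, radius_m=0.5, px_per_m=100.0):
    """Horizontal pixel shift of a door corner at radius_m from the hinge
    after rotating by angle_deg (toy orthographic camera)."""
    return px_per_m * radius_m * math.sin(math.radians(angle_deg))

def fit_angle(target_px, lo=0.0, hi=90.0, iters=60):
    """Bisect on the hinge angle until the rendered shift matches the
    tracked one; a stand-in for the gradient-based optimizer."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if corner_shift_px(mid) < target_px else (lo, mid)
    return 0.5 * (lo + hi)

angle = fit_angle(15.0)         # the tracker saw a 15 px shift
shift = corner_shift_px(angle)  # the fitted door reproduces it
```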
Secret sauce in Stage I: Using motion (flow) as the main geometric cue to separate parts, then binding near-hinge Gaussians to lock the articulation. This leverages what monocular video is best at (2D motion) without needing 3D labels.
Stage II: Human motion refinement on a fixed articulated object
- Find likely contact frames and regions
- What happens: Detect frames where the object's transform changes noticeably (the door is moving). Compute the overlap of human masks with the object silhouette; where the human appears in front of the object, mark as potential contact.
- Why: With just one camera, this is a solid hint that the hand is touching or closely approaching the surface.
- Example: When opening begins, a cluster around the right hand overlaps the handle area.
- Lift 2D contact hints to 3D targets
- What happens: For each 2D joint in the contact region, pick the nearest dynamic object Gaussians in image space, choose the closest-in-depth Gaussian, and use its 3D position (slightly offset toward the camera) as the contact target.
- Why: This creates concrete 3D goals for fingertips/wrists to reach, solving monocular ambiguity.
- Example: Fingertip at (u=420,v=300) maps to a handle Gaussian at 3D (0.62, 0.98, 0.85) m.
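The lifting rule can be sketched like this, with the neighborhood size k and the camera-ward offset as illustrative choices:

```python
import math

def lift_contact(joint_uv, gaussians, k=2, toward_cam=0.01):
    """Lift a 2D joint to a 3D contact target: take the k Gaussians nearest
    in image space, keep the one closest in depth, and nudge its center
    toward the camera. k and the offset are illustrative choices.

    gaussians: list of ((u, v), (x, y, z)) for dynamic object Gaussians.
    """
    nearest = sorted(gaussians, key=lambda g: math.dist(joint_uv, g[0]))[:k]
    _, (x, y, z) = min(nearest, key=lambda g: g[1][2])  # smallest depth wins
    return (x, y, z - toward_cam)

# Fingertip at (420, 300): the handle Gaussian is nearby in the image and
# sits in front of the panel Gaussian behind it, so it wins.
gaussians = [
    ((422, 301), (0.62, 0.98, 0.85)),  # handle
    ((425, 305), (0.60, 0.95, 1.10)),  # door panel, farther back
    ((100, 100), (0.10, 0.10, 0.50)),  # unrelated blob, far away in the image
]
target = lift_contact((420, 300), gaussians)
```

Note how the image-space filter runs first: the unrelated blob is nearer to the camera but never considered, because it is nowhere near the fingertip in 2D.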
- Optimize SMPL-X pose to match contacts, stay natural, and avoid collisions
- What happens: Minimize a sum of losses: human reconstruction (match video/mask), contact (joints to targets), priors (stay near initial VDM pose), foot sliding penalty (keep foot planted during contact), smoothness, and collision penalty to avoid penetrating the object.
- Why: Balances realism (match video) with physics (contacts, no penetration) and natural motion (priors, smoothness).
- Example: If the wrist is 4 cm behind the handle, contact loss pulls it forward while collision loss prevents it from cutting through the door.
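Here is a much-reduced sketch of how such a weighted sum trades the terms off, keeping only the contact and collision pieces (the weights, door plane, and coordinates are invented for illustration):

```python
def stage2_loss(wrist, target, door_x=0.60, w_contact=1.0, w_collide=10.0):
    """Two of the Stage II terms for one wrist: contact attraction and a
    collision penalty past a door plane at x = door_x. The weights, the
    plane, and the omission of the reconstruction, prior, foot, and
    smoothness terms are all simplifications for illustration."""
    contact = sum((a - b) ** 2 for a, b in zip(wrist, target))
    collide = max(0.0, wrist[0] - door_x) ** 2
    return w_contact * contact + w_collide * collide

target = (0.58, 0.98, 0.85)                       # handle contact target
short = stage2_loss((0.54, 0.98, 0.85), target)   # 4 cm short of the handle
inside = stage2_loss((0.66, 0.98, 0.85), target)  # overshoots into the door
grasp = stage2_loss(target, target)               # at the handle, no collision
```

The minimum sits at the handle itself: undershooting costs a little contact loss, while cutting through the door costs far more, which is what steers the wrist forward without letting it pass through.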
Secret sauce in Stage II: Because the object articulation is already stable, the human optimizer gets unambiguous 3D contact targets and doesn't fight with changing object geometry. This breaks the monocular deadlock that plagued joint optimization.
Putting it together
- Input: a short video (often generated from a text prompt like "open the fridge").
- Output: a 4D scene where the door swings with correct articulation and the human hand reaches, grasps, and pulls naturally, with feet planted and minimal penetration.
- Why it's robust: Motion cues solve part discovery; hinge bindings stabilize articulation; fixed object geometry gives the human solver clean contact targets.
04 Experiments & Results
The test: Do we get realistic, physically plausible interactions where articulated parts move correctly and hands make proper contact? And do we achieve this better than previous methods that either need 3D data or assume rigid objects?
What they measured and why:
- X-CLIP Score (text-motion match): Higher means the generated sequence better aligns with a prompt like "open the fridge."
- Motion Smoothness: Lower variation in joint speeds means steadier, more natural motion (interpreted carefully; too-low can also mean not interacting!).
- Foot Sliding: Lower is better; planted feet should stay put on the ground.
- Contact%: Higher means hands contact objects more consistently across frames.
- Penetration%: Lower is better; hands should not pass through doors and drawers.
- Rotation errors (for object articulation): Lower angle errors mean the recovered hinge/slide motion matches ground truth much better.
The competition:
- TRUMANS, LINGO, CHOIS: strong baselines but not zero-shot and/or not designed for articulated parts.
- ZeroHSI: a zero-shot 4D method that mainly treats objects as rigid.
- For articulation accuracy: D3D-HOI and 3DADN (monocular articulated object methods).
The scoreboard (with context):
- X-CLIP: ArtHOI gets 0.244 (best), meaning it follows a prompt like "open the cabinet" more faithfully.
- Foot Sliding: ArtHOI scores 0.31 (lowest); like getting the best balance so feet don't skate around.
- Contact%: 75.64% (highest), like keeping a firm handle grasp for 3/4 of the sequence.
- Penetration%: 0.08 (lowest), which is like almost no ghosting through solid objects.
- Motion Smoothness: 0.87, competitive and meaningful among methods that actually interact (note: some non-zero-shot baselines look smoother partly because they barely touch the object; real contact adds natural motion variation).
Articulation recovery (monocular):
- Mean rotation error: 6.71° vs. 21–25° for baselines. That's like going from a C to an A on how accurately the door swings.
- Max/min rotation error: Best overall, showing both fewer big mistakes and tighter fits.
Rigid object generalization:
- Even when objects are rigid, ArtHOI still wins: best Foot Sliding (0.28), best Contact% (76.18%), and lowest Penetration% (0.06). So the method's principles help across the board.
User study (51 participants):
- Overall preference over TRUMANS: 98.04%; CHOIS: 95.28%; LINGO: 91.51%; ZeroHSI: 89.42%.
- Especially strong in Contact Quality and Motion Smoothness versus most baselines.
Surprising findings:
- Two-stage decoupling is the game-changer. When they tried joint optimization, performance dropped a lot (e.g., Contact% from 75.64% to 61.45%). This confirms that "object-first, human-second" resolves monocular ambiguity.
- The hinge-binding constraint (quasi-static binding) massively stabilizes articulation. Removing it more than doubled articulation error (mean from 6.71° to 15.67°), showing the power of near-hinge constraints.
- Contact loss is essential. Without it, hands miss handles more often (Contact% drops to 59.82%), even though the object articulation itself stays accurate. This means human realism hinges on good 3D contact targets.
Bottom line: ArtHOI consistently produces more believable interactions that match the prompt, keep stable feet and hands, and avoid the classic ghost-through-door problem, thanks to flow-driven part discovery and the two-stage reconstruction strategy.
05 Discussion & Limitations
Limitations:
- Flow tracking in low-texture or shiny regions can fail, confusing which parts move and causing hinge estimates to drift.
- Complex mechanisms with multiple degrees of freedom (like fold-and-slide doors) or soft joints are harder because the current model expects one main rigid articulation.
- Very long videos can accumulate small errors over time, gradually harming physical plausibility.
- Assumes a fixed camera; camera motion adds ambiguity between ego-motion and object articulation.
Required resources:
- A GPU with enough memory to render and optimize 3D Gaussians (e.g., ~48 GB for the reported setup), a video diffusion prior (e.g., KLing), point tracking (e.g., CoTracker), and SAM for masks.
When not to use:
- Highly reflective, textureless scenes where optical flow is unreliable.
- Multi-DOF linkages or deformable objects where a single SE(3) per moving part is insufficient.
- Strong camera motion without stabilization or ego-motion compensation.
Open questions:
- How to robustly handle multi-part, multi-DOF mechanisms with shared or time-varying axes?
- Can we self-calibrate moving cameras and still separate ego-motion from articulation reliably?
- Could physics simulators be softly integrated to anticipate contact forces and improve long-horizon stability?
- How to make contact inference more robust without over-relying on 2D overlaps (e.g., using learned depth priors or tactile-like reasoning)?
- Can we jointly learn category-agnostic part discovery so that even novel appliances are handled with minimal assumptions?
06 Conclusion & Future Work
Three-sentence summary: ArtHOI turns the hard problem of making people and articulated objects interact realistically, using only a single video, into a two-step 4D reconstruction: first the object's articulation, then the human's motion. By using optical flow for part discovery, hinge-binding constraints for stability, and contact-guided human refinement, it creates physically plausible, time-coherent interactions. This approach beats baselines in contact accuracy, penetration avoidance, and articulation fidelity across everyday tasks like opening fridges and cabinets.
Main achievement: Showing that decoupled, reconstruction-first reasoning from monocular video priors can deliver zero-shot, geometry-accurate, and physically grounded human-object interactions beyond rigid manipulation.
Future directions:
- Extend to multi-part, multi-DOF mechanisms and deformables, possibly with learned articulation priors.
- Handle moving cameras via ego-motion estimation and joint optimization over camera, object, and human.
- Add lightweight physics terms (e.g., force-aware contacts) to boost long-horizon realism.
- Improve robustness of flow and contact inference in low-texture or reflective environments.
Why remember this: The clever "object-first, human-second" idea turns monocular ambiguity from a blocker into a blueprint: use motion to reveal structure and structure to guide contact, opening the door (pun intended) to scalable, realistic interaction synthesis for robots, games, and AR/VR.
Practical Applications
- Robot training data generation for opening/closing doors, drawers, fridges, and cabinets without motion-capture labs.
- VR/AR scene authoring: quickly create believable human-object interactions from a text prompt.
- Game development: auto-generate contact-accurate animations for props with hinges and sliders.
- Human factors and ergonomics simulations that require correct reach, grasp, and door mechanics.
- Virtual assistants and embodied agents that need to practice manipulation policies safely in simulation.
- Previsualization for film/TV to test how an actor might interact with sets or props.
- Interactive design reviews for appliances and furniture, validating handle placement and movement arcs.
- Education demos illustrating articulation (hinges, sliders) and safe manipulation using visual examples.
- Data augmentation for action recognition and HOI detection with physically grounded samples.
- Robotics benchmarking: standardized, zero-shot interaction tasks to compare manipulation policies.