
Contact-Anchored Policies: Contact Conditioning Creates Strong Robot Utility Models

Beginner
Zichen Jeff Cui, Omar Rayyan, Haritheja Etukuru et al. · 2/9/2026
arXiv

Key Summary

  • Robots often get confused by wordy instructions, so this paper tells them exactly where to touch instead of what to do in sentences.
  • The authors introduce Contact-Anchored Policies (CAP), which guide a robot using a 3D 'contact anchor' point in space.
  • CAP learns from only 23 hours of human demonstrations and still beats much bigger language-based robot models on core skills.
  • A tiny 52M-parameter model plus contact anchors reached 83% pick, 81% open, and 96% close success on totally new scenes in one try.
  • With an automatic checker and retries, success rises to around 90–98%, showing strong reliability.
  • A fast simulator called EgoGym helps find mistakes early and predicts real-world performance well.
  • CAP works across different robot arms without retraining, thanks to its focus on contact points instead of robot-specific details.
  • A vision-language model can click the right spot to create contact anchors, matching human clicks in quality.
  • Chaining these small skills with a high-level planner lets robots finish longer tasks like fetching items from cabinets or cleaning tables.
  • The approach will be open-sourced, making strong, practical robot skills accessible to smaller labs and teams.

Why This Research Matters

Robots that understand exactly where to touch can help with daily chores like opening doors, grabbing groceries, or tidying rooms without long, confusing instructions. Because the models are small and fast, they can run on affordable hardware, bringing home robotics closer to everyday people and caregivers. Clear contact points make robots safer and more predictable in crowded spaces by reducing random motions. The approach needs far less data and compute, so schools, small labs, and startups can build capable robots, not just giant tech companies. Reliable zero-shot skills also speed up deployment in new homes, hospitals, and warehouses. By chaining small skills, robots can finish longer tasks like fetching items from cabinets or cleaning tables. Open-sourcing the code, data, and hardware design lets the whole community improve and apply the method widely.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how a map pin shows the exact spot you want to go, while a long description like “near the big tree by the river” can be confusing? Robots feel the same way—exact spots beat fuzzy words.

🥬 Filling (The Actual Concept)

  • What it is: This paper argues that telling robots exactly where to make contact (a 3D point) works better than giving them wordy instructions for hands-on tasks.
  • How it works (step by step):
    1. Collect short human demonstrations of core skills like pick, open, and close.
    2. For each demo, find the exact 3D place where the gripper touched—the contact anchor.
    3. Train a small model to act using the camera view plus that contact point.
    4. At test time, a person or a vision-language model clicks the spot, and the robot does the rest.
  • Why it matters: Language can be vague for precise motions (like “grab the mug”—where?), but a contact spot is crystal clear, fast, and portable to new robots.

🍞 Bottom Bread (Anchor) Example: Instead of saying “open the cabinet,” you click the cabinet handle in the image. The robot moves straight to that handle and opens it.

— New Concept — 🍞 Top Bread (Hook) Imagine asking a friend for help and only saying, “Do the thing over there.” That’s what many robots hear when we use words alone.

🥬 Filling (Vision-Language-Action models)

  • What it is: Vision-Language-Action (VLA) models take pictures and words and output robot actions.
  • How it works:
    1. Read the instruction (“pick up the blue cup”).
    2. Look at the scene image.
    3. Predict a whole sequence of arm and gripper motions.
  • Why it matters: They can do many tasks, but words alone are often too fuzzy for exact hand placements.

🍞 Bottom Bread (Anchor) Example: “Grab the blue cup near the book” might still miss which side to grasp, leading to slips.

— New Concept — 🍞 Top Bread (Hook) Think of placing a sticker exactly where a door needs a push—suddenly, anyone knows where to put their hand.

🥬 Filling (Contact Anchor)

  • What it is: A contact anchor is a 3D point in space where the robot should touch the object.
  • How it works:
    1. Use depth and camera pose to compute the 3D location of the point you clicked.
    2. Track that point as the camera (on the gripper) moves.
    3. Keep the point steady in world coordinates so the robot stays aimed.
  • Why it matters: Without a clear touchpoint, the robot wastes time guessing where to go and may grab the wrong thing.

🍞 Bottom Bread (Anchor) Example: Clicking the center of a drawer handle gives a perfect approach point for pulling it open.
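The deprojection step above can be sketched in a few lines. This is a minimal pinhole-camera illustration, not the paper's code; the focal lengths and image size below are made-up numbers for the example.

```python
import numpy as np

def deproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Pinhole-camera deprojection: a clicked pixel (u, v) plus its depth
    reading becomes a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def camera_to_world(p_cam, T_world_cam):
    """Express the camera-frame point in world coordinates using the 4x4
    camera pose, so the anchor stays fixed as the camera moves."""
    p_h = np.append(p_cam, 1.0)              # homogeneous coordinates
    return (T_world_cam @ p_h)[:3]

# Click the center of a 224x224 view on an object 0.5 m away
anchor_cam = deproject_pixel(112, 112, 0.5, fx=200.0, fy=200.0,
                             cx=112.0, cy=112.0)
anchor_world = camera_to_world(anchor_cam, np.eye(4))   # identity pose here
```

Because the click is at the principal point, the anchor lands straight ahead of the camera at 0.5 m; off-center clicks fan out according to the focal lengths.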

— New Concept — 🍞 Top Bread (Hook) You know how a toolbox has separate tools for simple jobs—hammer, screwdriver, pliers? One giant “do-everything” tool would be clumsy.

🥬 Filling (Robot Utility Models)

  • What it is: Robot utility models are small, focused policies that do one useful skill very well (like pick, open, or close).
  • How it works:
    1. Train one compact model per skill.
    2. Give each model the camera view and a contact anchor.
    3. Let a higher-level controller chain them for bigger jobs.
  • Why it matters: Smaller skill tools are easier to train, faster, and more reliable than one huge model.

🍞 Bottom Bread (Anchor) Example: To fetch a snack from a cabinet: run “Open,” then “Pick,” then “Close,” like three separate tools in a row.

— New Concept — 🍞 Top Bread (Hook) Before playing a sport in a tournament, teams practice in a gym to find and fix mistakes quickly.

🥬 Filling (Simulation-in-the-loop development)

  • What it is: Train and test models in a fast simulator during development to catch problems early.
  • How it works:
    1. Build lots of randomized scenes in simulation.
    2. Test new model checkpoints constantly while training.
    3. Spot failure patterns and fix data or code before real robot trials.
  • Why it matters: Saves time, money, and wear on real robots.

🍞 Bottom Bread (Anchor) Example: If the robot often fails to lift high enough in sim, you add data filters and see the fix work before touching the real robot.

02Core Idea

🍞 Top Bread (Hook) Imagine giving a robot a GPS pin instead of a paragraph. Pins are quick; paragraphs are puzzling.

🥬 Filling (Contact-Anchored Policies)

  • What it is: The key idea is “tell the robot where to touch, not what to do in words.”
  • How it works:
    1. Learn a policy that takes (camera image + contact anchor) as input.
    2. Predict smooth arm motions and gripper actions toward that anchor.
    3. Keep tracking the anchor as the camera moves.
    4. Freeze the anchor after contact (like training data) to finish the action stably.
  • Why it matters: This removes guesswork. The anchor shrinks the search space from “anywhere in the picture” to “head straight here.”

🍞 Bottom Bread (Anchor) Example: Click the mug rim—robot aligns the gripper right there and lifts.

— Three Analogies —

  1. Sticky note: Put a sticky note on a cabinet handle—anyone knows where to pull.
  2. Treasure X on a map: Mark the X and go straight there; no long directions needed.
  3. Bowling bumpers: The contact anchor acts like rails guiding the motion so the ball (gripper) goes down the right lane.

— Before vs After —

  • Before: Big language-heavy systems needed lots of data, were slow, and sometimes grabbed the wrong thing in new places.
  • After: A compact model plus a single contact point works in new homes, with new objects, and even on new robot arms.

— Why It Works (intuition) —

  • Precise targeting: The anchor encodes “where” directly, which is the hardest part of manipulation.
  • Less ambiguity: Many words map to one spot; the spot maps to one motion plan.
  • Geometry-centric: Contact is a physical truth—handles, edges, and rims behave similarly across scenes, so skills transfer.
  • Smaller brain, faster reflexes: Without a giant language brain, the model runs on a phone and still does the job.

— Building Blocks — 🍞 Top Bread (Hook) You know how learning to dance is easier if you copy a teacher step by step?

🥬 Filling (Behavior Cloning)

  • What it is: Behavior cloning means learning by imitating expert demonstrations.
  • How it works:
    1. Record human demos (videos, poses, gripper states).
    2. Train a model to map observations to the same actions.
  • Why it matters: It gives robots quick, reliable skills without complex reward design.

🍞 Bottom Bread (Anchor) Example: Watching many examples of “how to grasp a cup,” the robot learns the move.
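The imitation idea can be shown with a toy model. This sketch clones a linear "expert" by least squares; the real system trains a neural policy, and every shape and the linear form here are illustrative assumptions.

```python
import numpy as np

# Toy behavior cloning: fit a linear policy obs -> action on demo pairs.
rng = np.random.default_rng(0)
obs = rng.normal(size=(100, 8))        # 100 demo observations (8-D each)
true_W = rng.normal(size=(8, 7))
actions = obs @ true_W                  # expert actions (7-D, like the paper)

# "Training" = least-squares imitation of the expert mapping
W_hat, *_ = np.linalg.lstsq(obs, actions, rcond=None)

# The cloned policy reproduces the demonstrated actions
pred = obs @ W_hat
imitation_error = np.abs(pred - actions).max()
```

The cloned weights recover the expert mapping almost exactly because the demos fully determine it; real demos are noisier and need a richer model.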

🍞 Top Bread (Hook) Imagine compressing a big dance into a few memorable moves you can replay.

🥬 Filling (VQ-BeT)

  • What it is: Vector-Quantized Behavior Transformer (VQ-BeT) is a two-stage imitator that turns actions into discrete tokens, then predicts them.
  • How it works:
    1. Stage 1: Learn a small action “alphabet” (codebook) from demos.
    2. Stage 2: Given images + contact tokens, predict the next action tokens.
  • Why it matters: Tokens make learning stable and fast; transformers love sequences.

🍞 Bottom Bread (Anchor) Example: The robot learns a short “vocabulary” of grasp moves it can combine on the fly.
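The "action alphabet" idea boils down to nearest-neighbor lookup in a codebook. The tiny hand-written 2-D codebook below is purely illustrative; VQ-BeT learns its codebook from demos and works in the full action space.

```python
import numpy as np

# Toy vector quantization: snap a continuous action to its nearest
# codebook entry, the way VQ-BeT discretizes actions into tokens.
codebook = np.array([
    [0.0, 0.0],   # token 0: stay
    [1.0, 0.0],   # token 1: move right
    [0.0, 1.0],   # token 2: move up
])

def quantize(action):
    """Return (token index, quantized action) for a continuous action."""
    dists = np.linalg.norm(codebook - action, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

token, snapped = quantize(np.array([0.9, 0.1]))   # close to "move right"
```

A transformer then only has to predict discrete token indices, which is the stable sequence-modeling problem transformers are good at.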

🍞 Top Bread (Hook) Think of watching a replay to find the exact moment a hand touched the ball.

🥬 Filling (Hindsight Contact Labeling)

  • What it is: After demos, the system finds when and where contact happened and labels earlier frames with that anchor.
  • How it works:
    1. Detect the moment fingers stop closing (contact time).
    2. Compute the 3D midpoint between fingers.
    3. Back-project this anchor to previous frames using camera poses.
  • Why it matters: The model learns “approach-to-contact” precisely, not just the final grab.

🍞 Bottom Bread (Anchor) Example: For opening a drawer, the anchor tracks the handle before the pull, teaching a clean approach.
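The three labeling steps can be sketched end to end. The aperture trace and camera poses below are made up for illustration; the real pipeline gets them from SAM2 masks and the iPhone's tracked pose.

```python
import numpy as np

aperture = np.array([1.0, 0.8, 0.5, 0.3, 0.3, 0.3])   # gripper opening per frame

# 1. Contact time: the first frame at which closing has stopped
closing_stopped = np.isclose(np.diff(aperture), 0.0)
contact_t = int(np.argmax(closing_stopped)) + 1

# 2. The anchor is the 3D midpoint between the fingers at contact,
#    expressed here in that frame's camera coordinates
anchor_cam = np.array([0.0, 0.0, 0.1])

def translate_z(z):
    """Camera pose that has advanced z meters along the world z-axis."""
    T = np.eye(4)
    T[2, 3] = z
    return T

poses = [translate_z(0.05 * t) for t in range(len(aperture))]

# 3. Fix the anchor in the world, then back-project it into every
#    earlier frame so each frame carries a target
anchor_world = poses[contact_t] @ np.append(anchor_cam, 1.0)
labels = [(np.linalg.inv(T) @ anchor_world)[:3] for T in poses[:contact_t]]
```

In the earliest frame the anchor appears farther away, and it slides closer as the camera advances, which is exactly the "approach-to-contact" signal the policy trains on.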

🍞 Top Bread (Hook) When you give directions, sometimes you point instead of speaking—that’s faster.

🥬 Filling (Contact Prompting)

  • What it is: At test time, a click in the image (from a human or a vision-language model) creates the contact anchor.
  • How it works:
    1. Click pixel → use depth to get a 3D point.
    2. Transform it as the camera moves so it stays fixed in the world.
    3. Feed it to the policy to drive motion.
  • Why it matters: One click beats a paragraph, and machines can click, too.

🍞 Bottom Bread (Anchor) Example: “Point to the red mug”—the VLM clicks the mug, and the robot grasps it.

🍞 Top Bread (Hook) Before a concert, bands practice in a studio to catch issues early.

🥬 Filling (EgoGym)

  • What it is: EgoGym is a fast simulator with lots of randomized scenes for pick, open, and close.
  • How it works:
    1. Procedurally generate many objects, textures, and layouts.
    2. Run quick tests during training to spot overfitting and failures.
    3. Use results to tweak data or models.
  • Why it matters: It predicts real-world success and speeds up iteration.

🍞 Bottom Bread (Anchor) Example: A model that lifts poorly in EgoGym also lifts poorly in real life—so fixes in sim help for real.

03Methodology

At a high level: Camera RGB-D + recent frames + a contact anchor → encode vision and contact → VQ-BeT predicts delta end-effector pose + gripper open/close → robot moves and updates the anchor in its moving camera frame → after contact, freeze the anchor and finish the skill.
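The pipeline above is, at heart, a closed loop that keeps steering toward the anchor. Here is a runnable toy of that loop where a point "gripper" is driven by a hand-written stand-in policy; the real policy is VQ-BeT, and everything here is illustrative.

```python
import numpy as np

def toy_policy(position, anchor):
    """Stand-in policy: step a fixed fraction of the way to the anchor."""
    return 0.25 * (anchor - position)            # delta end-effector pose

def run_skill(anchor_world, max_steps=50, tol=1e-3):
    """Closed loop: observe, predict a delta action, apply it, repeat."""
    position = np.zeros(3)                        # gripper start pose
    for _ in range(max_steps):
        position = position + toy_policy(position, anchor_world)
        if np.linalg.norm(anchor_world - position) < tol:
            return True, position                 # reached the contact point
    return False, position

reached, final = run_skill(np.array([0.3, 0.1, 0.4]))
```

The real loop adds the parts the toy omits: re-expressing the world anchor in the moving camera frame each step, and freezing it once the gripper closes.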

Step 1: Data collection with a handheld gripper 🍞 Top Bread (Hook) Imagine teaching by doing—like guiding someone’s hand to show the right move.

🥬 Filling (What/Why/How)

  • What happens: Humans use a lightweight, 3D-printed handheld gripper with an iPhone camera rigidly mounted to collect demos of Pick, Open, and Close across 424 varied places (23 hours total).
  • Why this step exists: Matching the future robot view (camera-on-gripper) reduces the gap between training and deployment.
  • Example with data: 14,606 pick demos, 3,690 open, 2,069 close, all with synchronized RGB-D and camera poses at 30 Hz.

🍞 Bottom Bread (Anchor) Example: Holding the gripper, a person picks up toys in a messy living room while the phone records everything.

Step 2: Preprocessing and labels (including gripper state) 🍞 Top Bread (Hook) Think of tidying your notes before studying—it makes learning smoother.

🥬 Filling

  • What happens: Resize images (224×224), mirror trajectories to learn left-right symmetry, estimate gripper aperture from images using SAM2 masks, and compute end-effector motion directly from the iPhone’s tracked pose.
  • Why this step exists: Clean, consistent inputs help the model learn stable motions; symmetry doubles useful data.
  • Example: The system maps finger distance in pixels to a gripper-open value between 0 and 1.

🍞 Bottom Bread (Anchor) Example: Flipping a right-hinged cabinet demo to left-hinged helps the robot open cabinets on either side.
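Two of the preprocessing labels are simple enough to sketch directly. The pixel bounds below are invented for illustration; the real values come from calibrating the SAM2 finger masks.

```python
import numpy as np

# Map finger distance in pixels to an open fraction in [0, 1]
CLOSED_PX, OPEN_PX = 10.0, 90.0

def gripper_openness(finger_dist_px):
    """Normalize a pixel-space finger distance to a 0-1 gripper state."""
    frac = (finger_dist_px - CLOSED_PX) / (OPEN_PX - CLOSED_PX)
    return float(np.clip(frac, 0.0, 1.0))

# Mirroring a trajectory doubles the data: negate the x (left-right)
# component of each end-effector translation
traj = np.array([[0.1, 0.0, 0.2],
                 [0.2, 0.0, 0.2]])
mirrored = traj * np.array([-1.0, 1.0, 1.0])
```

Clipping keeps noisy mask estimates inside the valid range, and the sign flip turns every right-hinged demo into a matching left-hinged one.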

Step 3: Hindsight Contact Labeling and Contact Detection 🍞 Top Bread (Hook) It’s easier to mark the best step on a replay than while sprinting.

🥬 Filling

  • What happens: Detect the contact time when gripper closing stops; compute the 3D contact anchor between fingers; back-project that anchor to earlier frames using camera poses so every frame carries a target.
  • Why this step exists: It teaches the model to approach the exact touchpoint smoothly, not just the grab moment.
  • Example with data: For Close, contact is labeled during data collection by briefly closing upon touching the door.

🍞 Bottom Bread (Anchor) Example: The contact point rides along the drawer handle across frames, guiding the approach path.

Step 4: Policy learning with VQ-BeT (Behavior Cloning) 🍞 Top Bread (Hook) Like learning piano by copying finger positions, then compressing them into easy chunks.

🥬 Filling

  • What happens: A ResNet-50 encoder (pretrained with MoCo) turns each RGB image into a feature; the 3D contact anchor is linearly projected into a contact feature; concatenate both to form an observation token. A VQ-VAE builds a discrete action codebook; a transformer predicts the next action tokens (delta pose + gripper).
  • Why this step exists: The tokenized action space stabilizes learning; concatenated contact + vision tells the model both “what it sees” and “where to go.”
  • Example: With context of the past 3 frames, the model outputs a 7D action (3D translation, 3D rotation, 1D gripper) each step.

🍞 Bottom Bread (Anchor) Example: Seeing a drawer and its handle point, the model outputs tiny moves to align, touch, and pull.
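The observation-token construction in this step is just a concatenation of two features. The 16-D vision feature and random projection below are stand-ins for illustration; the paper uses a ResNet-50 feature and a learned linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vision_feat = rng.normal(size=16)        # stand-in for the ResNet-50 feature
anchor = np.array([0.1, -0.2, 0.4])      # 3D contact anchor

W_proj = rng.normal(size=(16, 3))        # "learned" linear projection (random here)
contact_feat = W_proj @ anchor           # anchor lifted into feature space

# One observation token = what the robot sees + where it should go
obs_token = np.concatenate([vision_feat, contact_feat])
```

The transformer consumes a short history of such tokens (the past few frames) and emits the next discrete action tokens.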

Step 5: Contact prompting during inference 🍞 Top Bread (Hook) Point, click, go—like tapping where you want to zoom on a phone camera.

🥬 Filling

  • What happens: A human or a vision-language model clicks a pixel; the depth map turns it into a 3D contact anchor; the robot’s kinematics keep that anchor steady in the world as the camera moves; after closing the gripper, the anchor is frozen (matching training).
  • Why this step exists: The click gives precise intent; tracking preserves the target despite camera motion.
  • Example: “Point to the yellow bag” → VLM clicks bag → robot executes pick.

🍞 Bottom Bread (Anchor) Example: The anchor updates frame-by-frame so the robot doesn’t drift off-target.

Step 6: Simulation-in-the-loop with EgoGym 🍞 Top Bread (Hook) Practice makes progress—especially in a fast, forgiving gym.

🥬 Filling

  • What happens: EgoGym (MuJoCo-based) spawns tons of randomized pick/open/close scenes with varied objects, sizes, textures, and distractors. During training, checkpoints are evaluated to detect overfitting and common failure modes.
  • Why this step exists: Real-world tests are slow and expensive; sim lets you quickly spot issues that also show up on real robots.
  • Example: A low-lift failure discovered in sim led to filtering out static frames in data, improving real success next round.

🍞 Bottom Bread (Anchor) Example: Success rates in EgoGym lined up with real success across four model checkpoints.
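The checkpoint-screening idea can be mimicked with a toy evaluator: roll out many randomized "scenes" and report a success rate. The scene, rollout, and noise model below are all invented stand-ins for EgoGym, not its API.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_rollout(anchor, noise):
    """Succeed if the checkpoint's final position lands near the anchor."""
    final = anchor + rng.normal(scale=noise, size=3)
    return np.linalg.norm(final - anchor) < 0.05

def eval_checkpoint(noise, n_scenes=500):
    """Success rate over randomized scenes; `noise` stands in for
    how sloppy a given checkpoint's motions are."""
    anchors = rng.uniform(-0.5, 0.5, size=(n_scenes, 3))
    return float(np.mean([toy_rollout(a, noise) for a in anchors]))

precise_rate = eval_checkpoint(noise=0.01)   # a well-trained checkpoint
sloppy_rate = eval_checkpoint(noise=0.10)    # an overfit/undertrained one
```

Even this toy separates good checkpoints from bad ones cheaply, which is the whole point: rank candidates in simulation, then confirm only the winners on hardware.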

The Secret Sauce

  • Minimal, precise conditioning: A single contact anchor collapses ambiguity.
  • Modular skill library: Small policies for pick/open/close are easier to train and chain.
  • Sim-first iteration: EgoGym reveals failure modes early, so fixes transfer to reality.
  • Efficient footprint: About 52M parameters, running on phones and onboard CPUs.

04Experiments & Results

The Test

  • What they measured: Zero-shot success on unseen environments and objects for Pick, Open, and Close. They also tested different robot arms (Stretch 3, Franka FR3, XArm 6, UR3e), automatic contact clicks by VLMs, and whether a retry system improves reliability. Finally, they checked how well EgoGym results predict real-world performance.

The Competition

  • AnyGrasp (a grasp pose system), π0.5-DROID (a large VLA on Franka/DROID embodiment), and stretch-open (a modular opener on Stretch) served as baselines.

The Scoreboard (with context)

  • Single try, unseen scenes/objects on Stretch:
    • Pick: CAP 83% (like an A- when others got around a C+). AnyGrasp scored 47%.
    • Open: CAP 81% (a strong B+), vs stretch-open 58%.
    • Close: CAP 96% (an A+).
  • With automatic verifier and retries (up to 10): Pick 90%, Open 91%, Close 98% (this is like retaking a quiz until you get it right—with a fair referee).
  • With VLM clicks (no human), performance matches human clicks: Pick 81%, Open 80%, Close 97%—showing robots can be fully autonomous once a VLM supplies the click.
  • Cross-embodiment (same CAP policy, different robot arms): Franka 79–81%, XArm 83%, UR3e 70% (UR3e was limited by shorter reach). This shows the skill transfers.
  • Compared to the big VLA π0.5-DROID on Franka: CAP-VLM 81% vs π0.5-DROID 25% (CAP is like finishing most of your chores while the big model barely starts).

Surprising Findings

  • A small model plus a single contact point beat much larger language-based systems on atomic skills by 23–56%.
  • VLM-chosen contact points were as good as human clicks, enabling full autonomy.
  • EgoGym’s success rates correlated with real robot success across multiple checkpoints—fast simulation was a good compass.
  • An ablation without contact anchors (RGB-only) dropped to 58% on Close from 96%, proving contact conditioning is the key ingredient.
  • Adding more visual distractors didn’t hurt CAP with oracle anchors, but it did hurt VLM-driven CAP and big VLAs, which picked wrong objects more often.

Longer Tasks by Chaining Skills

  • With a high-level controller and tool-calling, robots completed multi-step goals:
    • Fetch coffee beans from a cabinet (Open → Pick → Drop → Close): 6/10 success; most failures were due to a verifier calling a partial open “good,” causing later collisions.
    • Clear a table (repeated Pick → Drop): 10/10 success, though the verifier sometimes asked for unnecessary retries.

Takeaway

  • A precise point to touch plus a compact skill model delivers dependable zero-shot manipulation across places, objects, and even robot bodies—and scales to longer tasks when chained.

05Discussion & Limitations

Limitations

  • Single-contact focus: Tasks needing multiple simultaneous contacts (two-handed actions, twisting and pulling at once) aren’t handled yet.
  • Anchor quality: Bad depth, wrong clicks, or occlusions can place the anchor incorrectly, derailing the motion.
  • Verifier fragility: Automatic success/failure checks can be fooled (e.g., counting a partial open as success), leading to downstream trouble.
  • No force sensing: CAP relies on vision and kinematics; delicate materials or tight fits may need force/torque feedback.
  • Clutter extremes: When many similar objects overlap, even a VLM may click the wrong target, reducing autonomy quality.

Required Resources

  • About 23 hours of diverse handheld demos using a 3D-printed gripper + iPhone (AnySense app).
  • Moderate training compute (ResNet-50 visual encoders, VQ-BeT policy ~52M params) and MuJoCo for EgoGym.
  • A robot with at least 6-DoF and a compatible gripper; onboard CPU or small GPU for inference.

When NOT to Use

  • Multi-contact, bimanual, or force-critical tasks (e.g., screwing a cap, folding cloth precisely) until the method is extended.
  • Scenes without reliable depth or camera pose, which are needed to make a good 3D anchor.
  • Purely language-driven household assistants where no one (or no VLM) can provide accurate clicks.

Open Questions

  • Multi-anchor extension: How to handle several anchors at once or a distribution over possible contact points?
  • Self-proposed anchors: Can the policy learn to choose anchors by itself and revise them mid-task?
  • Stronger verifiers: Can success checking be folded into training via reinforcement learning to reduce false calls?
  • Hybrid goals: What’s the best mix of words (for sequence planning) and contact points (for precise motion)?
  • Uncertainty handling: How should the model act when depth is noisy or the anchor is partially occluded?

06Conclusion & Future Work

Three-Sentence Summary

  • This paper shows that giving robots a single, precise 3D contact point (a contact anchor) lets a small model perform pick, open, and close robustly in new places and on new objects.
  • With only 23 hours of demos plus a fast simulator (EgoGym) for iteration, these Contact-Anchored Policies outperform much larger language-based systems and run on phones and multiple robot arms.
  • Chaining these compact skills with a high-level controller enables longer chores, and automatic clicks from VLMs remove humans from the loop.

Main Achievement

  • Replacing language conditioning with contact conditioning creates strong, modular robot utility models that generalize zero-shot with far less data and compute.

Future Directions

  • Scale to multiple simultaneous contacts and bimanual actions; teach policies to propose and refine anchors themselves; integrate robust success verification into end-to-end training; better fuse language for high-level planning with contact for low-level precision.

Why Remember This

  • When the job is hands-on, telling a robot exactly where to touch beats giving it an essay. That one small shift—contact as the interface—turns out to be a powerful, practical shortcut to reliable general robot skills.

Practical Applications

  • Home assistance: Click the handle to open cabinets, then pick and place groceries onto shelves.
  • Elderly care: Precisely grasp medication bottles or utensils with minimal setup time.
  • Warehouse kitting: Point to the correct bin handle and item to enable quick pick-and-place flows.
  • Office cleanup: Repeated pick-and-drop to clear desks into recycling bins.
  • Kitchen prep: Open drawers, fetch tools, and put them back safely using anchor clicks.
  • Mobile inspection: Click latches or panels to open/close enclosures during routine checks.
  • Education kits: Run CAP models on phones and low-cost robot arms for classroom demos.
  • Rapid prototyping: Use EgoGym to iterate on new grippers or grasp strategies before real tests.
  • Hospital logistics: Open cabinet doors and retrieve supplies without text-heavy programming.
  • Field service: A technician or VLM provides anchor clicks for robots to operate unfamiliar hardware.
#contact-anchored policies#robot utility models#contact anchor#behavior cloning#VQ-BeT#vision-language-action#simulation-in-the-loop#EgoGym#zero-shot generalization#manipulation#embodiment generalization#hindsight labeling#autonomous retries#depth-based deprojection