UltraDexGrasp: Learning Universal Dexterous Grasping for Bimanual Robots with Synthetic Data
Key Summary
- Robots need many different ways to grab things, just like people use pinch, tripod, whole-hand, or two hands together.
- This paper builds a big synthetic (computer-made) dataset and a smart policy so bimanual robots can pick up tiny, medium, and huge objects safely.
- They mix two strengths: optimization finds stable, physics-respecting grasps, and motion planning turns them into smooth, collision-free moves.
- Their dataset, UltraDexGrasp-20M, has 20 million frames over 1,000 objects, covering four grasp strategies and lots of shapes, sizes, and weights.
- A simple point-cloud policy with unidirectional attention learns from this data and predicts actions as a probability distribution (not a single guess), making it steadier.
- In simulation, the policy gets 84.0% average success, beating strong baselines by a big margin; in the real world, it reaches 81.2% without extra training.
- Key ideas that help: picking grasps close to the robot's current pose, adding robot "imaged point clouds," and randomizing camera and joint settings to bridge sim-to-real.
- The method generalizes well to new objects and sizes, including small pinch/tripod, medium whole-hand, and large bimanual lifts.
- Limitations include reliance on synthetic data and simpler scenes than cluttered households, but the approach scales and transfers surprisingly well.
- This work suggests we can train robust two-hand robot grasping mostly in simulation and make it work in the real world.
Why This Research Matters
Robots that can switch grips like people are far more useful at home, in hospitals, and in warehouses. This work shows we can train that versatility mostly in simulation by grounding it in physics and careful planning, then run it on real robots with strong success. That lowers the cost and time to build capable systems, since we don't need to hand-collect endless real demos. The approach also scales across sizes, from tiny parts to large, heavy items, unlocking broader tasks with one unified policy. By open-sourcing the pipeline, the community can build even larger, richer datasets and policies. Ultimately, this moves us closer to safe, reliable, human-level manipulation in everyday environments.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how you use different grips for different things: pinch tiny beads, hold a mug with your whole hand, or carry a big box with both hands? Your brain picks the right grip automatically.
🥬 Filling (The Actual Concept):
- What it is: Universal dexterous grasping means a robot can choose and perform many kinds of grasps (pinch, tripod, whole-hand, bimanual) across many objects, just like you do.
- How it works: The robot needs (1) a way to invent safe grasps that match object shapes and physics, (2) a way to move both arms to those grasps smoothly, and (3) a brain (policy) that sees 3D points and picks good actions.
- Why it matters: Without this, robots drop things, crash into objects, or can only handle a few shapes.
🍞 Bottom Bread (Anchor): Picking up a grape (pinch), a smartphone (tripod), a basketball (whole-hand), or a heavy box (two hands) all need different strategies; this paper teaches robots to pick the right one.
The World Before: Robotic grasping advanced a lot for simple grippers (like two fingers) and single hands. But handling all sizes, from tiny screws to big boxes, and switching strategies on the fly with two hands was rare. Data was the big blocker: it's hard to get huge, diverse, physics-correct examples for two hands with many joints. Methods split into three camps:
- Reinforcement learning experts (great but often deterministic, low diversity, heavy to train, and often limited to a few object types).
- Optimization or learning-based grasp synthesis (can produce grasps, but often open-loop, struggle with real-world motion, and ignore dual-arm coordination).
- Datasets covered mostly single-hand grasps; bimanual cases were scarce.
The Problem: Robots need grasps that both "hug" the object geometry and withstand pushes and gravity. With two arms and many fingers, planning becomes a maze: more joints, more ways to collide, and more ways to fail. Generating lots of high-quality, multi-strategy, two-hand data, fast and at scale, was the missing piece.
Failed Attempts (and why they struggled):
- RL experts: After training, one observation often maps to one action, reducing grasp variety; they are also expensive to scale to many objects and strategies.
- Open-loop synthesis: Looks good standing still, but can break down when the robot must actually move its arms, avoid collisions, and squeeze stably.
- Bimanual: Few works exist; some simplify contact physics or skip closed-loop control, so they don't survive real-world jitters.
The Gap: We needed a pipeline that (1) synthesizes physically robust grasps with detailed contact forces, (2) plans dual-arm, collision-free motions, (3) covers multiple strategies, and (4) does it all at a scale big enough to train a general policy, then bridges from sim to real.
Real Stakes (Why you should care):
- Home help: From lifting laundry baskets to picking up crumbs, robots need many grips.
- Warehousing: Safely handling big, heavy, or oddly shaped items saves time and avoids damage.
- Healthcare: Gentle, reliable grasps for assistive tasks matter for safety and dignity.
- Manufacturing: Fewer custom fixtures if robots can adapt their hands.
- Education and research: Open-source tools and large datasets speed up everyone.
🍞 Anchor Example: Imagine a robot switching from a two-finger pinch to grab a battery, to a whole-hand hold for a coffee mug, to using both hands for a heavy toolbox, reliably, without being reprogrammed each time. That's the world this paper pushes toward.
02 Core Idea
🍞 Top Bread (Hook): Imagine building a Lego tower: one hand holds pieces steady while the other places new bricks. If your hands don't coordinate, the tower collapses.
🥬 Filling (The Actual Concept):
- What it is: The key insight is to fuse two worlds: optimization that finds physically solid grasps, and planning that turns them into smooth, dual-arm demos. Then train a simple, robust point-cloud policy on that synthetic data.
- How it works (step by step):
- Automatically generate candidate grasps that match the object's shape and physics (force balance, contact friction).
- Filter for reachability and collisions; prefer the grasp closest to the robot's current pose for smooth motion.
- Plan four stages (pregrasp, grasp, squeeze, lift) with dual-arm coordination.
- Record millions of frames across strategies and objects.
- Train a point-cloud policy with unidirectional attention and probabilistic action outputs.
- Why it matters: Without this full pipeline, you either get pretty grasps that break during motion or nice motions that don't hold up physically, and the policy won't generalize.
🍞 Bottom Bread (Anchor): It's like rehearsing a dance: you choreograph two dancers (arms) with safe steps (planning) based on good balance points (optimization), film tons of practice (dataset), and then teach a new dancer (policy) who can perform with any partner (new objects).
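The step-by-step recipe above can be sketched as a toy data-generation loop. This is a hedged illustration only: the function names (`synthesize_grasps`, `reachable`, `pick_nearest`) and the simplified 3-D geometry are invented here, not the authors' pipeline, which optimizes contact forces and plans full dual-arm trajectories.

```python
import random

STAGES = ["pregrasp", "grasp", "squeeze", "lift"]

def synthesize_grasps(obj_size, n=20, rng=None):
    """Stand-in for optimization-based synthesis: propose candidate grasp
    positions near the object (real synthesis would optimize contact forces)."""
    rng = rng or random.Random(0)
    return [[rng.uniform(-obj_size, obj_size) for _ in range(3)] for _ in range(n)]

def reachable(grasp, limit=1.0):
    """Stand-in for reachability/collision filtering: keep in-workspace grasps."""
    return all(abs(c) <= limit for c in grasp)

def pick_nearest(grasps, current_pose):
    """Prefer the candidate closest to the robot's current hand pose."""
    return min(grasps, key=lambda g: sum((a - b) ** 2 for a, b in zip(g, current_pose)))

def generate_demo(obj_size, current_pose):
    candidates = [g for g in synthesize_grasps(obj_size) if reachable(g)]
    target = pick_nearest(candidates, current_pose)
    # One frame per stage here; the real dataset records dense trajectories.
    return [{"stage": s, "target": target} for s in STAGES]

demo = generate_demo(obj_size=0.5, current_pose=[0.0, 0.0, 0.0])
```

Repeating this loop over many objects and strategies is what scales the dataset to millions of frames.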
Multiple Analogies (same idea, three ways):
- Chef's hands: One hand stabilizes the cutting board while the other slices. Optimization picks where to press and hold; planning ensures the moves don't bump bowls; practice videos train a new chef.
- Rock climbing: You choose holds that won't slip (optimization), plan your body moves so you don't swing into the wall (planning), and after lots of routes (dataset), you can handle new cliffs (policy generalization).
- Orchestra: The score ensures harmony (optimization physics), the conductor guides timing (planning), and recordings train new musicians (policy) who can play new pieces (novel objects).
Before vs After:
- Before: Either stable grasps without real motion execution, or motions that ignored physics, often single-hand only; policies struggled with new objects and big size changes.
- After: Unified pipeline yields physically plausible, reachable, coordinated demos across four strategies; a compact policy learns from 20M frames and transfers to real robots with 81.2% success.
Why It Works (intuition, not equations):
- Physically valid contacts + friction-aware forces prevent slipping.
- Picking grasps near the current pose reduces arm travel, so planning is easier and safer.
- Four-stage plans structure the motion, so the policy sees consistent patterns.
- Unidirectional attention makes action tokens read from scene points (not from each other), stabilizing focus.
- Predicting a probability over actions avoids overconfident, jerky commands.
Building Blocks:
- Scene point clouds that include the robot's own shape (so the policy knows where arms/hands are).
- Optimization-based grasp synthesizer with friction and force-closure logic.
- Dual-arm motion planning through pregrasp → grasp → squeeze → lift.
- Data filters and randomizations (camera pose, joint impedance) to bridge sim-to-real.
- A decoder-only transformer with learnable action queries and truncated-normal outputs.
🍞 Anchor Example: Training only in simulation, the robot learns to pinch a battery, whole-hand a mug, and bimanually lift a big storage bin, then does the same in your lab, zero-shot.
03 Methodology
At a high level: Input (object + robot in a simulated scene) → [Optimization-based grasp synthesis] → [Dual-arm motion planning + execution + filtering] → Output (millions of high-quality demonstrations), then train a point-cloud policy that maps scene points to actions.
We'll introduce each key concept with a Sandwich and show concrete examples.
- Bimanual Grasping 🍞 Hook: Imagine carrying a heavy watermelon; you naturally use both hands to keep it from slipping. 🥬 Concept: Bimanual grasping means using two hands together to hold or move an object.
- How it works: Plan both hands' positions, orientations, and finger joints so they support and balance the object.
- Why it matters: One hand might not supply enough support or torque; two hands improve stability and reach. 🍞 Anchor: Lifting a wide box: left hand holds one side, right hand the other, so it doesn't tip.
- Grasp Synthesis (finding stable grasps) 🍞 Hook: Think of finding sturdy spots for your fingers when holding a big book without dropping it. 🥬 Concept: Grasp synthesis is the process that chooses where and how hands contact the object so it won't slip.
- How it works: Start near the object, propose contact points (fingertips/palm), and optimize to match target "hold" forces and avoid collisions.
- Why it matters: Random guesses often slip or collide; optimization bakes in physics. 🍞 Anchor: It's like trying different finger placements on a slippery soap bar until you find the hold that doesn't slide.
- Friction Cone (no slip rule) 🍞 Hook: Shoes grip the floor better when friction is high; socks on ice slide. 🥬 Concept: The friction cone says your sideways push must stay within a limit set by friction and normal force so the contact won't slide.
- How it works: It uses the rule ‖f_t‖ ≤ μf_n, where f_t is the tangential (sideways) force, f_n the normal (pressing) force, and μ the friction coefficient. For example, if μ = 0.5 and the normal force is 10 N, then the maximum sideways force allowed is 5 N; any ‖f_t‖ must stay below that.
- Why it matters: Exceed the limit and the fingers slip. 🍞 Anchor: Like not pushing a book sideways too hard when you're just barely pressing down on it.
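The no-slip rule is simple enough to check in a few lines. This is a minimal sketch of the standard friction-cone inequality; the function name and the example numbers are illustrative, not taken from the paper.

```python
def within_friction_cone(f_t, f_n, mu):
    """No-slip check: the sideways (tangential) force magnitude must not
    exceed mu times the normal (pressing) force."""
    return abs(f_t) <= mu * f_n

# With mu = 0.5 and a 10 N press, up to 5 N of sideways force holds.
holds = within_friction_cone(4.0, 10.0, 0.5)       # True: inside the cone
slips = not within_friction_cone(6.0, 10.0, 0.5)   # True: 6 N exceeds the 5 N limit
```

A grasp optimizer applies this check at every contact point when scoring candidate grasps.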
- Grasp Wrench Space (resisting pushes/torques) 🍞 Hook: When you hold a door against wind, your hands must counter both push and twist. 🥬 Concept: The grasp wrench space collects all pushes and twists (wrenches) your contacts can apply to fight disturbances.
- How it works: The net wrench sums each contact force through its grasp matrix, w = Σᵢ Gᵢfᵢ. Example: with two contacts, suppose G₁ and G₂ are identity-like for simplicity and f₁ = (3, 0, 0) N, f₂ = (2, 0, 0) N. Then w = (5, 0, 0) N, which can balance a 5 N push from the opposite direction.
- Why it matters: If your wrench space is too small, external forces win and you drop the object. 🍞 Anchor: Two kids pushing on a swing from different sides can cancel motion if their pushes add up just right.
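The summation above can be written directly in code. This is a hedged sketch of the general form w = Σᵢ Gᵢfᵢ, using identity grasp maps and a force-only 3-D case as simplifying assumptions; real grasp matrices also map contact forces into torques.

```python
import numpy as np

def net_wrench(grasp_maps, contact_forces):
    """Sum each contact force mapped through its grasp matrix: w = sum_i G_i @ f_i."""
    return sum(G @ f for G, f in zip(grasp_maps, contact_forces))

# Identity-like grasp maps for a simplified, force-only 3-D case.
G1, G2 = np.eye(3), np.eye(3)
f1 = np.array([3.0, 0.0, 0.0])   # newtons
f2 = np.array([2.0, 0.0, 0.0])
w = net_wrench([G1, G2], [f1, f2])   # net wrench (5, 0, 0) N
```

The set of all such w, as the fᵢ range over their friction cones, is the grasp wrench space.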
- Choosing the Preferred Grasp (close to current pose) 🍞 Hook: If your pencil is near your right hand, you'll use the right hand instead of walking around the desk. 🥬 Concept: Among valid grasps, pick the one closest to the robot's current hand pose to save motion and reduce risk.
- How it works: Use a distance of the form d = ‖Δt‖ + λΔθ, combining the translation gap Δt and the rotation gap Δθ with a weight λ. For instance, if the translation gap is 0.2 m, λ = 0.1, and the rotation gap is 30° (so Δθ ≈ 0.52 rad), then d ≈ 0.2 + 0.1 × 0.52 ≈ 0.25.
- Why it matters: Smaller moves mean fewer collisions, faster actions, and smoother control. 🍞 Anchor: Choosing the nearby cup to drink from instead of stretching for a far one.
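The worked example above can be sketched as a small scoring function. The exact metric and the weight λ = 0.1 are illustrative assumptions, not the paper's reported values.

```python
import numpy as np

def grasp_distance(t_cand, t_curr, theta_rad, lam=0.1):
    """Pose gap: translation distance plus lam times the rotation angle (radians).
    The form of the metric and lam = 0.1 are illustrative assumptions."""
    gap = np.linalg.norm(np.asarray(t_cand) - np.asarray(t_curr))
    return float(gap + lam * theta_rad)

current = [0.0, 0.0, 0.0]
near = grasp_distance([0.2, 0.0, 0.0], current, np.deg2rad(30.0))  # ~0.25
far = grasp_distance([0.5, 0.0, 0.0], current, 0.0)                # 0.5
preferred = min(near, far)   # the closer candidate wins
```

Scoring every valid candidate this way and taking the minimum yields the preferred grasp.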
- Motion Planning in Four Stages 🍞 Hook: When you park a car, you approach, align, stop, and then turn off the engine; order matters. 🥬 Concept: The grasp is executed via pregrasp → grasp → squeeze → lift.
- How it works: Plan collision-free, dual-arm trajectories; group tiny steps to avoid hesitations; execute with PD control.
- Why it matters: Without staging and planning, hands might bump the object or each other, or squeeze too early. 🍞 Anchor: Approach a fragile ornament, gently grasp, tighten just enough, then lift steadily.
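Staged execution with PD control can be sketched on a toy one-dimensional "joint". Everything here is assumed for illustration: the gains, the per-stage targets, and the 1-D state stand in for full dual-arm joint trajectories.

```python
def pd_step(q, q_target, q_prev, kp=0.5, kd=0.1):
    """One proportional-derivative step toward a joint target."""
    return q + kp * (q_target - q) - kd * (q - q_prev)

def execute_stages(q0, stage_targets, steps_per_stage=50):
    """Run the stages in order; each stage tracks its own joint target."""
    q, q_prev = q0, q0
    for target in stage_targets:  # pregrasp -> grasp -> squeeze -> lift
        for _ in range(steps_per_stage):
            q, q_prev = pd_step(q, target, q_prev), q
    return q

# Toy 1-D joint: approach, close, tighten slightly, then hold while lifting.
final = execute_stages(0.0, [0.3, 0.5, 0.55, 0.55])
```

The ordered targets are what prevent "squeezing too early": the squeeze target only applies after the grasp stage has converged.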
- Point Clouds (seeing in 3D) 🍞 Hook: Imagine the world made of tiny dots that show where surfaces are. 🥬 Concept: A point cloud is a set of 3D points from depth cameras that describes the scene.
- How it works: The policy reads these points, including an "imaged" robot model so it knows where its own arms are.
- Why it matters: Without accurate 3D, the robot can't aim its hands precisely. 🍞 Anchor: Like a connect-the-dots drawing of the room that helps the robot see objects and itself.
- Farthest Point Sampling (FPS) 🍞 Hook: If you're picking landmarks on a map, you spread them out to cover the area well. 🥬 Concept: FPS picks points that are far apart to summarize the scene without losing coverage.
- How it works: Downsample to 2,048 points to balance detail and speed.
- Why it matters: Too many points slow learning; too few miss small parts. 🍞 Anchor: Choosing spaced-out checkposts to still navigate the whole park.
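FPS itself is a short greedy algorithm. Below is a minimal numpy sketch of the standard procedure (the cloud size and seed are arbitrary; the paper only specifies the 2,048-point target).

```python
import numpy as np

def farthest_point_sampling(points, k, seed_idx=0):
    """Greedy FPS: repeatedly pick the point farthest from everything chosen so far."""
    chosen = [seed_idx]
    dists = np.linalg.norm(points - points[seed_idx], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dists))       # farthest remaining point
        chosen.append(idx)
        # Each point keeps its distance to the nearest chosen point.
        dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
    return points[chosen]

rng = np.random.default_rng(0)
cloud = rng.random((4096, 3))                   # raw depth points
sample = farthest_point_sampling(cloud, 2048)   # well-spread 2,048-point summary
```

Because each new pick maximizes the distance to all previous picks, small protruding parts (handles, lips) are far less likely to be dropped than with uniform random sampling.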
- PointNet++ Encoder 🍞 Hook: You first notice local details (a mug's handle), then the whole object. 🥬 Concept: PointNet++ builds local-to-global 3D features by grouping neighbors and summarizing them.
- How it works: Two set-abstraction layers gather nearby points (like 32-neighbor groups) and pool features, then downsample to 256 points for higher-level cues.
- Why it matters: Fine geometry (like edges) matters a lot for finger placement. 🍞 Anchor: Spotting the tiny lip of a lid so you can pinch it.
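One set-abstraction step can be sketched with plain numpy. This is a deliberately simplified stand-in for PointNet++: a single random linear map replaces the learned MLP, and k-nearest-neighbor grouping replaces ball query; only the group-transform-pool structure is the point.

```python
import numpy as np

def set_abstraction(points, feats, centroid_idx, k, W):
    """Simplified set-abstraction step: for each centroid, gather its k nearest
    neighbors, apply a shared linear map (stand-in for the MLP), and max-pool."""
    out = []
    for ci in centroid_idx:
        d = np.linalg.norm(points - points[ci], axis=1)
        nbrs = np.argsort(d)[:k]          # local 32-neighbor group
        local = feats[nbrs] @ W           # shared per-point transform
        out.append(local.max(axis=0))     # permutation-invariant pooling
    return np.stack(out)

rng = np.random.default_rng(0)
pts = rng.random((2048, 3))
W = rng.standard_normal((3, 64))
# Downsample 2,048 points to 256 centroids, each summarized by a 64-D feature.
level1 = set_abstraction(pts, pts, np.arange(0, 2048, 8), k=32, W=W)
```

Max-pooling within each neighborhood is what makes the summary insensitive to point ordering while still keeping local shape cues.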
- Unidirectional Attention (focus that flows one way) 🍞 Hook: Picture a spotlight that only shines from the questions (actions) to the scene, not back and forth. 🥬 Concept: In the transformer, action query tokens read from point features but do not distract each other.
- How it works: Queries attend to scene features to form action latents; this reduces chatter between queries.
- Why it matters: Less internal noise means steadier, more reliable actions. 🍞 Anchor: A student asks the board for info; the answers come from the board (scene), not from other students whispering.
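The one-way flow is just cross-attention with no query-to-query term. This is a minimal single-head numpy sketch (the dimensions, 8 query tokens, and 256 scene features are illustrative), not the paper's full decoder architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_way_attention(queries, scene_feats):
    """Action queries attend only to scene features; attention never flows
    between queries, so action tokens cannot distract each other."""
    scores = queries @ scene_feats.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ scene_feats

rng = np.random.default_rng(0)
scene = rng.standard_normal((256, 64))          # encoded point features
action_queries = rng.standard_normal((8, 64))   # learnable action tokens
latents = one_way_attention(action_queries, scene)
```

Compare with full self-attention, where the score matrix would also include query-query rows; dropping them is exactly what "unidirectional" means here.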
- Probabilistic Actions (bounded Gaussian) 🍞 Hook: When you throw a ball, you aim, but you also allow for a little wiggle. 🥬 Concept: The policy predicts a Gaussian distribution over actions with bounds, not a single point, improving stability.
- How it works: Suppose the mean action is 0.10 and the standard deviation is 0.02, with samples bounded to roughly [0.06, 0.14]; a sample like 0.11 is plausible and safe, while wild outliers are ruled out.
- Why it matters: If the scene is uncertain, distributions avoid overconfident mistakes. 🍞 Anchor: Instead of saying "move exactly 10 cm," the robot says "about 10 cm," then refines next step.
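Bounded Gaussian sampling can be sketched in a few lines. Note the hedge in the code: clipping is a simplification of a true truncated normal (which renormalizes the density rather than piling mass at the bounds), and the numbers are illustrative, not the paper's.

```python
import numpy as np

def sample_bounded_action(mean, std, low, high, rng):
    """Draw an action from a Gaussian, then clip into safe bounds.
    (Clipping is a simplification of a true truncated normal,
    which would renormalize the density instead.)"""
    return float(np.clip(rng.normal(mean, std), low, high))

rng = np.random.default_rng(0)
# Illustrative numbers: mean 0.10, sigma 0.02, bounds at mean +/- 2 sigma.
actions = np.array([sample_bounded_action(0.10, 0.02, 0.06, 0.14, rng)
                    for _ in range(1000)])
```

Every sampled command stays inside the safe band around the mean, which is the stability property the text describes.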
- Simulation-to-Real Transfer 🍞 Hook: Practicing in a flight simulator teaches pilots skills they use in real planes. 🥬 Concept: Train fully in sim, then run on real robots by narrowing gaps (camera calibration, noise filtering, robot imaged point clouds, dynamics randomization).
- How it works: Align coordinates, clean depth outliers, add the robot's modeled points, randomize joint impedance.
- Why it matters: Without this, policies overfit to clean sim data and fail on messy real input. 🍞 Anchor: Like practicing piano on a digital keyboard that's tuned to feel like a real piano.
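"Clean depth outliers" usually means statistical outlier removal (SOR), which the experiments section also mentions. Here is a brute-force numpy sketch of the standard idea; real stacks use a library implementation with spatial indexing, and the k and std_ratio values are illustrative defaults.

```python
import numpy as np

def statistical_outlier_removal(points, k=8, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbors is far
    above the cloud-wide average (a common depth-noise filter)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # ignore self-distances
    knn_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)
    keep = knn_mean <= knn_mean.mean() + std_ratio * knn_mean.std()
    return points[keep]

rng = np.random.default_rng(0)
surface = rng.normal(size=(500, 3)) * 0.01       # tight patch of real surface
noisy = np.vstack([surface, [[5.0, 5.0, 5.0]]])  # one depth "flying pixel"
clean = statistical_outlier_removal(noisy)       # the stray point is filtered out
```

Flying pixels at depth edges are exactly the kind of noise this removes before the points reach the policy.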
Secret Sauce (what's clever):
- Integrating physically grounded optimization with practical dual-arm planning, at scale.
- Preferring grasps near the current pose for natural, reachable motions.
- A compact, clean policy (no extra bells and whistles) that still generalizes via unidirectional attention and probabilistic outputs.
- Robot "imaged point clouds" to robustly bridge sim and real sensing.
04 Experiments & Results
The Test (what, why):
- They measured grasp success rates across small, medium, and large objects, both seen and unseen during training, in simulation and in the real world.
- Why: To prove universal dexterous grasping needs to handle many sizes and novel shapes, not just a few examples.
The Competition (baselines):
- DP3: A strong diffusion-policy baseline using point clouds and robot state.
- DexGraspNet: An optimization-based pose generator using full object meshes (single-hand only), paired with motion planning for execution.
The Scoreboard (with context):
- Simulation (600 diverse objects, 10 trials each):
- Ours: 84.0% average success. Think: an A grade across a very mixed, hard exam.
- DP3 (trained on the same data): much lower; our method improves by roughly 37.3 percentage points, like jumping from a middling C to a solid A.
- DexGraspNet: 58.8% on small/medium objects, and it cannot do large bimanual grasps (single-hand only). That's like completing only part of the test.
- Unseen objects: Ours still lands in the mid-80s (83.4%), showing strong generalization, like acing surprise questions.
- Data scaling (Fig. 5): Performance steadily rises with more training data, crossing the data-generation baseline once training exceeds about 1M frames. More (good) practice, better performance.
Real World (two UR5e, two XHand, Azure Kinect cameras, 25 objects, 15 tries each):
- Ours: 81.2% average success, zero-shot from sim. That's like keeping an A- when moving from practice tests to the real exam with noise and distractions.
- DP3 and DexGraspNet trail behind; DexGraspNet again limited by single-hand assumptions and dependence on perfect object meshes.
Surprising/Notable Findings:
- Sim-only training transfers strongly with careful sim-to-real steps (camera calibration, SOR filtering, imaged robot point clouds, dynamics randomization).
- Picking grasps near the current pose made motions look more human-like and reduced execution failures.
- Unidirectional attention and probabilistic actions both gave >10% absolute boosts over ablations, suggesting architectural simplicity with the right inductive biases beats heavier, more complex models here.
An Intuition Check:
- Multi-strategy coverage is crucial: small items often need pinch/tripod; medium items prefer whole-hand; large ones require bimanual. A single strategy fails often, which is why DexGraspNet's single-hand limit hurts large-object performance.
- Point-cloud fidelity matters: adding "imaged" robot points stabilizes perception of self vs. object, reducing accidental collisions.
Bottom Line:
- In both sim and reality, the approach consistently outperforms baselines, especially on large and unseen objects, validating that synthetic, physics-and-planning-grounded data can teach versatile two-hand grasping at scale.
05 Discussion & Limitations
Limitations (honest look):
- Synthetic bias: Even with randomization and imaged robot points, synthetic scenes differ from cluttered, glossy, transparent, or deformable real objects; performance may dip in highly cluttered shelves or with squishy packaging.
- Mesh/physics assumptions: Hard-finger contacts and friction cones are approximations; unusual materials (very low friction, sticky tape, deformables) break assumptions.
- Bimanual scope: While four strategies are covered, finer-grained in-hand reorientation or rapidly changing contacts (e.g., sliding grasps) aren't the focus.
- Sensing limits: Commodity depth cameras generate noisy or missing points for shiny/transparent items; heavy occlusions can hide crucial geometry.
Required Resources:
- Compute for large-scale data generation (optimization + motion planning) and training on 20M frames.
- Dual-arm robot with dexterous hands and calibrated cameras; software stack for motion planning and control.
- Storage for dataset and logs; GPU(s) for training the point-cloud transformer.
When NOT to Use:
- Soft, deformable, or fluid-filled objects where rigid-contact assumptions fail.
- Extreme clutter where safe approach paths are too narrow and no grasp fits near the current pose.
- Tasks needing complex post-grasp manipulation (e.g., precise in-hand rotation) not represented in the dataset.
Open Questions:
- How to extend to cluttered, tight-packed shelves with grasp-then-regrasp or non-prehensile pre-shaping (pushing, sliding)?
- Can we model deformable contacts or suction in the same pipeline for groceries, clothes, or bags?
- How far can language or part-aware models help choose strategy (pinch vs. whole-hand) by object function?
- Can few-shot real data further boost sim-trained policies with minimal labeling (self-calibration on-the-fly)?
- How to incorporate tactile sensing to adapt squeeze forces mid-grasp and prevent slips in difficult materials?
06 Conclusion & Future Work
Three-sentence summary:
- This paper introduces UltraDexGrasp, which fuses physics-aware grasp optimization with dual-arm motion planning to create UltraDexGrasp-20M, a massive multi-strategy dataset for two-hand dexterous grasping.
- A simple point-cloud policy with unidirectional attention and probabilistic actions learns from this synthetic data to perform pinch, tripod, whole-hand, and bimanual grasps across diverse objects.
- The policy achieves 84.0% success in simulation and 81.2% zero-shot in the real world, outperforming strong baselines.
Main Achievement:
- Proving that large-scale, synthetic, physics-and-planning-grounded data can train a compact, robust policy for universal bimanual dexterous grasping that transfers to reality without extra fine-tuning.
Future Directions:
- Add cluttered-scene reasoning, deformable-object modeling, and tactile feedback for adaptive squeezing and regrasping.
- Explore language or part-aware cues to pick functional grasps, and integrate few-shot real data for rapid adaptation.
- Expand strategies to include in-hand reorientation and dynamic grasps for even more general manipulation.
Why Remember This:
- It reframes "you must collect tons of real demos" into "you can synthesize diverse, physics-valid, planned demos at scale," then learn a generalist two-hand grasping policy that really works on real robots. That's a practical, scalable path toward human-like manipulation.
Practical Applications
- Home robots that pick up toys (pinch), mugs (whole-hand), and laundry baskets (bimanual) without reprogramming.
- Warehouse picking systems that safely handle tiny items to bulky boxes in one workflow.
- Hospital support robots that gently grasp medical tools or steady trays with two hands.
- Manufacturing assistants that adapt grasps to new parts without new fixtures.
- Service robots in retail that stock shelves, switching strategies based on item size and fragility.
- Lab automation that manipulates vials (pinch), instruments (tripod/whole-hand), and heavy equipment (bimanual).
- Education platforms for teaching dual-arm coordination and grasp physics with open-source tools.
- Inspection/maintenance bots that hold a panel with one hand while operating a tool with the other.
- Household assistive devices that help elderly users by safely lifting and passing objects of many kinds.
- Robotic research testbeds for studying sim-to-real transfer of complex two-hand skills.