
ArtLLM: Generating Articulated Assets via 3D LLM

Intermediate
Penghao Wang, Siyuan Xie, Hongyu Yan et al. · 3/1/2026
arXiv

Key Summary

  • ArtLLM is a 3D large language model that turns a rough 3D shape (from an image, text, or mesh) into a complete, movable 3D object with parts and joints.
  • It writes a tokenized 'blueprint' of parts and joints (like a recipe) and then asks a part-aware generator to make each part’s detailed geometry.
  • Instead of guessing exact numbers, ArtLLM predicts from smartly chosen bins (quantization), which makes it stable and accurate.
  • It handles different numbers of parts and joint types in one go (autoregressive prediction), so it works for simple doors and more complex furniture.
  • A physics-based check fixes joint limits to avoid parts bumping into each other when they move.
  • On the PartNet-Mobility benchmark, it beats strong baselines in both part layout and joint prediction while running faster.
  • It generalizes to real objects and can build digital twins that move like the real thing, helping robots learn safely in simulation.
  • The training corpus mixes curated datasets and procedurally generated objects, covering many categories and structures.
  • The pipeline exports standard URDF assets that plug into common robot simulators.
  • This bridges the gap between pretty shapes and correct motion, making 3D worlds more interactive and useful.

Why This Research Matters

ArtLLM turns single images or text prompts into 3D objects that don’t just look right but also move right. That means faster game and AR/VR content creation without hand-tuning every hinge. Robots can practice on faithful digital twins before touching real hardware, saving time and reducing breakage risk. Educators and researchers can quickly build realistic, interactive datasets for training and testing. Designers can iterate on product concepts with both shape and motion in mind. And because the system outputs URDF, it plugs into common simulators immediately, accelerating pipelines across many fields.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine a toy box full of things that move—lids flip, drawers slide, and handles rotate. It’s fun because the parts don’t just look real; they act real too.

🥬 The Concept: An articulated object is a 3D thing made of parts that can move around joints (like doors on hinges or drawers on sliders).

  • How it works: You split the object into parts, connect parts with joints (hinge, slider, etc.), and set limits so they move naturally.
  • Why it matters: Without correct parts and joints, a beautiful 3D model can’t open, close, or interact, which breaks games, sims, and robot training.

🍞 Anchor: A kitchen cabinet in a game looks right and opens properly because it’s an articulated object with a door (part), a hinge (joint), and a swing range (limit).

The world before: People could already make great-looking 3D shapes from images or text. But making those shapes actually move like the real things? That was the hard part. Two popular paths tried to solve this:

  • Optimization-based reconstruction: Given many views or a video, optimize geometry and joints. It worked, but was slow per object, often low-fidelity, and mostly handled simple one-joint cases.
  • Retrieval-based assembly: Pick parts from a fixed library and piece them together fast. It was quick, but shapes started looking repetitive, and it struggled with novel geometry.

🍞 Hook: You know how a Lego set looks amazing when finished—but if you only have a small box of bricks, you keep building the same castle.

🥬 The Concept: Retrieval-based generation uses a fixed library of parts to assemble objects.

  • How it works: Detect a rough layout, fetch best-matching parts from the library, and snap them together.
  • Why it matters: If the library doesn’t have the exact part or size, you get mismatches and repeated designs.

🍞 Anchor: If your part library lacks a tall, skinny door, your fridge door ends up too short or too wide.

Meanwhile, the 3D generation community made huge strides—point/voxel-native models, part-aware generators, and amazing mesh quality. But there was a big gap: geometry didn’t understand motion. Models could make a great-looking handle that wasn’t connected to anything or a cabinet that couldn’t open.

🍞 Hook: Think of a pretty car drawing with wheels that don’t turn. Nice to look at, but it won’t roll.

🥬 The Concept: The geometry–motion disconnect is when a 3D model looks right but lacks correct moving parts.

  • How it works: Pure geometry models focus on shape; they don’t know which parts should move or how.
  • Why it matters: Robots can’t practice, games feel fake, and simulations can’t test real interactions.

🍞 Anchor: A microwave that can’t open is useless for a cooking game or a robot kitchen demo.

The missing piece was a unified brain that could understand a rough 3D object, decide where the parts should be, define how they move, and then create detailed geometry that matches this plan. That’s the gap ArtLLM fills.

🍞 Hook: You know how architects draw a floor plan before builders make the actual rooms?

🥬 The Concept: ArtLLM first writes a tokenized blueprint of parts and joints, then a generator builds the detailed parts.

  • How it works: From a point cloud, it predicts part boxes and joints as tokens; then a part-aware generator makes each part’s shape.
  • Why it matters: Planning first makes the final object consistent: the doors fit, the hinges align, and the motions are correct.

🍞 Anchor: The plan says “two doors on the front, both hinge on the sides, open 100°”; the generator then makes doors that actually fit and swing correctly.

Real stakes: Fast, accurate articulated assets power digital twins for robots (so they can practice safely), speed up content creation for games/AR/VR, and make simulations more realistic for training and testing. When objects move like the real world, everything from home assistants to factory arms gets smarter, safer, and cheaper to develop.

02Core Idea

🍞 Hook: Picture a teacher who can look at a cardboard model of a kitchen, tell you where the doors and drawers are, explain how they move, and then help you build the real thing.

🥬 The Concept: The key insight: Treat articulation as a language problem—write the object’s parts and joints as a sequence of tokens that a 3D LLM can read and generate.

  • How it works: Feed in a point cloud; the model autoregressively writes a structured “parts and joints” script; then a part generator turns the script into detailed geometry; a physics check fixes motion ranges.
  • Why it matters: This one process covers variable numbers of parts, different joint types, and precise motion in a single, stable pipeline, without slow per-object optimization or fixed part libraries.

🍞 Anchor: Like writing a recipe step-by-step (tokens), cooking each dish (parts), and then tasting to fix the seasoning (physics limit correction).

Three analogies:

  1. Blueprint then build: First draw rooms and doors (part boxes and hinges), then furnish the rooms (geometry), and finally test the doors so they don’t hit walls (limit correction).
  2. Music sheet then performance: Write notes (tokens for parts/joints), play them (generate parts), and tune the instrument (fix limits) so no strings buzz.
  3. Lego plan then assembly: Choose blocks and connectors (parts and joints) by IDs, snap them in the right order (autoregression), then check movement (collision test) so creations don’t jam.

Before vs After:

  • Before: Either slow and fragile (optimize per object) or repetitive and limited (retrieve from a library). Geometry often ignored motion consistency.
  • After: A single, fast system that writes the articulation plan and then makes the matching geometry, with a final physics touch-up. Shapes look right and act right.

🍞 Hook: You know how guessing exact answers can be hard, but choosing from good multiple-choice options is easier?

🥬 The Concept: Quantization turns continuous numbers (like positions, angles) into tokens from neat bins, so the LLM can predict them reliably.

  • How it works: Coordinates, axes, and motion limits are mapped into discrete bins or a learned codebook; the LLM picks tokens instead of raw decimals.
  • Why it matters: Language models love tokens. This avoids tiny numeric wobble and keeps structure consistent across the sequence.

🍞 Anchor: Instead of saying “axis is (0.001, 0.999, 0.002)”, the model says “axis code 17,” which is a clean, stable choice.
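A minimal sketch of this binning idea, assuming 128 uniform bins over a normalized [-1, 1] range (the paper's exact bin count and layout may differ):

```python
import numpy as np

# Assumed bin layout: 128 uniform bins over a normalized [-1, 1] range.
NUM_BINS = 128
EDGES = np.linspace(-1.0, 1.0, NUM_BINS + 1)

def quantize(value: float) -> int:
    """Map a continuous coordinate to a discrete bin index (a token ID)."""
    # Clip to the modeled range, then locate the bin via the interior edges.
    return int(np.digitize(np.clip(value, -1.0, 1.0), EDGES[1:-1]))

def dequantize(token: int) -> float:
    """Recover a representative value: the center of the chosen bin."""
    return float((EDGES[token] + EDGES[token + 1]) / 2.0)

token = quantize(0.437)     # a clean, discrete choice
approx = dequantize(token)  # off by at most half a bin width
```

Round-tripping through the bins bounds the numeric error by half a bin width, which is the stability the text describes: the model only ever has to pick one of 128 clean options.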

Why it works (intuition):

  • Autoregression mirrors how we describe objects—first list parts, then explain how they connect and move.
  • Discrete tokens match LLM strengths: structure, grammar, and long-range consistency.
  • A dedicated axis codebook covers common hinge directions (like x, y, z) and less common ones, balancing precision and simplicity.
  • Multi-task, staged training first teaches the model to find parts well, then to reason about joints, making learning smoother.
  • A physics-based limit fix aligns the final motion with reality by catching collisions the single snapshot can’t reveal.
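The axis codebook intuition above can be sketched as a nearest-neighbor lookup. This toy codebook is hand-picked and illustrative; the paper's codebook is learned and larger:

```python
import numpy as np

# Toy axis codebook: canonical axes plus a few diagonals (illustrative only).
CODEBOOK = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [-1, 0, 0], [0, -1, 0], [0, 0, -1],
    [1, 1, 0], [1, 0, 1], [0, 1, 1],
], dtype=float)
CODEBOOK /= np.linalg.norm(CODEBOOK, axis=1, keepdims=True)

def axis_to_code(axis: np.ndarray) -> int:
    """Snap a noisy axis direction to the nearest codebook entry (max cosine)."""
    axis = axis / np.linalg.norm(axis)
    return int(np.argmax(CODEBOOK @ axis))

# A near-vertical axis with numeric wobble snaps to the clean +y entry.
code = axis_to_code(np.array([0.001, 0.999, 0.002]))
```

This is exactly the "axis code 17" idea from the anchor above: instead of emitting three fragile decimals, the model picks one clean ID.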

🍞 Hook: Imagine labeling boxes before packing a moving truck; labels keep things organized so nothing gets lost or jammed.

🥬 The Concept: Bounding boxes are labeled 3D boxes around each part that guide both joint prediction and part generation.

  • How it works: The model predicts boxes first; joints are then defined using these boxes; the generator builds parts that fit inside.
  • Why it matters: Boxes give a clean scaffold, so joints have correct parents/children and parts don’t overlap weirdly.

🍞 Anchor: Predict a door box on the right of a fridge, then place a hinge along the right edge; the door part is generated to fill that box and swing correctly.

03Methodology

High-level recipe: Input (image/text/mesh → point cloud) → 3D LLM writes a tokenized articulation blueprint (parts first, then joints) → Part-aware generator makes each part’s geometry → Physics-based limit correction → URDF asset ready for simulators.

Step A: Point cloud in, tokens out (part layout) 🍞 Hook: Think of sprinkling dots all over an object’s surface like stars outlining a constellation.

🥬 The Concept: A point cloud is a set of 3D dots (with normals) sampled from the object surface; a 3D encoder understands these dots.

  • How it works: Sample tens of thousands of points from the mesh. A Point Transformer extracts 3D features; a projector aligns them with the language model. The LLM then predicts a sequence of part bounding boxes as tokens (min/max corners in discrete bins), in a consistent order.
  • Why it matters: Boxes come first so joints can reference them; ordered boxes make training stable and outputs consistent.

🍞 Anchor: For a cabinet, the model might output three boxes: frame, left door, right door—each box given by six discretized coordinates.
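The cabinet example above can be sketched as box-to-token serialization; the bin count, ordering, and part names here are assumptions for illustration:

```python
import numpy as np

NUM_BINS = 128  # assumed bin resolution; the paper's may differ

def box_to_tokens(box_min: np.ndarray, box_max: np.ndarray) -> list:
    """Turn one part's axis-aligned box into six discrete tokens
    (min corner x,y,z then max corner x,y,z)."""
    corners = np.concatenate([box_min, box_max])
    bins = np.floor((np.clip(corners, -1, 1) + 1) / 2 * (NUM_BINS - 1) + 0.5)
    return bins.astype(int).tolist()

# Hypothetical cabinet layout, emitted in a consistent part order
# so the token sequence is stable across training examples.
parts = {
    "frame":      ([-0.5, -0.5, -0.5], [0.5, 0.5, 0.5]),
    "left_door":  ([-0.5, -0.5, 0.45], [0.0, 0.5, 0.5]),
    "right_door": ([0.0, -0.5, 0.45],  [0.5, 0.5, 0.5]),
}
sequence = [t for lo, hi in parts.values()
            for t in box_to_tokens(np.array(lo), np.array(hi))]
# 3 parts x 6 coordinates = 18 layout tokens
```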

Step B: Joints from parts (articulation blueprint) 🍞 Hook: After you know where the doors are, deciding where the hinges go is easy.

🥬 The Concept: Autoregressive joint prediction defines how parts connect: type (revolute/continuous/prismatic/screw), which parts they link, axis direction (from a codebook), pivot position, and motion limits (as bins).

  • How it works: The LLM continues the sequence—after a special separator, it emits joint tokens: joint type ID, parent/child box IDs, axis code, pivot tokens, and limit tokens.
  • Why it matters: Writing joints after boxes ensures axis placement and motion ranges are conditioned on the actual layout.

🍞 Anchor: For a right-side fridge door, the model writes: Revolute joint, parent=frame, child=rightDoor, axis=vertical (codebook ID near +y), pivot near the door’s right edge, range about a quarter turn.
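That fridge-door anchor can be written out as one joint's token record. The field order, type IDs, and bin values below are illustrative, not the paper's exact vocabulary:

```python
# Joint-type IDs are assumptions for this sketch.
JOINT_TYPES = {"revolute": 0, "continuous": 1, "prismatic": 2, "screw": 3}

def joint_to_tokens(jtype, parent_box, child_box, axis_code,
                    pivot_bins, limit_bins):
    """Emit one joint's token record: type, parent/child part IDs,
    axis codebook ID, quantized pivot, quantized motion limits."""
    return [JOINT_TYPES[jtype], parent_box, child_box, axis_code,
            *pivot_bins, *limit_bins]

# Right fridge door: revolute, parent=frame (part 0), child=right door
# (part 2), vertical axis code, pivot near the door's right edge,
# roughly a quarter-turn range encoded as two limit bins.
tokens = joint_to_tokens("revolute", 0, 2, 1,
                         pivot_bins=[120, 64, 110], limit_bins=[64, 96])
```

Because the joint tokens come after the box tokens in the sequence, the parent/child IDs can simply index the boxes the model already wrote.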

Step C: Make parts look real (geometry synthesis) 🍞 Hook: A floor plan is helpful, but you still need furniture that fits.

🥬 The Concept: A part-aware generative model (e.g., XPart) creates high-fidelity meshes that match each predicted box.

  • How it works: Feed the predicted boxes into the generator to synthesize each part’s detailed geometry. If some object points sit just outside a box, slightly expand that box to capture all surface points so parts aren’t cut off.
  • Why it matters: This avoids reliance on a fixed part library and prevents broken geometry due to tight boxes.

🍞 Anchor: If a door’s handle points stick out, the box expansion step nudges the box so the generator includes the full handle.
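The box-expansion step can be sketched as follows; the tolerance for "just outside" is an assumed threshold, not a value from the paper:

```python
import numpy as np

def expand_box(box_min, box_max, surface_points, tol=0.05):
    """Expand a predicted part box to cover surface points that fall just
    outside it, so generated geometry isn't cut off. `tol` (how far outside
    still counts as this part) is an assumed threshold."""
    near = np.all((surface_points >= box_min - tol) &
                  (surface_points <= box_max + tol), axis=1)
    pts = surface_points[near]
    if len(pts) == 0:
        return box_min, box_max
    return np.minimum(box_min, pts.min(axis=0)), \
           np.maximum(box_max, pts.max(axis=0))

# A handle tip at x=0.55 sits just outside a door box ending at x=0.5;
# the expanded box stretches to include it.
box_min, box_max = np.array([0., 0., 0.]), np.array([0.5, 1.0, 0.1])
pts = np.array([[0.25, 0.5, 0.05], [0.55, 0.5, 0.05]])
new_min, new_max = expand_box(box_min, box_max, pts)
```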

Step D: Physics-based joint limit correction 🍞 Hook: When installing a door, you swing it to check it doesn’t hit the wall; if it does, you adjust the stopper.

🥬 The Concept: A collision-aware pass adjusts motion limits so moving parts don’t intersect others.

  • How it works: Temporarily move (sample) the child part across its predicted range and measure collisions against static parts. Where collisions start, clamp the limit. Repeat for sliders.
  • Why it matters: The model only saw a single snapshot; this step simulates motion to keep it physically plausible.

🍞 Anchor: If a dishwasher door would scrape the floor at 95°, the corrected limit clamps to around 90°.
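A minimal sketch of the limit-clamping sweep, assuming a collision oracle `check_collision(angle)` that poses the child part and tests it against the static parts (the step size and sweep direction are assumptions):

```python
import numpy as np

def clamp_revolute_limit(check_collision, limit_deg, step_deg=1.0):
    """Sweep a hinged part from 0 up to its predicted limit and clamp the
    range at the first colliding angle. An analogous sweep over distances
    handles prismatic (slider) joints."""
    for angle in np.arange(0.0, limit_deg + step_deg, step_deg):
        if check_collision(min(angle, limit_deg)):
            return max(0.0, angle - step_deg)  # last collision-free angle
    return limit_deg

# Toy stand-in: a dishwasher door scrapes the floor past 90 degrees,
# so the predicted 95-degree limit gets clamped back.
hits_floor = lambda angle: angle > 90.0
corrected = clamp_revolute_limit(hits_floor, limit_deg=95.0)
```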

Step E: Output to URDF for simulators 🍞 Hook: A recipe card you can hand to any chef is handy; URDF is that card for robots.

🥬 The Concept: URDF is a standard robot description format that lists links (parts) and joints.

  • How it works: Combine generated part meshes with the predicted joints and corrected limits, then export one URDF bundle per object.
  • Why it matters: Standard format means instant use in engines like SAPIEN and other robotics tools.

🍞 Anchor: The final fridge asset loads into a simulator; its doors open smoothly within safe angles and don’t collide.
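A minimal sketch of what the exported description looks like for one hinge; real exports bundle every part mesh, and the link/joint names and effort/velocity values here are illustrative:

```python
import math

# One link/joint pair in standard URDF, with the corrected 90-degree
# swing written into the limit element (URDF limits are in radians).
URDF_TEMPLATE = """<robot name="fridge">
  <link name="frame"/>
  <link name="right_door"/>
  <joint name="door_hinge" type="revolute">
    <parent link="frame"/>
    <child link="right_door"/>
    <origin xyz="0.45 0 0"/>
    <axis xyz="0 0 1"/>
    <limit lower="0.0" upper="{upper:.3f}" effort="10" velocity="1"/>
  </joint>
</robot>
"""

urdf = URDF_TEMPLATE.format(upper=math.radians(90.0))
with open("fridge.urdf", "w") as f:
    f.write(urdf)
```

Because the format is standard, this file loads unchanged into URDF-aware simulators such as SAPIEN.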

The secret sauce:

  • Language of articulation: Casting parts/joints as tokens lets the LLM leverage structure and long-range dependencies.
  • Quantization + axis codebook: Predicting from bins/code IDs stabilizes numbers and prioritizes common axes while covering rare ones.
  • Multi-task, multi-stage training: First nail part layouts, then add joints—like learning to recognize objects before learning how they move.
  • Layout-first generation: A clear scaffold ensures geometry and motion align.
  • Physics check: Post-fix motion ranges based on actual collisions, not guesses.

Training data and strategy (brief):

  • Curated from PartNet-Mobility, PhysX3D, plus procedural Infinite-Mobility; filtered, normalized, and simplified for stable learning.
  • Stage 1: Train only on part layouts to ground 3D perception.
  • Stage 2: Train on layouts + joints (including a task that uses ground-truth boxes) to improve kinematic reasoning.
  • Data augmentations (scales, rotations) make the model robust to different object poses.
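The pose augmentations above can be sketched as a random scale plus a random yaw rotation of the input cloud; the exact ranges and axes the paper uses are not specified here:

```python
import numpy as np

def augment(points: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random uniform scale and a random rotation about z,
    a common pose augmentation for point clouds (ranges are assumed)."""
    scale = rng.uniform(0.8, 1.2)
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return scale * points @ rot.T

rng = np.random.default_rng(0)
cloud = rng.standard_normal((1024, 3))
augmented = augment(cloud, rng)  # same shape, new pose
```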

04Experiments & Results

🍞 Hook: Imagine a school contest where teams must build cabinets that not only look right but also open, close, and never bump into each other.

🥬 The Concept: The team tested ArtLLM on a standard benchmark (PartNet-Mobility) and compared it to strong baselines that reconstruct or retrieve parts.

  • How it works: Given test objects across seven categories (like refrigerator, dishwasher, oven), evaluate: part layout quality, joint type correctness, axis direction accuracy, pivot placement, motion range accuracy, and whether the whole part–joint graph matches.
  • Why it matters: Good scores mean objects look right, move right, and the overall structure makes sense for simulation and robotics.

🍞 Anchor: Think of grading: A for part boxes fitting well, A for hinges of the right kind and direction, A for where they’re placed, and A for how far they swing.

What they measured and why:

  • Part layout (mIoU): Do predicted boxes overlap the true parts well?
  • Joint type accuracy: Hinge vs slider vs screw—get the type right.
  • Axis and pivot errors: Is the door’s hinge vertical and in the right place?
  • Range IoU: Is the swing/slide range correct?
  • Graph accuracy: Does the parent–child structure match the truth?
  • Runtime: Is it fast enough for scaling up content creation?
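The part-layout score is built from box overlaps; a minimal 3D IoU for axis-aligned boxes, with an illustrative door example, looks like this:

```python
import numpy as np

def box_iou(min_a, max_a, min_b, max_b) -> float:
    """3D intersection-over-union of two axis-aligned boxes, the kind of
    overlap score averaged into the part-layout mIoU metric."""
    inter = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b),
                    0, None)
    vol_i = inter.prod()
    vol_a = (max_a - min_a).prod()
    vol_b = (max_b - min_b).prod()
    return float(vol_i / (vol_a + vol_b - vol_i))

# A predicted door box shifted slightly along x against the ground truth.
gt_min, gt_max = np.array([0., 0., 0.]), np.array([1., 2., 0.1])
pr_min, pr_max = np.array([0.1, 0., 0.]), np.array([1.1, 2., 0.1])
iou = box_iou(gt_min, gt_max, pr_min, pr_max)
```

A perfect match scores 1.0; the small shift here already costs noticeable overlap, which is why mIoU is a sensitive gauge of layout quality.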

Competition:

  • URDFormer: Retrieval-leaning pipeline; limited categories and simple structures.
  • SINGAPO: Single-image controlled generation with retrieval; fast but library-limited.
  • Articulate-Anything: Vision–language-based retrieval approach using a large VLM; strong on geometry retrieval, weaker on axis details.

Scoreboard (contextualized):

  • ArtLLM’s overall mIoU of about 0.69 is like scoring an A on fitting parts, while some baselines hover closer to a B or C.
  • Joint type accuracy around 0.91 shows it chooses the right kind of joint most of the time—like picking the correct tool for the job.
  • Joint range IoU near 0.74 means the allowed motion is broadly correct, and the physics post-check keeps it collision-free.
  • Graph accuracy around 0.77 means the overall connection map is mostly spot-on.
  • Runtime: ArtLLM is substantially faster than heavy optimization pipelines, enabling practical batch generation.

Surprising findings:

  • Even strong geometry-retrieval systems misplace axes or choose the wrong motion type more often than expected. This shows why a layout-first, language-driven articulation blueprint helps: it thinks about motion while planning geometry.
  • A simple box expansion trick meaningfully reduces cut-off parts, improving final realism.
  • A short physics-based limit pass fixes many real-world issues (like doors scraping) that aren’t obvious from a single snapshot.

Generalization:

  • Real images beyond the benchmark categories still yield coherent assets: retrieval systems can copy appearance well, but ArtLLM nails the motions more consistently.

Ablations (what matters most):

  • Predicting continuous numbers directly hurts performance; token bins are more stable.
  • Removing multi-task training reduces several metrics; learning layouts and joints together (with a curriculum) helps.
  • Skipping the first-stage layout pretraining weakens both box and joint quality; good geometry perception is a foundation for good articulation.

05Discussion & Limitations

Limitations:

  • Category breadth: Trained mostly on household-like items; performance dips on complex machines (vehicles, robots with many degrees of freedom).
  • Physics properties: The system refines limits by collisions but doesn’t yet predict full physical attributes (mass, friction, damping) jointly with geometry and joints.
  • Hidden internals: Single-image reconstructions can miss inner structures (like racks inside an oven), so generated parts may be too simple or intersect internally.
  • Tight overlaps: Highly overlapping parts are still tricky for the generator, which may produce intersections.

Required resources:

  • A point-cloud/mesh encoder, a compact 3D LLM backbone, GPU resources (multi-GPU helpful) for training; at inference, a single modern GPU suffices.
  • Access to a part-aware generator like XPart or OmniPart.
  • Basic physics/collision checking for final limit refinement and a simulator that reads URDF.

When NOT to use:

  • Very high-DOF robots or vehicles with complex linkages where precise kinematics and dynamics are critical (e.g., industrial robot arms with torque limits), unless the system is extended with physics-aware training.
  • Cases needing precise internal mechanical tolerances (gears, cams) or exact material properties.
  • Scenarios where you must reconstruct fully accurate internals from a single image.

Open questions:

  • Can we learn physical attributes (mass, friction, compliance) along with geometry and joints so motion not only avoids collisions but also feels right dynamically?
  • How to handle occluded internal structures from sparse inputs—multi-view cues, generative priors for hidden parts, or language-guided hallucination with uncertainty?
  • Can an open-vocabulary articulation model handle unusual categories (bikes, strollers, drones) without special retraining?
  • Could the axis codebook and token bins adaptively refine themselves per object for even sharper accuracy?

06Conclusion & Future Work

Three-sentence summary:

  • ArtLLM treats articulated object creation as a language task: it writes a tokenized blueprint of parts and joints from a point cloud, then builds detailed parts and fixes motion limits with a physics check.
  • This unifies geometry and motion in one fast pipeline, outperforms strong baselines on part layout and joint accuracy, and generalizes well to real-world inputs.
  • The result is simulation-ready URDF assets that look right and move right—great for games, AR/VR, and robot learning.

Main achievement:

  • Casting articulation as token generation with a 3D LLM—plus stable quantization, a smart axis codebook, and a layout-first plan—bridges the geometry–motion gap at scale.

Future directions:

  • Add physics-aware learning (mass, friction, compliance) and better reasoning about hidden internals.
  • Expand to open-vocabulary categories and higher-DOF systems (vehicles, robots) with richer kinematic templates.
  • Tighten the loop between blueprint prediction, geometry synthesis, and physics so each step adapts to the others.

Why remember this:

  • ArtLLM shows that when we make motion part of the plan—not an afterthought—3D worlds become truly interactive. It’s a practical path to high-quality digital twins and scalable robot learning, turning pretty shapes into useful, believable tools.

Practical Applications

  • Rapidly create simulation-ready appliances (ovens, fridges, dishwashers) for robotics manipulation training.
  • Generate consistent, interactive props for games and AR/VR without manual rigging.
  • Build digital twins from a single product photo to test usability and motion ranges.
  • Prototype furniture with correct doors/drawers for virtual showrooms and configurators.
  • Augment robot datasets with diverse, realistic articulated assets for robust policy learning.
  • Auto-rig 3D marketplace assets to standard formats (URDF) for instant simulator integration.
  • Perform quick what-if motion checks (e.g., door swing clearance) during design.
  • Create interactive educational demos of mechanisms (hinges, sliders, screws) for STEM classes.
  • Accelerate virtual testing of household assistants by populating homes with accurate articulated items.
  • Support film/previs teams with fast-turnaround, motion-correct assets for scene blocking.
#Articulated 3D objects #3D large language model #Point cloud understanding #Autoregressive generation #Quantization for geometry #URDF asset creation #Part-aware 3D generation #Kinematic structure prediction #Joint limit correction #Digital twins for robotics #3D foundation models #Axis codebook #Physics-based refinement #PartNet-Mobility benchmark #Multimodal 3D LLM