RynnBrain: Open Embodied Foundation Models
Key Summary
- RynnBrain is an open-source 'robot brain' that helps machines see, think, and plan in the real world across space and time.
- It unifies four big skills in one model: first-person (egocentric) understanding, finding things in space and time, reasoning tied to real physics, and planning steps that include exact locations.
- Instead of guessing with only words, RynnBrain mixes words with coordinates (points, boxes, and paths) so its answers are physically grounded.
- A new 'Chain-of-Point' reasoning style lets the model think step by step while pointing to evidence in frames, which cuts down on hallucinations.
- RynnBrain comes in three sizes (2B, 8B, 30B-A3B MoE) and three tuned variants for reasoning, navigation, and manipulation/action control.
- Across 28 benchmarks, it beats prior embodied models by clear margins and keeps strong general vision skills on images and videos.
- A carefully built data flywheel (model + human loop) scales training to ~20M samples covering OCR, spatial grounding, planning, grasping, and trajectories.
- RynnBrain-Bench is a new, rigorous test set for fine-grained, spatio-temporal understanding from egocentric videos.
- Physics-aware plans that include target areas and affordances make downstream robot controllers more reliable and precise.
- The project is fully open (code, models, data, benchmarks), aiming to be a foundation for general-purpose embodied intelligence.
Why This Research Matters
Robots that can understand where things are and how they move can finally help with real chores—like safely fetching items, organizing desks, or assisting in hospitals. By putting exact coordinates into plans, RynnBrain makes instructions that downstream controllers can actually execute, reducing trial-and-error and accidents. This can save time and improve safety in warehouses, factories, and homes. Its open release means schools, startups, and labs can build on a strong, shared foundation rather than starting from scratch. Better grounded reasoning also supports accessibility tools—like smart assistants that can read labels and place items exactly where someone needs them. Over time, this approach points toward trustworthy AI teammates that operate in our world, not just talk about it.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how learning to ride a bike is different from just reading about bikes? Riding needs balance, timing, and reacting to the world in front of you—not just facts.
🥬 The Concept (Embodied Intelligence): Embodied intelligence means AI that not only understands with eyes and ears (cameras and microphones) but also thinks and acts in the physical world. How it works: 1) It takes in what it sees and hears from its own point of view, 2) reasons about what to do next, 3) then plans and moves its body (like a robot arm or wheels). Why it matters: Without embodiment, AI can talk about the world but can’t safely and reliably do things in it.
🍞 Anchor: A kitchen robot must spot the fridge, know the door opens outward, reach the handle, and pull—at the right time and place.
The world before: Over the last few years, big AI models that read images and videos (multimodal foundation models) became great at naming objects, describing scenes, and answering questions. They learned from huge amounts of internet pictures and clips, so they knew a little about almost everything. But they weren’t built to act. They could say, “That’s a fridge,” but not reliably decide where to grab it or when to pull the handle.
🍞 Hook: Imagine a coach who knows all the rules of soccer but has never actually run on the field. Good at talking, not at playing.
🥬 The Concept (Multimodal Foundation Models): These are models trained to understand many data types at once—like text, images, and videos—so they can answer visual questions and hold conversations about what they see. How it works: 1) A vision encoder turns pictures or video frames into tokens, 2) a language model understands the question and the visual tokens together, 3) it generates a text answer. Why it matters: Without multimodal learning, the model can only guess about images from text patterns.
🍞 Anchor: Ask, “What is the person doing in this video?” and it answers, “Stirring soup with a spoon.”
The problem: When we tried to use those models as a “robot brain,” three issues showed up.
- Narrow egocentric skills: Many models learned from third-person, static images, so they struggled with first-person views that move and blur and with tasks that need constant updating.
- Missing spatio-temporal grounding: They could point to an object in one frame but lost track across time—hard for navigation or manipulation where “where” and “when” matter together.
- Text-only planning: High-level plans written only in words often ignored physics (like reachability, sizes, or collisions), causing hallucinations or impossible instructions for robot controllers.
Failed attempts: Teams tried bolting tools onto language models—zooming into image regions or calling object detectors—hoping better vision would fix planning. Others trained action-centric systems that learned to move but forgot rich language and general knowledge. Both sides hit limits: either great at words but weak at doing, or good at a few actions but brittle outside the training chores.
The gap: We needed one unified “brain” that keeps the broad understanding of big vision-language models, but is built around physical space and time, and talks to controllers using concrete locations—points, boxes, and paths—not only prose.
Real stakes: In daily life, a helpful home robot must safely navigate hallways, read labels on pantry items, choose correct handles, and place objects correctly—without bumping into pets or dropping dishes. In warehouses or hospitals, seconds saved and mistakes avoided are huge. In education and accessibility tech, a physically grounded assistant can turn instructions into dependable help. That’s why this research matters: it aims to make AI that doesn’t just describe the world but cooperates with us inside it.
02 Core Idea
🍞 Hook: Imagine giving a friend directions like, “Go past the couch, grab the blue mug by its handle, and set it on the table’s right edge.” If your friend only heard the words but didn’t know exactly where those places were, it would be chaos.
🥬 The Concept (RynnBrain): RynnBrain is an open robot brain that unifies seeing, thinking, and planning across space and time, and it outputs both words and exact coordinates so plans match the real world. How it works: 1) It takes images or videos from the robot’s own view, 2) builds a memory over time, 3) reasons step by step, 4) and outputs a mix of text plus locations—points, boxes, paths—so downstream controllers can act safely. Why it matters: Without mixing language with coordinates, plans drift into fiction; with grounding, they become executable.
🍞 Anchor: For “Open the fridge,” RynnBrain points to the handle in a specific frame and then lists steps like, “Grip here (x,y), pull along this path.”
Multiple analogies for the key idea:
- GPS for thoughts: Instead of saying, “Turn soon,” it says, “Turn right at 30 meters,” combining instruction and coordinates.
- Sheet music for robots: Not just lyrics (text) but notes (points/paths) that a performer (controller) can play exactly.
- Recipe with pictures and arrows: Each step shows the ingredient and an arrow to where to cut, stir, or pour.
Before vs. After:
- Before: Text-only reasoning led to hallucinations like “grab the nonexistent handle.” Object finding was often per-frame with no long memory. Planning ignored affordances and exact placements.
- After: RynnBrain ties reasoning to evidence in frames (via Chain-of-Point), keeps temporal memory across videos, and writes physics-aware plans that include affordance points, areas, and trajectories.
Why it works (intuition, not math): The world is geometric. Plans that include geometry—where and when—naturally fit the laws of physics. By discretizing coordinates into tokens, the model “speaks” geometry just like it speaks words. By interleaving thinking with pointing, it forces each reasoning step to stay honest—anchored to visible facts. And by training on rich egocentric, OCR, counting, and planning data, it learns the patterns of real homes and hands, not just web pictures.
Building blocks (each introduced with a quick sandwich):
🍞 Hook: You know how when you wear a GoPro, the world shifts as you move—things get closer or farther, left or right. 🥬 The Concept (Egocentric Understanding): Understanding from the robot’s own moving viewpoint. How it works: process video frames, keep short-term memory, answer “what’s here, now?” Why it matters: Without it, the robot treats a changing scene like a static photo and gets confused. 🍞 Anchor: “How many chairs are around me?” The model counts from your current view.
🍞 Hook: Imagine a comic strip: each panel shows where characters are and when things happen. 🥬 The Concept (Spatiotemporal Localization): Finding who/what/where across space and time. How it works: 1) Pick the key frame, 2) mark objects (boxes), areas (points), or paths (trajectories). Why it matters: Without time-aware locating, you lose track of moving targets and can’t plan multi-step actions. 🍞 Anchor: “Where is the table at 10 seconds?” The model picks frame 10s and boxes the table.
🍞 Hook: If a box is heavy and on a high shelf, you don’t yank it straight down—you think about gravity and leverage. 🥬 The Concept (Physically Grounded Reasoning): Thinking that respects real-world physics and evidence in the video. How it works: alternate between text reasoning and pointing to proof. Why it matters: Without grounding, the model might suggest impossible moves. 🍞 Anchor: “Which is higher, object A or B?” It points to each object in frames and compares heights.
🍞 Hook: When you plan to pour milk, you choose the handle, tilt angle, and path to the cereal bowl. 🥬 The Concept (Physics-aware Planning): Making plans that include exact affordance points, areas, and paths. How it works: identify where to grasp, where to move along, where to place. Why it matters: Without precise points and paths, a controller can’t execute safely. 🍞 Anchor: “Pick up the tissue box at its center and place it on the wooden sofa at these coordinates.”
🍞 Hook: Solving a mystery is easier if you can circle clues on each page as you think. 🥬 The Concept (Chain-of-Point Reasoning): A step-by-step thinking style that mixes words with grounded points/boxes in specific frames. How it works: think → point → think → point, so every step stays tied to evidence. Why it matters: Without pointing, thoughts drift; with pointing, they stay real. 🍞 Anchor: “Count the drawers near the cabinet.” It marks each drawer’s box before giving the final count.
In short, the aha: Treat coordinates like a language. Make the model speak locations as naturally as it speaks sentences. Then, force each reasoning step to cite its visual source. That unlocks plans robots can actually carry out.
03 Methodology
At a high level: Input (egocentric images/videos + instruction) → Build spatio-temporal memory → Interleave reasoning with grounding (points/boxes/paths) → Output both text and spatial tokens for downstream control.
Step A: Unified spatio-temporal inputs
- What happens: Images (T=1) and videos (T>1) are treated the same way: a sequence of frames, each turned into visual tokens with time-position hints so the model knows the order.
- Why it exists: Without a unified view, the model would need separate logic for images and videos and would struggle to learn motion patterns.
- Example: For a 12-second clip sampled at 2 FPS, the model sees 24 ordered frames to answer “Where was the mug at 10 seconds?”
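To make the sampling arithmetic concrete, here is a minimal sketch of a uniform frame sampler. The helper name and exact sampling convention are assumptions; the source only states that a 12-second clip at 2 FPS yields 24 ordered frames.

```python
def sample_frames(duration_s, fps=2.0):
    """Return evenly spaced frame timestamps (in seconds) for a clip.

    Hypothetical helper: the exact sampler used in training is not
    specified; this just illustrates the frame-count arithmetic.
    """
    n = int(duration_s * fps)
    return [i / fps for i in range(n)]

# A 12-second clip sampled at 2 FPS gives 24 ordered, timestamped frames.
stamps = sample_frames(12, fps=2.0)
```

Each timestamp would be paired with a time-position hint so the model knows frame order.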
Step B: A shared output space that includes coordinates
- What happens: The model doesn’t only output words. It also outputs discrete coordinate tokens for points (x,y), bounding boxes (x1,y1,x2,y2), and trajectories (a sequence of points). All are normalized to 0–1000 so it “speaks” locations.
- Why it exists: Words like “near the left side” are vague for a robot arm. Coordinates tell the arm exactly where to go.
- Example: “Open the fridge” includes the handle point (870,390) and the fridge box ((405,14),(805,480)).
🍞 Hook: You know how a graph paper grid lets you say exactly where to place a sticker. 🥬 The Concept (Coordinate Tokens): Turning real-world positions into a small set of integer tokens (0–1000) that the model can generate like words. How it works: normalize positions to the grid and generate them as part of the answer. Why it matters: Without tokens, positions would be fuzzy text; with tokens, the plan is precise. 🍞 Anchor: “Place the cup at (330, 890).” Those numbers are tokens the model emits.
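A minimal sketch of one plausible normalization scheme makes this concrete. The rounding convention and helper names are assumptions; the source states only that coordinates are normalized to a 0-1000 grid and emitted as discrete tokens.

```python
GRID = 1000  # coordinates are normalized to a 0-1000 integer grid

def to_token(px, py, width, height):
    """Map a pixel position to discrete coordinate tokens (assumed scheme)."""
    return (round(px / width * GRID), round(py / height * GRID))

def to_pixels(tx, ty, width, height):
    """Map coordinate tokens back to pixel space for the controller."""
    return (tx / GRID * width, ty / GRID * height)

# A handle at pixel (1188, 499) in a 1280x720 frame becomes grid tokens:
tok = to_token(1188, 499, 1280, 720)
```

Because the grid is resolution-independent, the same token vocabulary works for any camera, and the quantization error stays below one thousandth of the frame size.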
Step C: Training with next-token prediction
- What happens: The model learns to guess the next token (word or coordinate) from everything it has seen so far (frames + text + earlier tokens).
- Why it exists: This simple recipe works for language models and now works for mixed text+location sequences.
- Example: Given “Pick up the blue box…,” the model predicts which frame to point to, then the path tokens.
Step D: Efficient infrastructure for long, varied sequences
- What happens: Videos and long dialogues create very uneven sequence lengths. The team balances the load across GPUs by assigning harder (longer) samples smartly, and uses a clever per-sample loss so workers don’t need extra syncing.
- Why it exists: Without this, training stalls when one worker gets a super-long video.
- Example: Sorting sequences by length and dealing them to keep all workers equally busy.
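The "sort and deal" idea can be sketched as a greedy scheduler: hand the longest remaining sample to whichever worker currently has the least total work. This is a simplified stand-in; the paper's actual load balancer is more sophisticated.

```python
import heapq

def balance(lengths, n_workers):
    """Deal samples (longest first) to the least-loaded worker.

    A minimal sketch of length-aware load balancing across GPUs;
    `lengths` are token counts per sample.
    """
    heap = [(0, w, []) for w in range(n_workers)]  # (total load, worker id, samples)
    heapq.heapify(heap)
    for length in sorted(lengths, reverse=True):
        load, w, bucket = heapq.heappop(heap)
        bucket.append(length)
        heapq.heappush(heap, (load + length, w, bucket))
    return [bucket for _, _, bucket in sorted(heap, key=lambda t: t[1])]

# One long video plus several short dialogues end up roughly evened out:
buckets = balance([100, 90, 50, 40, 30, 10], n_workers=2)
```

Without this, the worker holding the single longest video would finish last while the others idle.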
Step E: Architecture choices that fuse vision and language
- What happens: RynnBrain is built on Qwen3-VL with a vision encoder, a projector, and a decoder-only LLM. It uses DeepStack to handle many visual tokens and interleaved positional encodings so time and space line up well.
- Why it exists: This keeps general vision-language strength while adding precise grounding.
- Example: The same backbone can describe a frame and also emit a box for “the red mug by the stove.”
Step F: Physics-aware pretraining data
- What happens: A huge, varied data mix (~20M samples) teaches object attributes, spatial relations, counting, OCR in egocentric videos, object/area/affordance/trajectory/grasp localization, and planning with coordinates.
- Why it exists: Real-world reliability needs diverse, realistic, and physically meaningful cases.
- Example: OCR clips from Ego4D ask, “Which brand is on the cup near the stove?” and the model reads “CHASE.”
🍞 Hook: When you use a drawer, you don’t grab any random spot—you pull at the handle. 🥬 The Concept (Affordance): The best place on an object to interact (like a handle or button). How it works: the model learns to predict an affordance point to act on. Why it matters: Without affordances, the robot might push the wrong area and fail. 🍞 Anchor: “Push the drawer at (630, 602).” That’s the handle’s point.
Step G: Chain-of-Point (CoP) reasoning post-training
- What happens: After pretraining, RynnBrain-CoP learns to interleave thoughts and pointers: it writes a short reasoning step, then grounds it by pointing to areas or boxes in specific frames.
- Why it exists: This keeps reasoning honest and reduces hallucination—every step must show where it looked.
- Example: “I should count drawers near the nightstand [points to three exact boxes], so answer: 3.”
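To illustrate the interleaved style, here is a sketch of parsing a Chain-of-Point trace. The `<box>` tag serialization is purely hypothetical, chosen for illustration; it is not the model's actual output format.

```python
import re

# Hypothetical serialization of a Chain-of-Point answer: short reasoning
# steps interleaved with grounded boxes (tag names are illustrative only).
answer = (
    "I should count the drawers near the nightstand. "
    "<box>(120,400),(260,520)</box> one, "
    "<box>(120,540),(260,660)</box> two, "
    "<box>(120,680),(260,800)</box> three. Answer: 3"
)

def extract_boxes(text):
    """Pull every grounded box out of an interleaved reasoning trace."""
    pat = r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
    return [tuple(map(int, m)) for m in re.findall(pat, text)]

boxes = extract_boxes(answer)
```

The key property is that every claim ("one, two, three") is tied to a verifiable location, so a checker (or a reward function) can audit each step.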
Step H: CoP fine-tuning + reinforcement learning
- What happens: First, supervised training teaches the interleaving style. Then GRPO (a PPO-like method) nudges the model toward better grounded outputs using rewards that measure trajectory shape, affordance coverage, and whether area points lie inside the right region.
- Why it exists: Rule-based rewards make the model prefer physically correct grounding.
- Example: A predicted path that deviates a lot gets lower reward than a closely matched path.
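Rule-based rewards of this kind are easy to sketch. The tolerance and functional forms below are assumptions, not the paper's exact formulas; they just show how trajectory deviation and area containment can be scored without a learned reward model.

```python
def trajectory_reward(pred, ref, tol=50.0):
    """Reward in [0, 1] that decays with mean pointwise deviation.

    A stand-in for the paper's rule-based shape reward. Both inputs are
    equal-length lists of (x, y) points on the 0-1000 grid; `tol` is an
    assumed tolerance.
    """
    dists = [((px - rx) ** 2 + (py - ry) ** 2) ** 0.5
             for (px, py), (rx, ry) in zip(pred, ref)]
    mean_dev = sum(dists) / len(dists)
    return max(0.0, 1.0 - mean_dev / tol)

def area_reward(point, box):
    """1.0 if the predicted area point lies inside the target box, else 0.0."""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0
```

GRPO then compares these rewards across a group of sampled outputs, pushing the model toward paths that hug the reference and points that land inside the right region.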
Step I: Post-trained specialists
- RynnBrain-Nav: fine-tuned for vision-language navigation (VLN) in simulators, predicting low-level moves (forward, turn, stop) from long egocentric video histories.
- RynnBrain-Plan: a high-level planner that outputs subtasks plus grounding (boxes, points, paths) in multi-turn dialogues to keep memory.
- RynnBrain-VLA: turns grounded plans into low-level robot actions using a flow-matching (diffusion-like) setup, while reusing the same conversation format so pointing info stays aligned.
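The flow-matching idea behind RynnBrain-VLA can be sketched with its standard training objective: interpolate between noise and the target action, then regress the model onto the true velocity. Conditioning on the grounded plan is omitted here, and the 1-D action representation is an assumption for illustration.

```python
def flow_matching_loss(model, action, noise, t):
    """One flow-matching training step for action generation (a sketch).

    `model` is any callable (x_t, t) -> predicted velocity; `action` and
    `noise` are flat lists of floats; `t` is a scalar in [0, 1].
    """
    x_t = [(1 - t) * n + t * a for n, a in zip(noise, action)]   # interpolant
    target_v = [a - n for n, a in zip(noise, action)]            # true velocity
    pred_v = model(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred_v, target_v)) / len(action)

# An oracle that always predicts the true velocity incurs zero loss:
oracle = lambda x_t, t: [1.0, 2.0]
loss = flow_matching_loss(oracle, action=[1.0, 2.0], noise=[0.0, 0.0], t=0.5)
```

At inference, the trained velocity field is integrated from noise to a concrete action chunk, much as a diffusion sampler denoises step by step.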
The secret sauce:
- Treat coordinates as a first-class language. The model “talks” in points and paths as easily as words.
- Interleave thinking and pointing. Each reasoning step must show evidence in a specific frame.
- Train on rich egocentric and physics-aware data so patterns match real homes, hands, and objects.
Result: A single brain that can watch, remember, reason, and plan with the exactness that robots need.
04 Experiments & Results
🍞 Hook: If you want to know who’s fastest in a race, you don’t just read the times—you ask, “Fast compared to whom, and on what track?”
🥬 The Concept (RynnBrain-Bench): A new, carefully checked test that asks models to understand egocentric videos and to ground answers in exact space and time. How it works: It measures four pillars—Object Cognition, Spatial Cognition (ego- and world-centric), Grounding (find the frame + box), and Pointing (areas, affordances, trajectories), with strict metrics. Why it matters: Without a tough, fair test, we can’t tell if a model truly understands and localizes in the real world.
🍞 Anchor: The benchmark asks things like, “Which is nearer to you?” or “Locate the mug at frame 3,” or “Draw the path to move the bag to the shelf,” grading both correctness and precise localization.
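Grounding tasks like "locate the mug at frame 3" are typically scored with intersection-over-union (IoU) plus a frame check. Whether RynnBrain-Bench uses exactly this metric and a 0.5 threshold is an assumption; the sketch below shows the standard recipe.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_success(pred_frame, gt_frame, pred_box, gt_box, thresh=0.5):
    """A hit requires the right frame AND sufficient box overlap (assumed threshold)."""
    return pred_frame == gt_frame and iou(pred_box, gt_box) >= thresh
```

Coupling "when" (the frame) with "where" (the box) is what makes the test spatio-temporal rather than per-image.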
The test: RynnBrain is evaluated across 28 benchmarks, including 20 embodied tasks and 8 general vision tasks. The suite covers spatial reasoning (like VSI-Bench), egocentric question answering (EgoTaskQA), OCR in videos (EgoTextVQA), open-robot VQA (Open-X VQA), grounding tasks (RefSpatial-Bench), and manipulation targets (grasp datasets), plus video understanding tests (MVBench, EgoSchema, VideoMME). RynnBrain-Bench adds fine-grained, spatio-temporal checks with about 3,600 clips and 12,000 curated questions.
The competition: Compared against strong models such as RoboBrain 2.0, Pelican-VL, MiMo-Embodied, InternVL3.5, Qwen3-VL, and proprietary systems (Gemini 3 Pro, GPT-5.2).
The scoreboard (with context):
- Spatial cognition leaps: On VSI-Bench, RynnBrain-8B scores around 71, while many prior methods land in the 50s and 60s. Think of 71 as an A when most others got B’s.
- Egocentric reasoning: On EgoTaskQA and Open-X VQA, RynnBrain-30B (A3B) lifts scores by hefty margins (e.g., +10.5% on EgoTaskQA vs. strong baselines), signaling better first-person understanding.
- Localization wins: On RefSpatial-Bench and affordance tasks, RynnBrain variants lead or tie for top results, and grasping benchmarks show clear gains—like raising the Cornell-Grasp score substantially. That’s like not just recognizing a cup but also choosing the perfect grasp angle.
- Grounded reasoning: RynnBrain-CoP-8B, with Chain-of-Point thinking, beats both open and closed baselines across affordance, area, and trajectory prediction—averaging about 73.8 vs. ~65 for competitors. That’s like winning not only the sprint but also the hurdles and relay, all in one meet.
- Navigation strength: On R2R-CE and RxR-CE (val-unseen), RynnBrain-Nav-8B posts top-tier Success Rate and lowest Navigation Error, showing it follows long instructions in new buildings better than most. Oracle Success is especially high, indicating strong coarse path following—even if perfect stopping still needs polish.
- General vision stays strong: Despite its embodied focus, RynnBrain stays competitive on AI2D, MVBench, and InfoVQA, sometimes hitting state of the art. So it didn’t trade away general smarts to get good at robotics.
Surprising findings:
- Thinking with points beats bigger models: CoP’s interleaved pointing lets a compact 8B model outscore larger baselines on spatial tasks. Evidence-first thinking matters more than just adding parameters.
- Area prediction is hardest: Even top models lag on free-form area localization; RynnBrain still leads but shows room to grow.
- MoE scaling isn’t automatic: The 30B MoE (with fewer active parameters per step) doesn’t always beat an 8B dense model in early navigation, hinting that sparse expert routing may need task-specific training tricks.
- Stop decisions in VLN: Very high Oracle Success but lower final Success reveals that precise terminal stopping is a key remaining challenge.
Bottom line: Across cognition, localization, reasoning, and planning, RynnBrain moves the bar up—often by a full letter grade—while keeping robust general vision skills. The new benchmark confirms these gains are not just on paper but in strict, spatio-temporal tests.
05 Discussion & Limitations
Limitations (be specific):
- Precise area prediction under clutter: Free-form regions (like “clear space on the table”) remain challenging when surfaces are busy or partially occluded.
- Terminal actions in navigation: High Oracle Success but lower final Success means deciding exactly when and where to stop still needs work.
- Long-horizon memory under extreme length: While the context is large, very long videos with many subgoals can still strain consistency without external memory modules.
- Hardware and data demands: Training with long videos and mixed outputs (text + coordinates) needs multiple GPUs, careful load balancing, and curated egocentric data—still a barrier for small labs.
- Real-to-real transfer: Even with diverse data, moving from one home/lighting arrangement to another can trip models up; more domain adaptation may be needed.
Required resources:
- GPUs with enough memory to handle long sequences and video tokens; distributed training with ZeRO and gradient checkpointing.
- Access to egocentric video datasets, OCR clips, grounding annotations, and a planning corpus with affordances and trajectories.
- For robotics, a simulator (e.g., Habitat) and a real robot arm (e.g., Franka) with teleop tools for collection and evaluation.
When NOT to use it:
- Pure text tasks: If you don’t need vision or action, a lightweight LLM is simpler and cheaper.
- Ultra-precise control loops at millisecond scale: RynnBrain outputs high-level, grounded plans; ultra-fast servo loops still belong to a dedicated controller.
- Out-of-sensor-range tasks: If the key evidence isn’t visible (or sensors are poor), grounded reasoning will struggle, and hallucination risks rise.
Open questions:
- How to add robust world models: Can we learn simple physics simulations inside the model to better predict future states and collisions?
- Memory modules: What’s the best way to store and recall long episodic histories beyond the current context window?
- Social and safety reasoning: How to encode rules like “don’t block exits,” “avoid sharp tools near people,” or “respect privacy” into grounded plans?
- Better area understanding: Can we unify segmentation-quality region predictions with the tokenized coordinate language to improve free-form area tasks?
- MoE for embodiment: How do we route experts by skill (navigation vs. manipulation) without losing coordination across tasks?
In short, RynnBrain makes a big step, but precise areas, perfect stopping, and memory across extra-long activities remain active frontiers.
06 Conclusion & Future Work
Three-sentence summary: RynnBrain is an open, spatio-temporal foundation model that blends vision, language, and coordinates so robots can understand, reason, and plan in the physical world. By treating locations as tokens and interleaving thought with pointing, it turns vague text plans into physics-aware, executable steps. Across 28 benchmarks, it sets new standards in embodied cognition, grounding, and planning while keeping strong general vision ability.
Main achievement: Unifying text and geometry in one language—so the model can “speak” points, boxes, and paths alongside words—and proving that Chain-of-Point reasoning makes smaller models outthink bigger ones on spatial tasks.
Future directions: Add stronger long-term memory and predictive physics, sharpen free-form area understanding, improve final-stop decisions in navigation, and explore expert routing that mirrors human skills (seeing, balancing, grasping). Expand the open data flywheel with more diverse homes, hands, and tools, and connect RynnBrain to broader agent stacks (memory, cerebellum-like control, sensor hubs).
Why remember this: RynnBrain shows how to make AI plans that are not just smart but place-and-time exact—a recipe robots can truly follow. It reframes grounding as part of language, anchors thoughts to visual evidence, and opens the door to reliable helpers in kitchens, hospitals, warehouses, and classrooms.
Practical Applications
- Home assistance: Find, grasp, and place household items with precise, physics-aware steps.
- Warehouse picking: Identify correct products, choose safe grasp points, and plan collision-free paths.
- Hospital logistics: Deliver supplies by navigating hallways and placing items at exact drop-off zones.
- Retail inventory: Read labels (OCR), count stock on shelves, and localize restocking areas.
- Elderly care: Safely fetch personal items, open doors or drawers, and place objects within reach.
- Education: Teach robotics students how language, vision, and action tie together using open models.
- AR guidance: Provide step-by-step, grounded instructions that overlay points and paths on live video.
- Inspection and maintenance: Point out target components, indicate affordances (levers, valves), and plan action sequences.
- Kitchen automation: Read packaging, locate handles, and execute precise pour-and-place tasks.
- Navigation aides: Follow long, natural language directions in new buildings with reliable stopping behavior.