PhyCritic: Multimodal Critic Models for Physical AI
Key Summary
- PhyCritic is a judge model that checks other AI models' answers about the physical world, like cooking steps, robot actions, or driving choices.
- Its key idea is simple: before judging others, it solves the problem itself and then uses its own answer as a yardstick.
- It learns in two stages: a warm-up that strengthens physical perception and reasoning, and a special self-referential training that links judging to its own prediction.
- The training uses verifiable rewards, so the model gets a "treat" only when its answers and judgments are truly correct and well-formatted.
- A new test set, PhyCritic-Bench, focuses on physical AI tasks from robotics and driving, making judgment accuracy measurable and fair.
- On this benchmark, PhyCritic beats all open-source 7B/8B models by a large margin and even helps at test time to pick the best answer from many candidates.
- Surprisingly, even though it is trained for physical tasks, PhyCritic also judges well on general visual tasks like captioning and Q&A.
- It is data-efficient: only about 4,000 training samples and 380 RL steps were used, yet performance improved a lot.
- Ablations show both stages are needed, and the self-referential step is the main driver of gains.
- Main limitation: it needs ground-truth answers during training, which are harder to obtain for fully open-ended tasks.
Why This Research Matters
Robots and assistants increasingly act in the real world, where mistakes can waste time or be unsafe. PhyCritic helps ensure that AI suggestions are not just well worded but physically correct, grounded in what the camera sees and what cause-and-effect allows. This improves everyday tasks like cooking with a smart assistant and high-stakes activities like driving decisions. It also raises the quality of training signals for other models, leading to better policies without needing huge datasets. Because it is data-efficient and transfers to general judging, it offers a practical route to safer, smarter AI systems. In short, it upgrades AI from sounding right to being right in the physical world.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're helping your friend bake cookies. You look at the oven, the tray, and the timer. You don't just read the recipe; you also check what's really happening. That mix of seeing and thinking is what makes your advice useful.
The Concept (Visual Perception): What it is: Visual perception in AI means the model can look at images or videos and notice important details (like where the tray is or whether the oven door is closed). How it works:
- The model looks at pixels and turns them into features (shapes, edges, objects).
- It recognizes objects (tray, oven, vegetables).
- It links objects to places (tray inside oven, door closed). Why it matters: Without solid perception, any advice would be guesswork, like telling someone to push a tray that's already inside. Anchor: In the oven example, perception lets the AI see the tray is already in the oven and the door is closed.
Hook: You know how you predict what happens next in a game of Jenga: pulling a loose block won't topple the tower, but pulling a tight one might? That's cause and effect.
The Concept (Causal Reasoning): What it is: Causal reasoning is understanding how one action leads to another. How it works:
- Notice the current setup (objects and their states).
- Imagine possible actions (press start, push tray, open door).
- Predict outcomes (oven heats, tray moves, food cooks).
- Choose the action that matches the goal and the setup. Why it matters: Without it, AI might suggest impossible or silly steps, like turning on the oven before closing the door. Anchor: Seeing a closed oven with food inside, causal reasoning picks "press start" as the next step.
Hook: Think of labels you put on classroom projects: a gold star if it follows the rules and works as promised.
The Concept (Benchmarking Methods): What it is: Benchmarks are fair tests that compare models using the same tasks and answer keys. How it works:
- Collect questions with images/videos and correct answers.
- Ask models to answer.
- Score them consistently. Why it matters: Without benchmarks, we can't tell if a model is actually good or just lucky. Anchor: A driving benchmark can ask, "Do you stop or go?" and check against safe driving rules and ground truth.
Hook: Imagine training a puppy with treats: sit gets a treat; jumping on guests gets none.
The Concept (Reinforcement Learning): What it is: A way for models to learn behaviors that earn rewards. How it works:
- The model tries something.
- It gets a reward (good) or not (bad).
- It updates itself to get more rewards next time. Why it matters: Without rewards, the model doesn't know which behaviors to keep. Anchor: If the model correctly says "press start," it gets a reward and learns to do that again in similar scenes.
Hook: Think of a judge at a science fair who explains which project wins and why.
The Concept (Critic Models): What it is: Critic models evaluate other models' answers, giving scores, preferences, and explanations. How it works:
- Read the question and see the visual input.
- Compare two candidate answers.
- Check reasoning quality and final correctness.
- Pick the better one and explain why. Why it matters: Without good critics, models can drift toward flashy but wrong answers. Anchor: Between "push tray in further" and "press start," a good critic picks "press start" because the tray is already in place.
Hook: Picture robots that can stack blocks, open drawers, and follow safe driving rules. That's more than labeling pictures; it's doing smart things in the world.
The Concept (Physical AI): What it is: AI that sees, reasons, and plans actions in the real world. How it works:
- Perceive the scene (objects, positions, states).
- Use causal reasoning (if I do X, Y happens).
- Plan steps toward a goal (a sequence of safe, doable actions). Why it matters: Without physical AI, assistants can't move from talking to doing. Anchor: A kitchen robot must see the oven, understand that doors must be closed before heating, and then press start.
The world before PhyCritic: Multimodal models could describe images and answer simple questions, but judging whether an answer was physically correct (not just nicely written) was weak. Critics trained on captions or trivia missed physical impossibilities, like suggesting a robot push a tray that was already fully inside or turn left across solid traffic.
The problem: We needed a judge that understands physical setups, cause-and-effect, and plans, so it can tell whether a response is both visually grounded and causally valid.
Failed attempts: General judges did well on style and surface correctness but:
- Lacked physics awareness, so they might approve impossible steps.
- Were trained on broad data, not on embodied tasks like manipulation or driving.
- Didn't ground their verdicts in their own understanding, leading to inconsistent decisions.
The gap: A critic should "solve before judging." If a judge forms its own physics-aware answer first, it can fairly compare others to that reference.
Real stakes: Better critics mean safer robot assistants, clearer driving decisions, and more reliable planning in homes, hospitals, and factories. When AI plans your next action with a hot oven or a busy road, the difference between correct and incorrect guidance truly matters.
02 Core Idea
Hook: You know how a good referee in sports understands the game so well that they can almost predict the next play before blowing the whistle? That's why their calls feel fair.
The Concept (Multimodal Critic Models): What it is: A multimodal critic model looks at pictures or videos plus text and judges which answer is better and why. How it works:
- Read the question and watch the scene.
- Examine two candidate answers and their reasoning.
- Check truthfulness, visual grounding, logic, and accuracy.
- Choose the better response and justify the decision. Why it matters: Without a strong critic, models can pass off convincing-sounding but physically wrong steps. Anchor: In a driving clip, the critic rejects "overtake into oncoming traffic" and favors "wait" when the scene demands caution.
Aha! moment in one sentence: Teach the critic to first solve the problem itself and then use that solution as a stable reference when judging others.
Three analogies:
- Coach analogy: A coach demonstrates a drill first (self-prediction), then evaluates players by comparing to the correct form.
- Recipe analogy: A chef cooks a dish first (the reference), then judges other dishes by how close they match the right taste and sequence.
- Science fair analogy: A judge builds a mini-version in their head, then checks projects against that expected outcome.
Hook: Think of getting a treat only when the trick is clearly, verifiably correct; no guessing allowed.
The Concept (Reinforcement Learning with Verifiable Rewards, RLVR): What it is: A training method where the model is rewarded only when its outputs match ground-truth answers or follow strict formats that can be checked. How it works:
- Present a question with a known correct answer.
- Let the model answer.
- Give reward 1 if correct, 0 if not (and a small bonus if the format is right).
- Adjust the model to increase future rewards. Why it matters: Without verifiable rewards, the model might learn to sound smart instead of being right. Anchor: The model says "press start" when the door is closed and the tray is inside; the verifier says correct, so it earns a reward.
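A reward like this is easy to check in code. The sketch below, with a hypothetical `verifiable_reward` helper and an assumed `<answer>...</answer>` output format, shows how a binary correctness reward plus a small format bonus might be computed:

```python
import re

def verifiable_reward(model_output: str, ground_truth: str,
                      format_bonus: float = 0.1) -> float:
    """Toy verifiable reward: 1.0 for a correct final answer, plus a small
    bonus when the required format is followed. (Hypothetical helper; the
    paper's exact reward shaping may differ.)"""
    # Expect the final answer inside <answer>...</answer> tags.
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    reward = 0.0
    if match is not None:
        reward += format_bonus  # output is machine-checkable, so grant the bonus
        if match.group(1).strip().lower() == ground_truth.strip().lower():
            reward += 1.0       # verifiably correct answer
    return reward
```

Because the check is exact-match against ground truth, the model cannot earn the main reward by merely sounding plausible.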
Before vs after:
- Before: Critics often judged by style or partial correctness; they could miss physical impossibilities.
- After: With self-referential training, the critic anchors its verdicts to its own physics-aware prediction, making decisions more consistent and grounded.
Why it works (intuition):
- Anchoring: Producing its own answer creates a strong internal reference, reducing swayed-by-style mistakes.
- Causal grounding: Solving first forces the model to simulate the scene and check cause-and-effect.
- Double reward: The model is rewarded for both solving correctly and judging correctly, tying the two skills together.
Hook: Imagine a student who first solves a math problem, then grades classmates' work by comparing it to their own correct solution.
The Concept (Self-Referential Critic Finetuning): What it is: A training step where the critic must first write its own reasoning and answer, then explicitly use that answer to judge two candidates. How it works:
- Self-prediction: Answer the question yourself.
- Critic decision: Compare Response A vs. B using your answer as a reference.
- Rewards: Get points for a correct self-answer and for picking the right response.
- Format reward: Small bonus for following the thinking → answer → judging structure. Why it matters: Without this, judgments can be inconsistent and ungrounded. Anchor: In the oven task, the critic first infers the door is closed and the tray is in; it then favors "press start" over "push tray in further."
Building blocks of PhyCritic:
- Physical skill warm-up: Strengthen perception and causal reasoning with RLVR on physical Q&A.
- Self-reference critic stage: Make the judge solve first, then judge.
- Clear rubric: Truthfulness, visual grounding, logical validity, efficiency, and final accuracy.
- Data diversity: Robotics and driving videos, with paired responses and verifiable labels.
- Data efficiency: Only thousands of samples, but large gains.
Result: A critic that is not only a fair judge but also a capable solver in physical AI settings.
03 Methodology
At a high level: Input (question + image/video) → Stage 1 (physical skill warm-up with RLVR) → Stage 2 (self-referential critic finetuning) → Output (which response is better, with explanation).
Step-by-step details:
- Inputs
- What happens: The model receives a multimodal prompt: a question plus relevant images or video clips. It also receives two candidate responses (A and B) to compare.
- Why this exists: Judging requires context (visuals) and concrete options to evaluate.
- Example: Question: "Goal: Cook vegetables with the oven. What's the next subtask?" Responses: A) "Press the start button." B) "Push the tray further in."
- Stage 1: Physical Skill Warm-Up (RLVR) Hook: Like practicing scales on a piano before performing a duet. The Concept (Warm-Up with RLVR): What it is: The model first practices answering physical questions directly, getting rewards only when correct. How it works:
- It answers many physical Q&A items with known answers.
- It gets a reward of 1 for correct, 0 for wrong.
- It learns to see better and reason more causally. Why it matters: Without this foundation, the critic stage stands on shaky ground. Anchor: The model learns that with food inside and the door closed, "press start" is correct.
Implementation notes: The team uses GRPO (a PPO-style algorithm) to update the model without a separate value network by comparing groups of sampled outputs and normalizing rewards. Hook: Imagine rating a group of student solutions at once and using the best and worst to guide feedback. The Concept (GRPO): What it is: A reinforcement learning method that computes advantages by comparing a group of sampled outputs, stabilizing learning. How it works:
- Sample multiple answers/judgments.
- Score each with verifiable rewards.
- Compute relative advantages within the group.
- Update the policy to prefer higher-scoring samples. Why it matters: Without stable updates, training can wobble or collapse. Anchor: From five sampled answers, the ones that are verifiably correct pull the model toward better behavior.
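As a rough illustration, the group-relative advantage computation at the heart of GRPO can be sketched as follows (a minimal version; real implementations add clipping, KL control, and per-token bookkeeping):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each sampled output's reward minus the
    group mean, scaled by the group standard deviation. Samples scoring
    above the group average get positive advantages."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # all samples tied: no learning signal
    return [(r - mean) / std for r in rewards]
```

With rewards [1, 0, 0, 1, 0], the two correct samples get positive advantages and the three wrong ones negative, so the update shifts probability mass toward the verifiably correct behavior.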
- Stage 2: Self-Referential Critic Finetuning
- What happens: Now the model must both solve and judge in one go.
- Steps:
- Self-prediction: The model writes its own reasoning and final answer (its internal best guess).
- Critic decision: It compares Response A vs B, explicitly referencing its own answer.
- Rewards:
- Self-prediction reward (r_sp): 1 if its own answer matches the ground truth.
- Critic reward (r_crit): 1 if it picks the response that matches the ground-truth preference.
- Format reward (r_form): Bonus for following the structured output (think → pred → compare → final box).
- Why this exists: It forces the judge to anchor decisions in real physics-aware understanding.
- Example with data: Question: "Cook vegetables with the oven; what's next?" Self-prediction: "The tray is inside and the door is closed; next is 'press start.'" Compare A ("press start") vs. B ("push tray in"): Choose A.
- Output and Explanation
- What happens: The model outputs its own answer (for transparency), then a detailed comparison citing specific visual and logical reasons, and finally a strict decision (A or B is better).
- Why this exists: A judge should explain its call and tie it to evidence.
- Example: "Response A matches the needed next step given that the oven door is closed; Response B suggests an unnecessary action."
The Secret Sauce:
- Solve-before-judge anchor: By first answering itself, the model creates a reliable internal reference that reduces style-based bias.
- Dual rewards: Tying correctness of solving and judging together encourages consistent, physics-aware reasoning.
- Format shaping: A small format reward nudges consistent, clear reasoning structure, improving stability.
Training specifics (friendly recap):
- Base model: Qwen2.5-VL-7B-Instruct.
- Stage 1: 80 RL steps on Cosmos-Reason1-RL physical Q&A.
- Stage 2: 300 RL steps on paired response data with labels.
- Reward weights: self-prediction 0.2, critic 0.7, format 0.1.
- Batch size 128, learning rate 1e-6, mild KL control to keep updates safe.
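Putting the reward pieces together, the weighted Stage 2 reward can be sketched as below (a toy version using the reported weights; the paper's exact reward shaping may differ in details):

```python
def stage2_reward(self_correct: bool, critic_correct: bool, format_ok: bool,
                  w_sp: float = 0.2, w_crit: float = 0.7, w_form: float = 0.1) -> float:
    """Combine the self-prediction, critic, and format rewards using the
    reported weights (0.2 / 0.7 / 0.1)."""
    r_sp = 1.0 if self_correct else 0.0      # own answer matches ground truth
    r_crit = 1.0 if critic_correct else 0.0  # picked the ground-truth-preferred response
    r_form = 1.0 if format_ok else 0.0       # followed the think -> pred -> compare structure
    return w_sp * r_sp + w_crit * r_crit + w_form * r_form
```

The heavier critic weight keeps judging as the primary objective, while the self-prediction term ties verdicts to the model's own solving ability.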
Why each step matters:
- Without Stage 1, the critic may judge without understanding physics.
- Without Stage 2, the model won't tie its own understanding to its judgments, causing inconsistency.
- Without format constraints, outputs can drift and be hard to evaluate.
End-to-end example (condensed):
- Input: Driving clip with a stopped truck; Question: "What should you do next?"
- Stage 1 skill: Understand lanes, obstructions, safety.
- Stage 2 self-prediction: "Continue forward in your lane (A)."
- Judge A vs B: Prefer A if it matches the safe, grounded self-prediction and the ground truth.
- Output: Explanation citing specific visual cues (truck position, clear right lane).
04 Experiments & Results
The Test: What did they measure and why?
- They measured how often the criticās choice matched ground-truth preferences on physical tasks (pairwise accuracy).
- They also checked if skills transfer to general visual judging and whether the model itself became a better solver.
Hook: Like a sports tournament where judges must call the right winner over and over.
The Concept (PhyCritic-Bench): What it is: A new test built for judging in physical AI (robotics and driving) using videos, tough questions, and paired responses with verified labels. How it works:
- Gather robotics and driving clips (RoboVQA, Bridge V2, HoloAssist, AgiBot, RoboFail, LingoQA).
- Create questions requiring perception and causal logic.
- Pair a correct and an incorrect response; ask the judge to pick the better one. Why it matters: Without a physical-domain benchmark, critics could look good on easy image tasks but fail on real-world reasoning. Anchor: The test includes choices like "grasp vs. nudge" and "wait vs. overtake," making causal correctness measurable.
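The benchmark's core metric, pairwise accuracy, reduces to a simple count. The sketch below uses hypothetical inputs: per-item critic choices and ground-truth preference labels, each "A" or "B":

```python
def pairwise_accuracy(choices: list[str], labels: list[str]) -> float:
    """Fraction of items where the critic's pick ('A' or 'B')
    matches the ground-truth preferred response."""
    assert len(choices) == len(labels)
    hits = sum(1 for c, y in zip(choices, labels) if c == y)
    return hits / len(labels)
```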
The Competition: Baselines included strong open-source 7B/8B models (Qwen2.5-VL-7B, Eagle-2.5-8B) and physical-reasoning models (Cosmos-R1-7B, RoboBrain2.0-7B), plus proprietary models (GPT-4o, Gemini 2.5 variants).
The Scoreboard (with context):
- On PhyCritic-Bench, PhyCritic scored 68.0% accuracy. Think of this as getting a solid A when others got around a C to B-: Qwen2.5-VL-7B (51.6), Eagle-2.5-8B (56.0), RoboBrain2.0-7B (54.7), Cosmos-R1-7B (51.1).
- Sub-suites: It led or tied on AgiBot (78.8), HoloAssist (65.5), RoboVQA (86.7), and performed strongly on Bridge-v2 (65.6). It generalized to RoboFail (57.4) and LingoQA (60.0) too.
- General judging: Despite being trained for physical tasks, PhyCritic improved over base Qwen on VL-RewardBench (+4.1) and Multimodal-RewardBench (+1.9). That's like a soccer goalie who is great at penalties also blocking regular shots better.
- As a solver: On Cosmos-Reason1-Bench, PhyCritic hit 63.9% overall, slightly above Cosmos-R1-7B (63.0), and performed very well on CV-Bench (79.7 overall; best 3D at 83.9) and ranked second on EgoPlanBench2 (42.3 overall).
Surprising findings:
- Data efficiency: Only 4,058 samples and 380 RL steps yielded sizable gains, showing that the two-stage, verifiable approach is powerful.
- Self-prediction matters: Statistical tests showed that when PhyCritic's own answer was correct, its judgments were much more likely to be correct, confirming "solve before judge."
- Test-time helper: Using PhyCritic as a knockout judge to select the best of N sampled answers improved a base model by up to +6.5 points at N = 32.
- Training helper: Using PhyCritic to create preference pairs for DPO training made a policy model noticeably better than using only answer correctness.
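The test-time knockout selection described above can be sketched as a single-elimination tournament (an illustrative version; `judge` is a hypothetical callable standing in for a pairwise critic call, returning the preferred candidate):

```python
def knockout_best_of_n(candidates, judge):
    """Single-elimination selection: the critic judges pairs of sampled
    answers and winners advance until one answer remains."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Pair up candidates; an odd one out gets a bye to the next round.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```

For N candidates this needs only N - 1 critic calls, which is what makes best-of-N selection practical at test time.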
Ablations (what changed when parts were removed):
- Two-stage needed: Stage 1 alone boosted solving but not judging much; Stage 2 gave big judging gains; together they worked best across all metrics.
- No self-reference: Removing the self-referential step dropped physical judging by 3.6 points and hurt transfer.
- Reward weights and prompt criteria: Performance was robust to small reward weight changes, but removing judging criteria in the prompt made results clearly worse, showing structured guidance helps stabilize critic behavior.
Takeaway: PhyCritic is not just better at judging; it also makes other models better, both by selecting the best answer at test time and by training them with stronger preference signals.
05 Discussion & Limitations
Limitations:
- Needs ground-truth answers during training: The self-prediction reward depends on a known correct answer, which is hard for fully open-ended tasks.
- Domain coverage: While diverse, the data centers on robotics and driving; other physical domains (e.g., sports biomechanics) could need additional tuning.
- Format dependency: Some stability comes from a specific reasoning format; different formats might require re-tuning.
- Computation: Although data-efficient, RL finetuning still needs GPU time and careful hyperparameters (KL, sampling group size).
Required resources:
- A base multimodal model (e.g., a 7B VLM), RL finetuning tools, and datasets with verifiable answers and paired responses.
- Enough compute for group sampling and GRPO updates, though the step count is modest.
When NOT to use:
- Fully open-ended judging without any verifiable answers or reliable preference labels.
- Purely stylistic or creative writing tasks where physical correctness is irrelevant.
- Scenarios with mismatched modalities (e.g., audio-only tasks) unless extended.
Open questions:
- Can we replace ground-truth answers with self-verification or environment feedback to scale beyond answerable prompts?
- How far does the solve-before-judge idea generalize to other domains like medicine or education, where causality matters but visuals differ?
- Could multi-round self-refinement further stabilize judgments (judge → reflect → revise)?
- What is the best balance of self-prediction vs critic reward across different datasets and modalities?
- Can we design lighter prompts or implicit criteria while keeping the same stability and accuracy?
Big picture: PhyCritic shows that critics should think like solvers. The combination of verifiable rewards and self-referential judging creates a steadier, more physically aware evaluation process that helps both judging and solving.
06 Conclusion & Future Work
Three-sentence summary:
- PhyCritic is a multimodal critic for physical AI that first solves a problem itself, then uses that solution to judge other answers.
- It learns in two RLVR-based stages: a physical skill warm-up and a self-referential critic finetuning with rewards for both correct self-answers and correct judgments.
- It achieves state-of-the-art results among open-source 7B/8B critics on physical tasks, transfers to general judging, and even improves policy performance and training.
Main achievement:
- Turning "solve before judging" into a practical, reinforced training pipeline that grounds judgments in physics-aware reasoning and makes critics reliably consistent.
Future directions:
- Reduce reliance on ground-truth answers using self-verification or environment feedback.
- Expand to more physical domains and modalities, and explore multi-round self-refinement.
- Streamline prompts while keeping stability, and study optimal reward weightings across tasks.
Why remember this:
- PhyCritic reframes judging as a grounded act, not a style check. It shows that when an AI judge first builds its own understanding of a physical scene, its decisions become clearer, safer, and more useful, and that same grounding can help other models both at test time and during training.
Practical Applications
- Robot task auditing: Check whether a robot's next action sequence is physically valid before execution.
- Kitchen assistants: Verify safe and correct step ordering for cooking tasks (close door → press start).
- Autonomous driving evaluators: Judge safer maneuvers in tricky traffic scenarios.
- Factory workflow checkers: Validate assembly actions and tool-use plans against affordances.
- Best-of-N answer selection: Use PhyCritic to pick the best solution among many sampled outputs.
- Self-improving training: Generate stronger preference pairs for DPO or RLHF pipelines in physical tasks.
- Failure analysis: Explain why a plan is wrong (e.g., pushing an already-inserted tray) with grounded evidence.
- Educational simulators: Teach causal reasoning with visuals by comparing student solutions against a reference.
- Household robotics planning: Evaluate plans for chores like loading dishwashers or folding clothes.
- Safety gating: Filter out risky or physically impossible recommendations before they reach users.