TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
Key Summary
- Robots learn better when they get small hints at every step instead of only a final thumbs-up or thumbs-down.
- TOPReward turns a vision-language model’s hidden confidence about 'Did this finish the task? True or False' into a smooth progress score from 0 to 1.
- It never asks the model to print numbers (which models are bad at); it reads the model’s inside probabilities (which are much more reliable).
- Across 130+ real robot tasks on many arms, TOPReward tracks progress very well (about 0.947 VOC on Qwen3-VL), beating a popular baseline that collapses on open-source models.
- It also doubles as a success detector and improves behavior cloning by weighting better moments more strongly.
- A new benchmark, ManiRewardBench, tests progress sensitivity, success detection, and cross-robot robustness with stage annotations.
- Simple choices matter: using the single token 'True' and avoiding chat templates are key to strong results.
- TOPReward needs zero extra training—it's a training-free way to unlock reward signals already hidden inside pretrained video VLMs.
- In real-world tests, using TOPReward for advantage-weighted learning boosted robot success where plain behavior cloning lagged.
- This suggests many open-source VLMs are already 'robotics-ready' if we read their beliefs instead of their words.
Why This Research Matters
Robots in homes, warehouses, and hospitals need steady guidance to learn reliably, but hand-made rewards and endless labels don’t scale. TOPReward finds dense, trustworthy feedback already hiding inside pretrained video VLMs, so we can skip costly training and still get strong signals. This makes open-source models far more useful for real-world robot learning without depending on proprietary systems. It also enables automatic success detection and smarter imitation learning, saving time and human effort. Ultimately, it speeds up the path to capable, general-purpose robots that can adapt to new tasks and environments with minimal setup.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how a teacher doesn’t just give you one grade at the very end of the semester? They give you feedback on homework, quizzes, and projects so you can improve along the way.
🥬 The Concept — Reinforcement Learning (RL):
- What it is: RL is a way for robots (and software) to learn by trying things and getting feedback called rewards.
- How it works:
- The robot takes an action.
- It sees what happened (the new state) and gets a reward.
- It repeats this to learn which actions lead to better rewards over time.
- Why it matters: Without clear, frequent rewards, the robot can’t tell which steps helped and which didn’t, so learning becomes very slow.
🍞 Anchor: Imagine learning basketball if the only feedback you got was at the end of the season. That’s hard! Frequent coaching tips make you better much faster.
🍞 Hook: Think of a treasure hunt where the app only beeps when you’re exactly on the treasure. If it’s silent all the way until the end, you wander forever.
🥬 The Concept — Sparse Rewards:
- What it is: Sparse rewards are when the robot only gets a reward at the very end (success/failure), with little or no feedback in between.
- How it works:
- Robot tries a long, multi-step task.
- It gets a reward only if it finishes correctly.
- All the middle steps feel like guesswork.
- Why it matters: This makes real-world robot learning painfully slow and sample-inefficient.
🍞 Anchor: It’s like a maze game that only tells you “You won!” or “You lost!” at the very end—no arrows, no hints.
🍞 Hook: Imagine a robot that can watch videos, read instructions, and act—like a student who can see, read, and move.
🥬 The Concept — Vision-Language-Action (VLA) Models:
- What it is: VLA models connect what a robot sees (vision), what it’s told (language), and what it does (action).
- How it works:
- Look at camera frames (video).
- Read the instruction (like “put the pen in the cup”).
- Choose actions that match the instruction and the scene.
- Why it matters: VLAs promise general robots that can follow many tasks with words, but they still need good rewards to fine-tune and improve.
🍞 Anchor: If you tell a VLA “stack the red cube on the green cube,” it can plan motions using both sight and words.
🍞 Hook: Have you ever tried to guess someone’s progress at a puzzle by peeking at different moments?
🥬 The Concept — Progress Estimation (Temporal Value Function):
- What it is: A progress estimator is a score that should increase as a task moves toward completion.
- How it works:
- Watch earlier-to-later parts of a task.
- Estimate how close it is to done at each step.
- Make the score smoothly rise as the task proceeds.
- Why it matters: This gives dense, step-by-step feedback—exactly what RL needs for faster learning.
🍞 Anchor: In “fold the towel,” reaching, grasping, folding, and finishing should look like 10%, 40%, 80%, 100%.
The world before: People tried two main paths. First, hand-crafting rewards for each task—time-consuming and brittle when the environment changes. Second, training reward models from big robot datasets (like RoboDopamine or RoboReward). Those learned to score progress or success, but they needed tons of data and didn’t always transfer to new robots or camera angles.
The problem: We need general, fine-grained rewards that work across many tasks and robots without collecting fresh labels every time.
🍞 Hook: You know how some people explain math answers better than they can actually write the numbers neatly?
🥬 The Concept — VLMs vs. Number Outputs:
- What it is: Vision-language models (VLMs) often understand videos well, but are clumsy at printing precise numbers.
- How it works:
- Ask a VLM to output a number like 0.73.
- It might produce weird formatting or be poorly calibrated.
- The numeric text can be unreliable, even if the model “knows” the answer internally.
- Why it matters: If you rely on printed numbers for rewards, you may get junk signals, especially on open-source models.
🍞 Anchor: A student might know the solution but write it messily—grading the handwriting instead of the idea would be unfair.
Failed attempts: A popular training-free method, GVL, asks the VLM to output numeric progress for shuffled frames. It works on some proprietary models (like Gemini/GPT-4), but collapses on many open-source VLMs. That made people think open-source models just weren’t ready for robotics rewards.
The gap: Maybe these models do understand progress—but the number-printing step is the problem! What if we read their inner confidence instead of their outer words?
🍞 Hook: Imagine checking a friend’s face to see how sure they are, rather than listening to their shaky explanation.
🥬 The Concept — Token Probabilities (Logits):
- What it is: Inside the model, each next word/token gets a probability—how confident the model is about saying it.
- How it works:
- Feed in the video and a question.
- The model assigns probabilities to possible next tokens.
- Higher probability = stronger internal belief.
- Why it matters: These inside scores are often more honest and stable than the final printed text.
🍞 Anchor: If asked “Is this task complete?,” a big internal probability for “True” means the model believes the job is done.
Real stakes: If we can get solid, training-free progress signals from open-source VLMs, robots can:
- Learn faster with RL in homes, warehouses, and hospitals.
- Detect success automatically to save human labeling time.
- Improve imitation learning by rewarding the best moments.
- Generalize across new tasks without collecting fresh data.
This paper’s answer, TOPReward, uses the VLM’s hidden confidence about “True” (for “task completed”) to create a smooth, reliable progress curve—no number-printing needed, no extra training required, and it works across 130+ real robot tasks on multiple robot arms.
02 Core Idea
🍞 Hook: Picture a thermometer that rises as you get closer to finishing a chore—no need to ask for an exact temperature, you just watch it climb.
🥬 The Concept — The Aha!: Use the model’s hidden probability that the instruction is True to measure how much of the task is done as time goes on.
- What it is: TOPReward turns a VLM’s inside belief about “Did this complete the task? True/False” into a clean progress score.
- How it works:
- Show the model the instruction and a video prefix (the first part of the attempt).
- Ask a simple completion question that expects True/False.
- Read the internal probability of the token “True.”
- Repeat for longer prefixes to trace a progress curve that rises over time.
- Why it matters: This avoids fragile number-printing, is zero-shot (no training), and works well on open-source models.
🍞 Anchor: In “put the pen in the cup,” early frames might get low “True” probability, but as the pen gets closer and then drops in, the probability ramps up.
Three analogies for the same idea:
- Applause meter: The crowd’s cheers (probability of “True”) grow as the performer nears the finale (task completion).
- Detective’s hunch: As clues stack up (video frames), the detective’s confidence (“True”) rises.
- Loading bar: Each subtask adds a chunk; the bar fills smoothly without requiring you to type exact percentages.
Before vs. After:
- Before: People thought open-source VLMs couldn’t do progress estimation well without training, because asking them to output numbers led to messy results.
- After: By reading internal beliefs instead of printed numbers, open-source VLMs suddenly look very capable—delivering smooth, accurate progress curves across many tasks and robots.
🍞 Hook: Imagine you ask, “Are we there yet?” many times during a road trip—and the navigator gives a confidence score each time.
🥬 The Concept — Temporal Value from Internal Belief:
- What it is: A value that increases over time if you’re on track to finish the task, derived from “True” token probability.
- How it works:
- Sample multiple time points along a trajectory.
- For each prefix, compute log-probability of “True.”
- Normalize within the episode to map to 0–1.
- Use the increments between steps as dense rewards.
- Why it matters: A monotonic, well-behaved signal that policies can learn from right away.
🍞 Anchor: During “fold the towel,” the curve stays low while reaching, rises at grasp, bumps again at the fold, and maxes near the end—matching real sub-steps.
Why it works (intuition, not equations):
- VLMs store rich world knowledge and video understanding in their internal activations.
- Numeric text outputs are a noisy last step; the inner probabilities are often better calibrated and more faithful to what the model really “thinks.”
- A single-token answer like “True” avoids formatting issues and concentrates belief into one place.
- As visual evidence accumulates across time, the “True” probability naturally rises, yielding a smooth progress curve.
Building blocks (with mini-sandwiches):
- 🍞 Hook: You know how yes/no questions are easier than asking for exact percentages. 🥬 The Concept — Binary Completion Query:
- What it is: Ask the model “Does this trajectory complete the task? True or not.”
- How it works: Provide instruction + video prefix → read probability of “True.”
- Why it matters: A simple, robust probe of the model’s internal belief. 🍞 Anchor: “Place doll in box?” gets low “True” early, high “True” near the drop.
- 🍞 Hook: Think of a confidence dial that turns up when evidence appears. 🥬 The Concept — Token Probabilities (Logits → Probabilities):
- What it is: Internal scores that become probabilities over tokens.
- How it works: The model weighs possible next tokens; “True” gets a probability.
- Why it matters: This bypasses unreliable numeric text. 🍞 Anchor: If the pen is inside the cup, “True” spikes.
- 🍞 Hook: Checking progress every few seconds during a race. 🥬 The Concept — Prefix Sampling:
- What it is: Evaluate multiple early-to-late slices of the video.
- How it works: Pick K time points; compute “True” probability at each.
- Why it matters: Builds a temporal curve of progress. 🍞 Anchor: At 10%, 40%, 70%, 100% of the video, the curve rises.
- 🍞 Hook: Rescaling grades from any class to a 0–100 scale. 🥬 The Concept — Min–Max Normalization (Per Episode):
- What it is: Map raw log-probabilities into a 0–1 progress score.
- How it works: Subtract the minimum, divide by the range, add a tiny epsilon for stability.
- Why it matters: Gives a clean, comparable curve within each trajectory. 🍞 Anchor: A towel-folding attempt always gets a 0–1 curve regardless of lighting or camera zoom.
- 🍞 Hook: Rewarding effort bumps—extra points when real progress happens. 🥬 The Concept — Dense Reward Increments:
- What it is: Use increases in progress between steps as per-step rewards.
- How it works: Compute exponential weighting of deltas, then clip to avoid giant spikes.
- Why it matters: Teaches policies to favor actions that actually move the task forward. 🍞 Anchor: Lifting the cube a bit gets a small reward; placing it on the target gets a larger one.
- 🍞 Hook: Sometimes fancy letterheads make a simple message confusing. 🥬 The Concept — Avoid Chat Templates:
- What it is: Don’t wrap prompts in heavy chat formatting.
- How it works: Feed a simple, direct prompt closer to pretraining’s next-token prediction.
- Why it matters: Keeps signals stable; templates hurt performance in ablations. 🍞 Anchor: The same question, asked plainly, gives more reliable “True” probabilities.
Bottom line: The key leap is to read belief, not print numbers. That single switch lets open-source models shine as zero-shot reward models for robotics.
03 Methodology
At a high level: Instruction + Video → (Ask True/False on prefixes) → Read “True” probabilities → Normalize to 0–1 curve → Compute dense rewards if needed.
Step-by-step (with mini-sandwiches):
- 🍞 Hook: Imagine you show just the beginning of a movie and ask, “Is the mystery solved yet?” 🥬 The Concept — Input Preparation:
- What it is: Pair a natural-language instruction (like “put the pen in the cup”) with a video trajectory of frames.
- How it works:
- Collect a trajectory τ1:T (frames in time order).
- Keep the instruction x exactly as given.
- Choose K prefix lengths t1 < … < tK covering early to late steps.
- Why it matters: Sampling prefixes lets us see progress forming over time, not just at the end. 🍞 Anchor: For “fold the towel,” we might probe frames at 10%, 30%, 60%, 90% of the video.
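The prefix-picking step above can be sketched in a few lines. This is a minimal illustration, not the paper's code; the function name and the even spacing of prefixes are assumptions:

```python
def sample_prefixes(T: int, K: int) -> list[int]:
    """Pick K prefix lengths t_1 < ... < t_K spanning a T-frame trajectory.

    Evenly spaced fractions of the episode, always ending at the full video.
    """
    return [max(1, round(T * (k + 1) / K)) for k in range(K)]

print(sample_prefixes(100, 4))  # -> [25, 50, 75, 100]
```

Any schedule that covers early, middle, and late parts of the attempt would do; even spacing is just the simplest choice.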
- 🍞 Hook: Asking a yes/no question is easier than asking for a decimal. 🥬 The Concept — The Prompt (No Chat Template):
- What it is: A simple True/False completion query, close to next-token prediction.
- How it works:
- Build a direct prompt: “The above video shows a robot trajectory that completes: {INSTRUCTION}. Decide if that is True or not. The answer is: {a}”
- Use a single affirmative token a = “True”.
- Don’t wrap in chat templates (ablation shows performance drops when you do).
- Why it matters: Keeps the model’s internal belief easy to read and less noisy. 🍞 Anchor: “Does this video complete ‘put the cube in the cup’? The answer is: True.”
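Building the prompt is plain string formatting. This sketch reproduces the template quoted above; the helper name is an assumption for illustration:

```python
def build_prompt(instruction: str, answer: str = "True") -> str:
    """Direct completion-style query, deliberately without any chat template."""
    return (
        "The above video shows a robot trajectory that completes: "
        f"{instruction}. Decide if that is True or not. The answer is: {answer}"
    )

print(build_prompt("put the pen in the cup"))
```

The point is what is absent: no system message, no role markers, no special chat tokens, just text close to the model's pretraining objective.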
- 🍞 Hook: Think of a thermometer reading at each time point. 🥬 The Concept — Reading Token Probability as Reward:
- What it is: Use the log-probability of “True” as a raw reward r_t for each prefix.
- How it works:
- For each prefix τ1:t, feed video + instruction to the VLM.
- Compute log p(“True” | context of that prefix).
- Collect r_t across all sampled t.
- Why it matters: It captures how the model’s confidence grows as evidence accumulates. 🍞 Anchor: Early in towel folding, r_t is very negative (low probability); later, it approaches 0 (high probability).
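With logit access, reading log p("True") is just a log-softmax over next-token scores. The sketch below uses a toy two-token vocabulary with made-up logit values (a real VLM scores its full vocabulary):

```python
import math

def log_prob_of(logits: dict[str, float], token: str = "True") -> float:
    """Log-softmax over candidate next tokens, then read the target token."""
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return logits[token] - log_z

early = {"True": -2.0, "False": 3.0}  # little visual evidence of completion
late = {"True": 4.0, "False": -1.0}   # the task looks finished
print(log_prob_of(early), log_prob_of(late))  # very negative, then near 0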
- 🍞 Hook: Converting any grading system to a 0–100 scale. 🥬 The Concept — Per-Episode Normalization:
- What it is: Turn raw log-probabilities (−∞ to 0] into a 0–1 progress score s_t.
- How it works:
- Find min and max r_t within this trajectory.
- s_t = (r_t − min) / (max − min + ε), small ε for stability.
- Now s_t is guaranteed in [0, 1].
- Why it matters: A clean, comparable curve within each attempt. 🍞 Anchor: Different lighting or camera views won’t break the curve’s 0–1 shape inside each episode.
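The min-max step above in a minimal sketch (helper name assumed; the small ε is the stability term described):

```python
def normalize_episode(r: list[float], eps: float = 1e-8) -> list[float]:
    """Map raw log-probabilities r_t into a [0, 1] progress curve s_t."""
    lo, hi = min(r), max(r)
    return [(v - lo) / (hi - lo + eps) for v in r]

print(normalize_episode([-6.0, -4.0, -1.0, -0.1]))
# the lowest point maps to 0.0; the highest approaches 1.0
```

Note the scaling is per trajectory, which is exactly why a 0.9 in one episode is not directly comparable to a 0.9 in another (see the limitations section).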
- 🍞 Hook: Rewarding the actual moments where things get done. 🥬 The Concept — Dense Per-Step Rewards (Optional for Learning):
- What it is: Translate increases in s_t into per-step advantages for training policies.
- How it works:
- Compute Δ_t = clip(τ · exp(s_t − s_{t-1}), min=0, max=δ_max).
- τ (tau) controls how much extra weight good steps get.
- δ_max caps huge spikes so training stays stable.
- Why it matters: Policies learn to favor actions that truly push the task forward. 🍞 Anchor: Lifting the cube off the table adds a small Δ_t; placing it in the box adds a larger Δ_t.
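The clipped exponential weighting above can be sketched directly; the τ and δ_max values here are illustrative defaults, not the paper's settings:

```python
import math

def dense_rewards(s: list[float], tau: float = 1.0,
                  delta_max: float = 2.0) -> list[float]:
    """Per-step rewards: clip(tau * exp(s_t - s_{t-1}), 0, delta_max)."""
    return [
        min(max(tau * math.exp(s[t] - s[t - 1]), 0.0), delta_max)
        for t in range(1, len(s))
    ]

print(dense_rewards([0.0, 0.1, 0.6, 1.0]))
# bigger progress jumps earn bigger rewards; the cap keeps spikes bounded
```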
- 🍞 Hook: Choosing the clearest green light. 🥬 The Concept — Picking the Right Token (Why “True”):
- What it is: Use the single-token word “True” as the affirmative answer.
- How it works:
- Test candidate tokens (e.g., “True”, “Yes”).
- Measure which separates success vs. failure best at the end.
- “True” shows the largest, most consistent gap.
- Why it matters: Maximizes signal quality for both progress and success detection. 🍞 Anchor: Final frames of a completed attempt give the strongest “True” boost.
- 🍞 Hook: Don’t add extra wrapping paper that confuses the gift. 🥬 The Concept — Avoiding Chat Templates:
- What it is: Keep prompts minimal because templates reduced performance in tests.
- How it works:
- Use a bare prompt and the model’s video-text inputs.
- Proprietary APIs that force templates can hurt VOC.
- Open-source backbones without templates perform best.
- Why it matters: Staying close to pretraining’s next-token prediction objective keeps logits meaningful. 🍞 Anchor: The same question, without fancy packaging, gets clearer answers.
- 🍞 Hook: Turning curves into decisions. 🥬 The Concept — Success Detection:
- What it is: Decide if an attempt succeeded using the final (or last few) “True” log-probabilities.
- How it works:
- Average log p(“True”) over last few prefixes.
- Compare across episodes; higher means more likely success.
- Evaluate with ROC-AUC (how well it separates success/failure).
- Why it matters: A practical, label-free way to flag successful demos and filter datasets. 🍞 Anchor: Two towel-folding videos—one finishes, one stalls. The finisher shows a clearly higher final “True.”
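Turning the curve into a success decision needs only the tail of the log-probability sequence. A minimal sketch (the function name and the choice of last_k are assumptions):

```python
def success_score(log_p_true: list[float], last_k: int = 3) -> float:
    """Average log p("True") over the last few prefixes;
    a higher score means the episode more likely finished the task."""
    tail = log_p_true[-last_k:]
    return sum(tail) / len(tail)

finished = [-5.0, -3.0, -0.5, -0.2, -0.1]  # confidence ramps up and stays high
stalled = [-5.0, -4.5, -4.0, -4.2, -4.1]   # never becomes confident
print(success_score(finished) > success_score(stalled))  # -> True
```

Ranking episodes by this score is what the ROC-AUC evaluation in the experiments section measures.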
- 🍞 Hook: Using better “grades” to train better students. 🥬 The Concept — Reward-Aligned Behavior Cloning (AWR-style):
- What it is: Weight imitation learning by advantages derived from TOPReward, so good moments count more.
- How it works:
- Compute Δ_t from s_t for each state-action.
- Subtract the mean to get advantages.
- Fine-tune the policy with advantage-weighted losses.
- Why it matters: Improves real-world performance over plain behavior cloning. 🍞 Anchor: On six SO-100 tasks, advantage-weighted fine-tuning beats the baseline, sometimes by a lot.
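The advantage weighting can be sketched as below. The exponentiated weight and the temperature beta are standard AWR ingredients assumed here for illustration, not quoted from the paper:

```python
import math

def awr_weights(rewards: list[float], beta: float = 1.0) -> list[float]:
    """Mean-center per-step rewards into advantages, then exponentiate
    so above-average moments get weight > 1 in the cloning loss."""
    mean = sum(rewards) / len(rewards)
    return [math.exp((r - mean) / beta) for r in rewards]

w = awr_weights([0.5, 1.0, 1.5])
print(w)  # the average step gets weight 1.0; better steps get more
```

In training, each state-action pair's imitation loss would be multiplied by its weight, so the policy copies the productive moments hardest.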
Concrete running example (Fold the Towel):
- Early prefixes: low “True” probability → s_t near 0.
- Grasp achieved: s_t jumps toward 0.4–0.6.
- First fold: s_t rises toward 0.8–0.9.
- Final tidy placement: s_t approaches 1.0.
- The Δ_t spikes at grasp and fold guide the learner to reproduce these key moves.
Secret sauce:
- Read the model’s beliefs, don’t force it to print numbers.
- Use a single, well-separated token (“True”).
- Sample prefixes to form a temporal curve, then normalize per episode.
- Avoid chat templates that muffle the signal.
- Convert progress bumps into dense rewards for learning pipelines.
04 Experiments & Results
🍞 Hook: If a new bike is really better, you don’t just say it—you race it against others and time the laps.
🥬 The Concept — The Test (What we measured and why):
- What it is: Measure how well predicted progress follows real time using Value-Order Correlation (VOC), and how well final scores detect success using ROC-AUC.
- How it works:
- VOC checks if the model’s scores rise in the same order as the video frames (Spearman rank correlation).
- ROC-AUC checks how well the method separates successful and failed attempts.
- Why it matters: VOC = “Does the curve go up smoothly over time?” ROC-AUC = “Can we tell if it actually finished?”
🍞 Anchor: A good curve climbs like a staircase across the task; a good detector says “this one finished; that one didn’t.”
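VOC is Spearman rank correlation between the predicted progress scores and frame order. A minimal sketch, assuming no tied scores:

```python
def value_order_correlation(scores: list[float]) -> float:
    """Spearman rank correlation of the score sequence against time order."""
    n = len(scores)
    order = sorted(scores)
    ranks = [order.index(v) for v in scores]  # rank of each score (no ties)
    d2 = sum((r - t) ** 2 for r, t in zip(ranks, range(n)))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(value_order_correlation([0.1, 0.3, 0.6, 0.9]))  # 1.0, perfectly monotone
print(value_order_correlation([0.9, 0.6, 0.3, 0.1]))  # -1.0, reversed
```

A score near 0 means the curve's ordering is unrelated to time, which is roughly what the GVL baseline produces on open-source models.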
The competition: GVL (Generative Value Learning) is a training-free baseline that asks the model to print numeric progress for shuffled frames. It does okay on some proprietary models but struggles badly on open-source ones.
Benchmarks:
- Open X-Embodiment (OXE) subset: 39 datasets, 20 episodes each (diverse robots/cameras).
- ManiRewardBench: 130+ real-world manipulation tasks across multiple robot platforms (Franka, single-arm/bimanual YAM, SO-100/101), with subtask stage annotations.
Backbones tested:
- Open-source: Qwen3-VL-8B, Molmo2-8B.
- Proprietary: Gemini-2.5-Pro (logit access available).
Scoreboard with context:
- On OXE:
- TOPReward with Qwen3-VL-8B: VOC ≈ 0.857 (like going from a shaky C to a strong A-), vs. GVL ≈ 0.194.
- TOPReward with Molmo2-8B: VOC ≈ 0.417, vs. GVL ≈ −0.016 (a flip from slightly worse-than-guessing to decently positive).
- On Gemini-2.5-Pro: GVL ≈ 0.541, TOPReward ≈ 0.433 (here templates muddy TOPReward’s signal; see ablation).
- On ManiRewardBench:
- TOPReward with Qwen3-VL-8B: VOC around 0.942–0.954 across datasets; mean ≈ 0.947—smooth, consistent progress tracking across robots.
- GVL often near zero or negative on open-source models, showing its numeric output formulation falters where TOPReward’s logit probe thrives.
Surprising findings:
- Chat templates degrade performance a lot on open-source models (nearly 50% drop in VOC on Qwen3-VL in ablations). Keeping prompts bare matters.
- VOC can hide failures: a trajectory that rises then plateaus early can still have high VOC. TOPReward’s direct probability of “True” naturally assigns lower final scores to incomplete attempts, making it better for success detection.
🍞 Hook: Think of a lie detector that doesn’t care about smooth talking—only about the speaker’s true heartbeat.
🥬 The Concept — Success Detection (Binary classification):
- What it is: Use final “True” probabilities to guess if a trajectory actually finished.
- How it works:
- Average last few log p(“True”).
- Score ROC-AUC: higher is better separation of success vs. failure.
- Why it matters: A practical tool for filtering datasets and judging outcomes without labels.
🍞 Anchor: On failure splits, TOPReward with Qwen3-VL-8B beats GVL (ROC-AUC ≈ 0.654 vs. 0.519—GVL is near random), while on Gemini both are strong.
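ROC-AUC has a handy pairwise reading: the probability that a randomly chosen success scores above a randomly chosen failure (ties count half). A minimal sketch:

```python
def roc_auc(success_scores: list[float], failure_scores: list[float]) -> float:
    """Fraction of (success, failure) pairs ranked correctly; ties count 0.5."""
    wins = sum(
        1.0 if s > f else 0.5 if s == f else 0.0
        for s in success_scores
        for f in failure_scores
    )
    return wins / (len(success_scores) * len(failure_scores))

print(roc_auc([-0.1, -0.3], [-4.0, -3.5]))  # 1.0: perfect separation
print(roc_auc([0.0], [0.0]))                # 0.5: indistinguishable, chance level
```

This is why GVL's 0.519 reads as "near random": its final scores barely separate finished from unfinished attempts.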
Real-world policy improvement (SO-100 tasks):
- Setup: Start with a pretrained policy, collect 50 demos per task (noisy), compute TOPReward values per step, convert to advantages, and fine-tune with advantage-weighted regression.
- Results: Consistent gains over behavior cloning across six tasks. On some tasks, success jumps from a failing baseline to near-perfect (e.g., “Place doll in box” climbs from a 0 to a 10/10 partial-success score). This shows TOPReward’s progress bumps are usable, not just pretty curves.
Takeaways:
- Training-free, zero-shot progress estimation is possible and strong with open-source models—if we read logits, not printed numbers.
- Success detection works naturally from the same signal.
- The method delivers real robot improvements when plugged into learning pipelines.
- Prompt formatting (no chat templates) and token choice (“True”) are simple but crucial details.
05 Discussion & Limitations
Limitations:
- Visual detail limits: If the VLM can’t see fine-grained cues (like tiny object alignment), the progress estimate can be noisy or flat at key moments.
- Per-episode normalization: Since we min–max within each trajectory, raw 0.9 on one video isn’t directly comparable to 0.9 on another without care—though final “True” probabilities can still support cross-episode success detection.
- Backbone dependence: Performance rises or falls with the underlying video VLM’s quality. Better video models should directly boost TOPReward.
- Prompt sensitivity: Chat templates reduced performance in ablations; APIs that force templates can hurt results.
Required resources:
- A video-capable VLM with access to token probabilities/logits.
- Enough compute to run K forward passes per video (one per sampled prefix). K is a trade-off: more samples, smoother curve, more compute.
- Simple prompting setup (no heavy chat wrappers) and care in token selection (use the single-token “True”).
When NOT to use:
- Ultra-fast, on-robot feedback loops with very tight latency budgets where multiple forwards per step are infeasible.
- Tasks whose success hinges on tiny, nearly invisible changes (e.g., millimeter-precision insertions) if your VLM can’t visually resolve them.
- Situations where APIs force templates or block access to logits—this undermines TOPReward’s main advantages.
Open questions:
- Calibration across episodes: Can we replace per-episode normalization with a global calibration that keeps absolute scores comparable without losing sensitivity?
- Multi-token or richer targets: Are there other token choices (besides “True”) or small sets of tokens that boost robustness even more?
- Temporal modeling: Could smarter sampling (adaptive prefixes) reduce compute while keeping curves smooth?
- Safety: How can we wrap TOPReward with safeguards so misreads (e.g., occlusions) don’t mislead robots in safety-critical settings?
- Beyond manipulation: Does the same idea transfer to navigation, locomotion, or human-robot interaction videos as easily?
Honest assessment: TOPReward is a simple, powerful trick—read belief, don’t force numbers—that unlocks strong zero-shot rewards from open-source VLMs. It’s not magic: quality depends on the backbone and careful prompting. But given how easy it is to plug in, and how well it works across many tasks, it’s a practical tool to speed up robot learning today while we build even better video models tomorrow.
06 Conclusion & Future Work
Three-sentence summary: TOPReward turns a video VLM’s hidden probability that an instruction is “True” into a reliable, rising progress curve over time. By reading beliefs instead of forcing numeric text, it delivers zero-shot, training-free rewards that work across 130+ real robot tasks and multiple arms—outperforming prior training-free methods on open-source backbones. The same signal powers success detection and improves imitation learning through advantage-weighted fine-tuning.
Main achievement: Showing that open-source video VLMs already contain general, usable reward signals—if we look at token probabilities instead of printed numbers—achieving about 0.947 mean VOC on ManiRewardBench with Qwen3-VL and strong results on OXE.
Future directions:
- Global calibration for absolute progress comparability across episodes.
- Smarter time sampling to cut compute.
- Exploring richer token sets and ensemble probes for even more robustness.
- Extending to other domains (navigation, multi-agent tasks) and pairing with ever-stronger video backbones.
Why remember this: It’s a simple, elegant shift—belief over words—that flips the script on zero-shot rewards for robotics. With virtually no training and minimal engineering, TOPReward enables dense, meaningful feedback in the real world, helping robots learn faster, detect success automatically, and generalize across tasks and embodiments.
Practical Applications
- Automatic success detection for robot demos to filter and curate datasets with minimal human labeling.
- Reward-aligned behavior cloning that weights the most productive moments to improve policy fine-tuning.
- Training-free evaluation of task progress during real-world rollouts for early stopping or recovery strategies.
- Curriculum generation by ranking demonstrations from easiest (high early progress) to hardest (late progress).
- Cross-embodiment dataset selection by picking trajectories with clear progress signals across different robot arms/cameras.
- On-robot monitoring to detect when a task is likely completed and trigger the next instruction in a sequence.
- Benchmarking and QA of new robot datasets using zero-shot progress traces to spot anomalies or mislabeled episodes.
- Selecting starting states for RL by preferring prefixes with rising progress to boost sample efficiency.
- Active data collection loops that keep demonstrations with strong final 'True' signals and discard low-value ones.
- Post-hoc scoring of legacy videos to identify high-quality examples for training future policies.