PyVision-RL: Forging Open Agentic Vision Models via RL
Key Summary
- PyVision-RL teaches vision-language models to act like curious agents that think in multiple steps and use Python tools to inspect images and videos.
- It fixes a common failure called interaction collapse, where models stop using tools and give short, weak answers after RL training.
- Two training ideas drive the gains: an accumulative tool reward that pays for useful multi-turn tool use, and an oversampling–filtering–ranking rollout pipeline that keeps only informative, stable learning examples.
- For videos, PyVision-Video loads the whole video only into the Python runtime and fetches frames on demand, which slashes visual token usage while keeping accuracy high.
- On image tasks, PyVision-Image sets new state-of-the-art results across visual search and multimodal math benchmarks.
- On video tasks, PyVision-Video beats strong agentic and non-agentic baselines, using about 5K visual tokens instead of ~45K while staying more accurate.
- Training is made steadier by removing a noisy normalization term in GRPO and by filtering broken tool runs before learning.
- The approach is fully open-weight and uses a single, unified recipe for both images and videos.
- Results show that sustained interaction and on-demand visual processing are key to scalable multimodal agents.
- Careful sandboxing is needed because the Python tool can access files if not restricted.
Why This Research Matters
Many real-world problems need careful, step-by-step investigation, not just a quick guess. PyVision-RL shows how to keep models curious and grounded: reward useful steps, stabilize training data, and only look at the parts of a video that matter. That means better answers with fewer tokens, which saves money and speeds up responses. It also empowers open-weight communities to build practical, safe, and efficient agents for education, healthcare imaging, and security video review. By turning models into steady investigators, this approach scales to bigger tasks while staying cost-aware. In short, it’s a blueprint for making AI assistants that act more like thoughtful scientists than hurried guessers.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you’re solving a big jigsaw puzzle. You don’t just stare at the whole pile once. You pick up a piece, zoom in, try it, look again, and keep interacting until the picture makes sense. That steady back-and-forth is how people solve tricky problems.
🥬 Filling (The Actual Concept – Reinforcement Learning): What it is: Reinforcement learning (RL) is a way for AI to learn by trying, getting rewards for good outcomes, and adjusting its behavior, like a puppy learning tricks with treats. How it works:
- The model tries a plan (like cropping an image, plotting frames, or doing math).
- It gets feedback (reward) if the final answer is correct, and possibly more reward if the plan used helpful tools well.
- It updates its strategy to do more of what gets rewards. Why it matters: Without RL, models often talk nicely but don’t get better at taking actions that truly solve problems. 🍞 Bottom Bread (Anchor): Think of a robot helper learning to set a table. Each time it places plates correctly, it earns a point. Over time, it learns a reliable routine.
🍞 Top Bread (Hook): You know how using a magnifying glass helps you find tiny details in a picture? Tools let you act, not just look.
🥬 Filling (The Actual Concept – Dynamic Tooling): What it is: Dynamic tooling means the model writes and runs Python code on the fly to do exactly the image or video operations it needs (crop, zoom, sample frames, draw, measure), instead of relying on a small, fixed toolset. How it works:
- The model thinks in text.
- It decides to call a tool by writing Python that will, for example, crop a region or fetch frames from a video.
- The code runs in a sandbox, returning clues (numbers, plots, or images).
- The model reads those clues and continues reasoning, repeating as needed. Why it matters: Without dynamic tools, the model is stuck with one-size-fits-all buttons that might not match the problem. 🍞 Bottom Bread (Anchor): Like a chef who can grab any utensil and invent a technique in the moment, the model can write the exact code it needs to slice a high-res image or skim a video.
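The think → code → execute → read loop described above can be sketched as a toy scaffold. Everything here is illustrative, not the paper's implementation: the policy is a scripted stand-in for the VLM, and the `exec`-based sandbox is only a placeholder (a real deployment needs process isolation and resource limits, as the limitations section notes).

```python
import contextlib
import io

def run_in_sandbox(code, env):
    """Execute model-written code and capture its printed output.
    Toy sandbox only: never exec untrusted code like this in production."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)
    return buf.getvalue().strip()

def agent_loop(model_step, max_turns=8):
    """Alternate model 'thoughts' and tool execution until a final answer.
    `model_step` stands in for the vision-language policy."""
    env = {}            # persistent Python runtime shared across turns
    observation = None
    for _ in range(max_turns):
        action = model_step(observation)              # think, then act
        if action["type"] == "final":
            return action["answer"]
        observation = run_in_sandbox(action["code"], env)  # returned clue
    return None

# A scripted stand-in policy: "measure" two regions, then answer.
script = iter([
    {"type": "code", "code": "a = 40 * 30; print(a)"},  # area of region A
    {"type": "code", "code": "b = 25 * 50; print(b)"},  # area of region B
    {"type": "final", "answer": "B"},                   # 1250 > 1200
])
answer = agent_loop(lambda obs: next(script))
print(answer)  # → B
```

The key design point is the persistent `env`: clues computed in one turn stay available in later turns, which is what lets the agent build up evidence across a multi-turn investigation.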
🍞 Top Bread (Hook): Imagine you start practicing piano every day, then suddenly stop practicing at all. Your skills slip! Some AIs do that after training—they stop using their helpful tools.
🥬 Filling (The Actual Concept – Interaction Collapse): What it is: Interaction collapse is when a model, after RL, avoids using tools and shortens its reasoning, producing shallow answers. How it works:
- If training rewards don’t value tool use or multi-step thinking, the model learns to answer quickly.
- Short answers sometimes get partial credit, tricking the learner into thinking less is more.
- Over time, tool calls drop, and performance on hard tasks suffers. Why it matters: Without sustained interaction, the agent cannot unpack complex images or long videos. 🍞 Bottom Bread (Anchor): It’s like trying to win a mystery game by only reading the first sentence of each clue—you’ll miss the solution.
The world before: Large Language Models were great at chatting but less great at acting with pictures and videos. Early multimodal agents used either static toolsets (predefined crop/zoom buttons) or focused on image-only dynamic tooling. Video was mostly handled by sampling frames at fixed rates and tossing them all into the model’s context. That was wasteful and often missed the most relevant moments.
The problem: When researchers trained open-weight multimodal models with RL to use tools, the models often suffered interaction collapse. They stopped calling tools, shortened their responses, and lost the very benefits agentic behavior promises: careful observation, iterative checking, and grounded answers.
Failed attempts: 1) Rewarding only final correctness, without valuing the helpful steps, let the model guess fast instead of working thoroughly. 2) Using all generated rollouts (including broken code runs or groups where every attempt looked the same) made learning noisy and unstable. 3) Uniform video frame sampling stuffed tons of frames into the context, burning tokens without improving decisions.
The gap: A training recipe that both stabilizes agent–tool interaction and actively rewards sustained, multi-turn tool use, plus a video strategy that fetches only the frames that matter.
What this paper adds: PyVision-RL, a unified framework that (a) uses Python as the flexible, dynamic tool; (b) strengthens training with an oversampling–filtering–ranking pipeline that keeps only informative, stable rollouts; (c) adds an accumulative tool reward so useful multi-turn behavior grows instead of collapses; and (d) introduces on-demand context construction for videos, so the agent plots only task-relevant frames.
Real stakes (why you should care):
- Schoolwork: Solving diagram-heavy math or science problems needs zooming, measuring, and checking.
- Work: Reading charts, inspecting medical images, or scanning security footage requires stepwise, tool-using reasoning.
- Everyday life: Picking the best frame from a long video, counting items in a room, or tracking where an object moved benefits from selective, on-demand viewing.
- Cost and speed: Fewer visual tokens means quicker, cheaper answers, especially for long videos.
- Open community: An open-weight recipe empowers researchers and builders to improve and deploy safely.
02 Core Idea
🍞 Top Bread (Hook): Picture a detective who doesn’t try to memorize an entire city—she walks to the block she needs, pulls out a magnifying glass only when helpful, and keeps taking notes until the case is cracked.
🥬 Filling (The Actual Concept – The Aha!): One-sentence insight: Teach the model to be a steady, tool-using investigator and reward it for sustained, useful steps, while only bringing video frames into view when they’re needed.
Multiple analogies:
- Detective analogy: Instead of reading every page in a file, the agent fetches evidence (frames, crops, plots) on demand and keeps going until the answer holds up.
- Chef analogy: Rather than using a fixed six-tool set, the agent opens a full kitchen (Python) and invents the exact utensil or technique for this dish.
- Librarian analogy: For a long video, don’t photocopy the whole book. Photocopy just the pages that matter to your question.
Before vs After:
- Before: RL-trained agents often gave shorter answers and avoided tools; video models crammed uniform frame samples into context.
- After: PyVision-RL nudges agents toward multi-turn, grounded tool use with steady training. For videos, it selectively pulls in frames during reasoning, cutting token costs.
- Before: Many rollouts were useless (broken code, zero-variance groups) and destabilized learning.
- After: Oversampling–filtering–ranking keeps only informative, varied groups, reducing noise and improving stability.
Why it works (intuition, no equations):
- Incentives shape behavior: If rewards only praise the final answer, the model may gamble with short guesses. Adding a gentle bonus for helpful tool use—only when the final answer is correct—encourages deliberate, multi-turn reasoning without rewarding useless activity.
- Data selection stabilizes learning: Filtering out broken or uninformative rollouts prevents the model from learning from noise. Ranking groups by reward spread (standard deviation) focuses training on the sweet spot: not too easy, not too hard.
- On-demand video context saves tokens: Pulling only the frames that matter avoids drowning the model in irrelevant visuals.
- A calmer objective: Removing a noisy normalization term in GRPO keeps gradients steadier and training smoother.
Building blocks (mini-sandwiches for each concept):
- 🍞 Hook: You know how a coach watches replays to plan the next move? 🥬 Reinforcement Learning (RL):
- What it is: Learning by trying actions and getting rewards.
- How: Try → get reward → adjust strategy → repeat.
- Why it matters: It turns passive talkers into active solvers. 🍞 Anchor: A spelling-bee kid improves by seeing which study tricks work.
- 🍞 Hook: Tools make tough jobs easier. 🥬 Dynamic Tooling (Python-as-a-tool):
- What: The agent writes and runs custom Python to crop, measure, or fetch frames.
- How: Think → write code → execute → read clues → continue.
- Why: One flexible tool beats many rigid buttons. 🍞 Anchor: Like using exactly the right kitchen gadget for a tricky recipe.
- 🍞 Hook: Stopping practice makes skills fade. 🥬 Accumulative Tool Reward:
- What: A gentle bonus that increases with helpful tool calls, but only when the final answer is correct.
- How: Correct answer? Add a small extra per useful tool use.
- Why: Prevents the model from quitting tools too early. 🍞 Anchor: Extra points for showing your math work—when the answer is right.
- 🍞 Hook: Why read a whole book for one fact? 🥬 On-Demand Context Construction (video):
- What: Load the video into the runtime, then fetch and plot only the frames you need.
- How: Reason → decide frames → fetch via Python → inspect → continue.
- Why: Cuts token costs while focusing on relevant moments. 🍞 Anchor: Skipping to the highlight reel instead of watching every second.
- 🍞 Hook: Practice on problems that stretch you, not the ones you always ace or always miss. 🥬 Oversampling–Filtering–Ranking Rollout Strategy:
- What: Generate many attempts, drop broken and zero-variance groups, then keep the groups with the most useful variation.
- How: Oversample → filter (broken, zero-variance) → rank by reward spread → train on top groups.
- Why: Stabilizes learning and avoids wasting compute on noise. 🍞 Anchor: Trying several science fair ideas, then perfecting the ones that show promise.
03 Methodology
At a high level: Input (image or video + question) → Model thinks and may write Python code → Sandbox executes code and returns visual/text clues → The loop repeats (think → act → see) until the model outputs the final answer.
Step-by-step (the agentic scaffold):
- The model is prompted to interleave thoughts and code. It can write Python in special blocks. The sandbox runs the code and returns outputs (numbers, plots, processed images) back into the conversation context.
- Why this step exists: It turns the model from a passive reader into an active investigator.
- Example: For “Which triangle is largest?”, the model crops regions, measures areas in pixels, and uses the results to decide.
- Image hint injection (PyVision-Image): The same image is both visible to the model (for language reasoning) and loaded in the Python runtime (for code-based processing).
- Why it matters: The model can talk about what it sees and also manipulate the actual pixels.
- Example: It can zoom into a chart legend or compute color histograms to confirm which circle is darkest.
- Video on-demand context construction (PyVision-Video): The full video is loaded only in the Python runtime. The model cannot see it directly; it must fetch frames with code.
- Why this step exists: It stops the model from stuffing the entire video into context, which wastes tokens and increases confusion.
- How it works: The model reasons about what frames to sample (e.g., the last half for “What happens at the end?”), fetches and plots them, reviews the visuals, and continues.
- Example: To count tables in a room, it samples frames around timestamps where the camera turns, plots them, and tallies distinct tables.
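The fetch-only-what-you-need idea can be sketched with a toy runtime. The frame representation, labels, and sampling timestamps below are all made up for illustration; a real runtime would wrap an actual video decoder and return image frames rather than dicts.

```python
def make_video(n_frames):
    """Toy 'video': one dict per frame, with a label standing in for
    visual content (tables visible between frames 100 and 399)."""
    return [{"t": i, "label": "table" if 100 <= i < 400 else "empty"}
            for i in range(n_frames)]

class VideoRuntime:
    """Holds the full video inside the Python runtime; the model only
    ever sees the frames it explicitly fetches with code."""
    def __init__(self, frames):
        self._frames = frames
        self.fetched = 0   # count of frames actually brought into context

    def fetch(self, timestamps):
        self.fetched += len(timestamps)
        return [self._frames[t] for t in timestamps]

video = VideoRuntime(make_video(900))   # e.g. 30 seconds at 30 fps
# Sample a handful of candidate moments instead of loading all 900 frames.
clues = video.fetch([0, 150, 300, 450, 600, 750])
tables = sum(f["label"] == "table" for f in clues)
print(video.fetched, tables)  # → 6 2
```

Only 6 of 900 frames ever enter the model's context, which is the mechanism behind the ~5K-vs-~45K visual token gap reported in the experiments.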
The secret sauce in training: keeping interaction alive and stable
🍞 Hook: Imagine practicing basketball. If you only count final scores and ignore good passes, your team will stop passing. Training needs to value the right behaviors.
🥬 Filling (Accumulative Tool Reward):
- What it is: A small extra reward that grows with the number of helpful tool calls, but only when the final answer is correct.
- How it works:
- After an attempt (a rollout), check if the final answer is right (accuracy reward).
- Count tool calls during that attempt.
- If the answer was correct, add a small bonus per tool call to the reward.
- Why it matters: It encourages the agent to keep using tools when they help, preventing interaction collapse. 🍞 Anchor: Like getting extra credit for showing proper lab steps—only if the experiment worked.
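The reward logic above can be written down in a few lines. The per-call bonus rate and the cap are illustrative assumptions, not the paper's tuned values; the structural point is that the bonus is gated on final correctness.

```python
def rollout_reward(correct, n_tool_calls, tool_bonus=0.1, bonus_cap=0.5):
    """Accumulative tool reward sketch: accuracy reward plus a small
    per-tool-call bonus, paid only when the final answer is correct.
    `tool_bonus` and `bonus_cap` are illustrative, not the paper's values."""
    if not correct:
        return 0.0   # no credit for tool use that didn't lead anywhere
    return 1.0 + min(tool_bonus * n_tool_calls, bonus_cap)

print(rollout_reward(True, 3))    # → 1.3 (correct, 3 helpful tool calls)
print(rollout_reward(False, 7))   # → 0.0 (wrong answer: no tool bonus)
```

Gating the bonus on correctness is what keeps the incentive honest: spamming tool calls on a wrong answer earns nothing, so the model is only paid for interaction that actually contributes to a solved problem.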
🍞 Hook: If you include broken homework in your grade, your report card won’t reflect real learning.
🥬 Filling (Oversampling–Filtering–Ranking Rollout Strategy):
- What it is: A quality-control pipeline for training attempts.
- How it works:
- Oversample: Generate more rollouts per problem than you need.
- Filtering: Drop groups where every rollout has the same reward (zero variance) and remove broken runs (code timeouts, execution errors, missing images).
- Ranking: Among the rest, compute reward spread (standard deviation) per group, which acts like a difficulty signal; keep top groups with healthy variation.
- Why it matters: Training focuses on informative, stable data and avoids noisy or useless efforts, improving convergence and performance. 🍞 Anchor: Like auditioning many dancers, removing those who never got to perform (broken) and those who all did exactly the same move (zero-variance), then coaching the groups with the most teachable variety.
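The pipeline above can be sketched as a small selection function. The rollout fields (`reward`, `broken`) and the keep budget are assumptions for illustration; the shape of the logic (drop broken runs, drop zero-variance groups, rank survivors by reward spread) follows the text.

```python
import statistics

def select_groups(groups, keep):
    """Oversampling–filtering–ranking sketch. Each group is a list of
    rollouts; each rollout is a dict with 'reward' and 'broken' flags."""
    usable = []
    for g in groups:
        g = [r for r in g if not r["broken"]]   # drop broken tool runs
        if len(g) < 2:
            continue
        spread = statistics.pstdev(r["reward"] for r in g)
        if spread == 0:                          # zero variance: no signal
            continue
        usable.append((spread, g))
    usable.sort(key=lambda x: -x[0])             # rank by reward spread
    return [g for _, g in usable[:keep]]

groups = [
    [{"reward": 1, "broken": False}, {"reward": 0, "broken": False}],  # informative
    [{"reward": 1, "broken": False}, {"reward": 1, "broken": False}],  # zero variance
    [{"reward": 1, "broken": True},  {"reward": 0, "broken": True}],   # all broken
]
print(len(select_groups(groups, keep=2)))  # → 1
```

Only the group whose rollouts disagree survives, which matches the intuition that mixed-outcome groups carry the most teachable signal.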
🍞 Hook: Sometimes a math formula with too many moving parts makes your calculator jittery.
🥬 Filling (Removing Standard Deviation Normalization in GRPO):
- What it is: A small change to the advantage calculation in the RL algorithm (GRPO) that removes a variance-normalization term.
- How it works:
- Compute each rollout’s reward.
- Subtract the group’s average reward to get a centered learning signal.
- Skip dividing by the group’s standard deviation, which can inject noise.
- Why it matters: The gradients become calmer, and training becomes more stable over time. 🍞 Anchor: It’s like smoothing out a bumpy road so the wheels don’t wobble.
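The modified advantage computation is simple enough to state directly: center by the group mean, and skip the division by the group's standard deviation. This is a sketch of that one change, not the full GRPO objective.

```python
def centered_advantages(rewards):
    """Group-relative advantage with std-dev normalization removed:
    subtract the group mean, but do NOT divide by the group's standard
    deviation (the term the text identifies as a noise source)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

adv = centered_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # → [0.5, -0.5, 0.5, -0.5]
```

Centering alone still separates above-average rollouts from below-average ones; dropping the division avoids amplifying tiny, noisy spreads in near-uniform groups.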
Data and training recipe (simplified):
- Warm start with SFT (supervised fine-tuning): • PyVision-Image-SFT: ~7K curated examples emphasizing multi-turn tool use across charts, infographics, medical images, math diagrams, and general VQA. Samples with wrong answers or fewer than two tool turns are filtered out. • PyVision-Video-SFT: ~44K examples for spatial/long-video reasoning so the agent learns to fetch frames on demand.
- RL specialization: • PyVision-Image RL: visual search (e.g., high-res zoom) and multimodal math. • PyVision-Video RL: spatial reasoning on videos.
- Training loop:
- Sample many prompts and generate groups of rollouts.
- Execute code, collect rewards (final correctness + small tool bonus if correct).
- Oversample–filter–rank rollouts; keep the best groups.
- Update the model with stable, centered advantages (without std-dev normalization).
Concrete example walk-throughs:
- Image: “Which circle is darkest?” The agent zooms into regions, plots histograms of pixel intensities with Python, compares distributions, and picks the correct choice. Without tools, it might just guess; with tools, it proves its case.
- Video: “How many tables are in this room?” The agent samples frames at key times, plots them, compares surroundings (chairs, TV stands), and counts distinct tables. It avoids loading every frame into context, saving tokens.
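The image walkthrough amounts to cropping regions and comparing pixel statistics. Here is a minimal stand-in using a hand-built grayscale "image" as nested lists; real agent code would operate on an actual image with numpy or PIL, and the regions and pixel values below are hypothetical.

```python
def mean_intensity(img, box):
    """Average pixel value inside box = (top, left, bottom, right),
    with the bottom/right edges exclusive."""
    top, left, bottom, right = box
    pixels = [img[y][x] for y in range(top, bottom)
                        for x in range(left, right)]
    return sum(pixels) / len(pixels)

# Hypothetical 4x8 grayscale image: bright left half, dark right half.
img = [
    [200, 200, 200, 200, 40, 40, 40, 40],
    [200, 200, 200, 200, 40, 40, 40, 40],
    [200, 200, 200, 200, 40, 40, 40, 40],
    [200, 200, 200, 200, 40, 40, 40, 40],
]
regions = {"A": (0, 0, 4, 4), "B": (0, 4, 4, 8)}  # candidate circle crops
darkest = min(regions, key=lambda k: mean_intensity(img, regions[k]))
print(darkest)  # → B
```

The point of the exercise: instead of guessing from a glance, the agent turns "which is darkest?" into a measurement it can defend.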
The secret sauce: All three pillars—accumulative tool reward, oversampling–filtering–ranking, and on-demand video frame fetching—work together. Rewards keep the agent curious, the rollout pipeline keeps learning stable, and the video scaffold keeps tokens lean. The result is an agent that acts more like a careful scientist: plan, test, observe, and conclude.
04 Experiments & Results
The test: The authors measured how well the models answer questions about images and videos, how efficiently they use visual tokens (context length), and whether training stays stable over time. They used well-known benchmarks for visual search, multimodal math reasoning, tool-using (agentic) reasoning, and video spatial reasoning.
The competition: They compared against strong open-weight baselines and agentic systems, including Qwen2.5-VL-7B (no special tool RL), DeepEyes-v2 (dynamic tooling for images), Video-R1 (text-only video reasoning), and VITAL (agentic video model with predefined clipping tools).
The scoreboard with context:
- PyVision-Image vs baselines on images: • Visual search: +10.2% (V*), +6.5% (HRBench-4K), +6.4% (HRBench-8K) over Qwen2.5-VL-7B. That’s like jumping from a solid B to an A on high-resolution spotting tasks. • Multimodal math: New SOTA over DeepEyes-v2: +4.4% (DynaMath), +3.1% (MathVerse), +9.6% (WeMath). Think of it as solving several more tough diagram problems per test. • Agentic reasoning: +3.8% over base model on TIR-Bench, showing better multi-turn problem solving.
- PyVision-Video vs baselines on videos: • Accuracy: 44.0% on VSI-Bench, surpassing Qwen2.5-VL-7B (38.0%) and VITAL, and competitive with other agentic approaches. • Token efficiency: About 5K visual tokens per sample for PyVision-Video vs ~45K for Qwen2.5-VL-7B at its best setting. That’s nearly a 9x reduction in visual context size while still being more accurate. • Category breakdown highlights: Strong gains on categories like object counting and route/ordering tasks, where fetching the right frames matters most.
Training stability and behavior (surprising but good findings):
- With the accumulative tool reward, the average number of tool calls increases during RL and then stabilizes at a healthy level, showing the agent learns to “keep practicing” rather than give up.
- Oversampling–filtering–ranking sharply reduces “correct-but-penalized” cases (positive samples with negative advantages), cleaning up harmful signals and improving convergence.
- Removing the standard deviation normalization in GRPO lowers variance in the learning signals, leading to smoother training curves and fewer performance dips.
- Larger max turn budgets don’t show instant benefits but unlock higher performance later in training—like building endurance before speed.
Meaning of the numbers:
- The +10.2% jump on V* means the model more reliably zooms into just the right spots in big images.
- The +9.6% on WeMath reflects better diagram reading and math-grounded visual reasoning, not just guesswork.
- The 5K vs 45K token gap on videos means cheaper, faster runs that are less likely to hit context limits, especially on long clips.
Takeaway: The combination of steady incentives (tool reward), careful rollout selection (oversample–filter–rank), and on-demand video frames yields models that interact more, reason deeper, and spend tokens wisely.
05 Discussion & Limitations
Limitations:
- Python tool safety: Because the agent writes code, it must run inside a sandbox with tight permissions. Without careful isolation, it could access files or external resources in unintended ways.
- Reward shaping trade-offs: Paying for tool use (only when the final answer is correct) encourages interaction, but might also lengthen answers when a brief solution would suffice. Tuning the bonus is important.
- Broken code handling: Although the pipeline filters out broken runs, heavy reliance on third-party libraries and runtime stability can still cause hiccups.
- Beyond frames: Some video questions need rich temporal modeling (motions, speeds) that go beyond sampling and plotting frames. Extra temporal tools could help.
- Data coverage: Performance depends on the variety and quality of SFT and RL datasets. Rare skills may need targeted data.
Required resources:
- A secure Python sandbox with plotting and basic vision libraries.
- GPUs for RL fine-tuning; the authors used multiple high-end GPUs.
- A curated prompt pool and reward-checking scripts (answer validators) for training.
When NOT to use:
- Simple questions that don’t benefit from tools (the extra steps may add latency).
- Highly restricted or unsafe environments where sandboxing cannot be guaranteed.
- Tasks demanding millisecond-latency responses without any tool overhead.
Open questions:
- Can we make the tool reward adaptive—encouraging tools when helpful but discouraging them when unnecessary?
- How to expand the toolset safely (e.g., geometry, OCR, tracking) while keeping the runtime secure and stable?
- Can the oversampling–filtering–ranking approach be made even smarter, perhaps learning a data curriculum automatically?
- How well does this transfer to other modalities (audio, 3D, sensors) with on-demand context construction?
- Can we blend on-policy and off-policy RL stably to reuse more rollouts without harming learning?
06 Conclusion & Future Work
Three-sentence summary: PyVision-RL is a unified, open-weight framework that teaches multimodal models to act like careful investigators—thinking in multiple turns, using Python tools, and fetching only the video frames that matter. It prevents interaction collapse with an accumulative tool reward and stabilizes RL via an oversampling–filtering–ranking pipeline, while a small GRPO tweak calms training noise. The result is state-of-the-art image reasoning, competitive and efficient video reasoning, and far fewer visual tokens.
Main achievement: Showing that sustained, rewarded interaction plus on-demand visual processing can make open multimodal agents both stronger and more efficient, especially on videos.
Future directions:
- Smarter, adaptive tool rewards that balance brevity and depth.
- Richer, safer toolboxes (e.g., motion analysis, OCR, segmentation) and tighter sandboxing.
- Extending on-demand context construction to audio streams, 3D scenes, and robotics sensors.
- Data curricula that automatically pick the most teachable rollouts at each stage.
Why remember this: It demonstrates a principled way to keep agents curious and grounded—use tools when they help, fetch visuals only when needed, and learn from the most informative practice runs. That recipe scales better, costs less, and solves harder multimodal problems.
Practical Applications
- Interactive homework helpers that zoom into diagrams, measure angles or areas, and show work for math and science problems.
- Customer support triage that inspects screenshots step by step to locate UI issues and guide users.
- Healthcare imaging pre-checkers that highlight key regions in X-rays or charts for clinician review (with strict sandboxing).
- Security and safety video summarizers that fetch only relevant frames (e.g., entry/exit moments) to answer incident questions.
- Industrial inspection assistants that crop and analyze high-res photos of parts to detect defects or measure wear.
- Data analysts’ copilots that read charts, sample visual evidence, and cross-check numeric claims before reports go out.
- Education tools that teach students how to reason visually, showing intermediate plots and measurements before the final answer.
- Robotics perception debugging, where the agent selectively inspects frames to verify what the robot is seeing and doing.
- Content moderation spot-checkers that sample critical frames instead of reviewing entire videos.
- UX research tools that compare interface screenshots, zoom into problem areas, and generate evidence-backed summaries.