
Phi-4-reasoning-vision-15B Technical Report

Intermediate
Jyoti Aneja, Michael Harrison, Neel Joshi et al. · 3/4/2026
arXiv

Key Summary

  • Phi-4-reasoning-vision-15B is a small, open-weight AI that understands pictures and text together and is especially good at math, science, and using computer screens.
  • It uses a mid-fusion design: a vision encoder turns image parts into tokens that are mixed with words inside a language model for efficient, accurate reasoning.
  • A dynamic-resolution vision encoder lets the model see tiny details on high-resolution screens (like small buttons), which greatly improves grounding and computer-use tasks.
  • The team showed that careful data curation—filtering bad data, fixing errors, and adding smart synthetic examples—matters more than just using huge datasets.
  • Training happens in three stages: align images to text (MLP only), teach general multimodal skills (all parts train), then extend to long/multi-image and safety.
  • The model smartly switches: quick direct answers for simple perception tasks and step-by-step chain-of-thought for harder problems (users can force <nothink> or <think>).
  • On popular tests like AI2D, ChartQA, MathVista, MMMU, OCRBench, and ScreenSpot v2, it reaches accuracy like much slower, bigger models while using far fewer tokens and time.
  • Ablations show higher visual token budgets and dynamic resolution consistently help, especially on high-res benchmarks like ScreenSpot-Pro.
  • Balancing data for math and computer use can improve both at once—adding more math didn’t hurt GUI skills and sometimes boosted them.
  • It’s released with open weights and code so the community can study, reproduce, and extend compact multimodal reasoning systems.

Why This Research Matters

This work shows everyday users can get strong multimodal help without needing giant, expensive models. Apps can run faster and cheaper while still understanding small visual details, like tiny buttons on a screen or small labels on a chart. Students and teachers benefit because the model can both show its work on tough math/science problems and give quick answers when that’s enough. Developers can build reliable computer-using agents that navigate real software UIs with low latency. Open weights and code mean communities can adapt, audit, and improve the model for their own needs. Better data quality practices raise the field’s standards, so future models learn more from fewer tokens.

Detailed Explanation


01Background & Problem Definition

You know how it’s easier to carry a light backpack than a heavy one when you want to move quickly? Early vision–language AIs kept getting bigger and heavier—more parameters, more training tokens, more time to answer—so they were powerful but slow and expensive to run. That meant they were hard to use on everyday computers and phones, especially for interactive tasks like helping you use a website, understanding a dense document, or solving a math problem with a diagram.

Before this work, many teams believed top performance required huge models and trillion-token training. That worked, but the cost and latency were high. Smaller, faster models existed, but they often struggled with tricky reasoning, especially in math and science, or missed tiny visual details on high-resolution screens (like small icons, menu items, or chart tick labels). So you had to choose: fast but forgetful, or smart but slow.

The problem researchers faced was getting the best of both worlds: How do you make a compact model that can (1) see small, important visual details, (2) reason step-by-step when needed, (3) answer quickly when reasoning isn’t needed, and (4) do all this with less training data and compute?

A second challenge was data. Public datasets are mixed in quality—some have wrong answers, broken formats, or confusing prompts. Feeding this mess to a model can teach it bad habits. Finally, models that always show their chain-of-thought can be slower and overly wordy for simple tasks like captioning or OCR, while skipping reasoning entirely can hurt performance on math and science. So, when should a model think out loud, and when should it just answer?

People tried several things that fell short. Early-fusion designs (one big transformer that mixes image patches and text from the start) can be powerful, but they are very costly in memory and compute. Late-fusion systems keep vision and language apart too long and miss useful cross-talk. Some teams poured in more and more data without filtering, which didn’t scale well in quality. Others trained only on reasoning traces, making the model verbose even when a short answer would do. And many used fixed, lower image resolutions, which miss the tiny, crucial bits of information on modern UIs.

What was missing was a practical recipe that combined: a smart, efficient architecture; a vision system that adapts to image detail; and a training approach that mixes direct answers with chain-of-thought on the right examples. Plus, a strong commitment to data quality—filtering, fixing, and augmenting so the model learns from good signal, not noise.

The stakes are real in daily life. If your model can quickly read receipts, follow garment care icons, or highlight the right button in a software tutorial, it saves time. If it can solve a diagram-based physics problem or read a chart accurately, it helps students and scientists. If it runs on modest hardware, schools and small teams can use it. And if it avoids needless long reasoning for simple tasks, apps feel snappy instead of sluggish.

To address this, the team built Phi-4-reasoning-vision-15B—a compact, open-weight multimodal reasoning model. It pairs a proven language backbone (Phi-4-Reasoning) with a strong vision encoder (SigLIP-2), joined through a mid-fusion design. It leans heavily on high-quality, well-filtered data and includes a dynamic-resolution vision setup so the model can see tiny details when they matter. It also learns a mixed mode: quick answers for easy perception tasks and thoughtful step-by-step reasoning when the problem is hard. The result: competitive accuracy with much less training compute and far lower latency at inference time, especially on scientific, mathematical, and computer-use tasks.

02Core Idea

🍞 Hook: Imagine wearing glasses that auto-zoom only when needed and a brain that decides when to think out loud versus when to just say the answer. That’s the heart of this model.

🥬 The Concept (Aha!): The key insight is that careful architecture plus high-quality, curated data can let a small model see better and think smarter—then answer fast when it’s easy and reason step-by-step when it’s hard.

How it works, big picture:

  1. Use a mid-fusion design so image tokens and text tokens can interact efficiently.
  2. Give the vision encoder dynamic resolution so small, important details aren’t missed.
  3. Curate data: filter out wrong or messy samples, fix formatting, and add targeted synthetic examples.
  4. Train the model to switch modes: short direct answers for perception tasks; chain-of-thought for tricky math/science—users can force <nothink> or <think> as needed.

Why it matters: Without this combo, small models either miss tiny details or waste time reasoning out loud for everything. This approach keeps answers accurate and fast.

🍞 Anchor: Think of reading a busy webpage: the model spots the exact small “Submit” button (thanks to dynamic vision), then instantly tells you to click it (<nothink>). If you show a physics diagram, it switches to step-by-step reasoning (<think>) to compute the final answer.

— Multiple analogies —

  • Camera-and-brain: A camera that auto-zooms so you never miss fine print, plus a brain that decides when to talk through a solution vs. just state it.
  • Chef-and-recipe: The chef mixes ingredients (image and text) at the right time (mid-fusion) and decides whether to explain every cooking step (reasoning) or just serve the dish (direct answer).
  • Librarian-and-study: A librarian picks only reliable books (curated data), and a student decides whether to show their work (chain-of-thought) or just write the final answer.

— Before vs After —

  • Before: Small models were fast but often missed details or failed hard problems; big models did better but were slow and costly.
  • After: A compact model reaches competitive accuracy, sees small UI elements and chart details, and responds fast when possible, reasoning only when helpful.

— Why it works (intuition) —

  • Mid-fusion lets vision and language talk early enough to help but not so early that compute explodes.
  • Dynamic resolution ensures the right amount of visual detail, so reasoning isn’t tripped up by poor perception.
  • Data quality amplifies learning—less noise, more signal—so fewer tokens can teach more.
  • Mixed-mode training aligns behavior with task needs, cutting latency without hurting hard-task accuracy.

— Building blocks (each with Sandwich explanations) —

🍞 Hook: You know how a camera turns a scene into pixels your brain can understand? 🥬 Vision Encoder: It’s the part that turns images into tokens the language model can use.

  • How: Split the image into patches, extract features, turn them into visual tokens.
  • Why: Without it, the language model can’t “see” pictures.

🍞 Anchor: Spotting text on a screenshot as tokens the model can read.
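To make the patch-splitting concrete, here is a tiny sketch; the 16-pixel patch size is an illustrative assumption, not SigLIP-2’s exact configuration:

```python
def patch_grid(width: int, height: int, patch: int = 16):
    """Return (cols, rows, num_tokens) when an image is split into
    non-overlapping patch x patch squares, ViT-style."""
    cols, rows = width // patch, height // patch
    return cols, rows, cols * rows

# A 1280x720 screenshot yields an 80x45 grid, i.e. 3600 visual tokens,
# which lines up with the ~3600-token 720p budget discussed in the
# methodology section.
cols, rows, tokens = patch_grid(1280, 720)
```

Each of those 3600 patches then becomes one visual token the language model can attend to.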

🍞 Hook: Imagine a backpack that fits both books (text) and pictures neatly together. 🥬 Compact Open-Weight Multimodal Model: A small AI that understands text and images together; its weights are openly available.

  • How: Vision encoder → projector → language model, all tuned to cooperate.
  • Why: Smaller, accessible, and easier to run.

🍞 Anchor: Running a helpful tutor on a school laptop without cloud costs.

🍞 Hook: Think of two teammates passing notes during a game at just the right time. 🥬 Mid-Fusion Architecture: Image tokens are injected into a language model midstream so both modalities interact efficiently.

  • How: Encode image → project tokens → interleave with text → transformer layers share context.
  • Why: All-early fusion is too heavy; too-late fusion loses synergy.

🍞 Anchor: Answering a question about a chart by referencing both the question words and the chart tokens inside the same context window.
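A toy sketch of the interleaving idea; the real model splices projected embeddings, not strings, and the `<image>` placeholder name is an assumption:

```python
def interleave(text_tokens, visual_tokens, marker="<image>"):
    """Splice visual tokens into the text stream wherever the image
    placeholder appears, so one transformer context holds both."""
    out = []
    for tok in text_tokens:
        if tok == marker:
            out.extend(visual_tokens)  # chart/screen tokens land here
        else:
            out.append(tok)
    return out

# The question words and the chart tokens end up in one shared sequence.
seq = interleave(["Which", "month", "is", "highest", "?", "<image>"],
                 ["v0", "v1", "v2"])
```

Once interleaved, attention layers can link words like “highest” directly to the visual tokens.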

🍞 Hook: Like zooming in on a tiny map label only when you need it. 🥬 Dynamic Resolution Vision Encoder: The encoder adapts token count/resolution so small details are preserved when important.

  • How: Adjust patches/max tokens; use NaFlex-like variants to scale up to HD.
  • Why: Missing small UI elements or fine text ruins downstream reasoning.

🍞 Anchor: Detecting a 12px “OK” button on a 720p screenshot.
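Here is a minimal sketch of the budget-fitting idea behind dynamic resolution; the patch size (16) and token cap (3600) are illustrative assumptions:

```python
import math

def fit_to_budget(width: int, height: int, patch: int = 16,
                  max_tokens: int = 3600):
    """Scale an image down (never up) so its patch count fits the token
    budget, keeping aspect ratio and snapping to patch multiples.
    Rounding can overshoot the budget by a patch row in edge cases;
    a production version would floor instead."""
    if (width // patch) * (height // patch) <= max_tokens:
        return width, height
    scale = math.sqrt(max_tokens * patch * patch / (width * height))

    def snap(v):
        return round(v * scale) // patch * patch

    return snap(width), snap(height)
```

A 1920×1080 desktop falls to 1280×720 under this cap, while anything already within budget is left untouched so small details survive.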

🍞 Hook: You know how teachers check homework and correct mistakes? 🥬 Data Curation: Carefully filtering, fixing, and enriching datasets to boost learning quality.

  • How: Remove low-quality samples, correct answers/formatting, synthesize better captions and Q&A.
  • Why: Bad data teaches bad habits; quality beats quantity.

🍞 Anchor: Rewriting wrong math answers so the model learns the right steps.

🍞 Hook: When you solve a math problem, you might show your work. 🥬 Chain-of-Thought Reasoning: The model writes out steps inside <think>…</think> before the final answer.

  • How: Train on step-by-step examples; use special tokens.
  • Why: Hard problems need structured reasoning; otherwise errors creep in.

🍞 Anchor: Explaining each step of a physics derivation from a diagram.

🍞 Hook: Sometimes you just answer; sometimes you explain. 🥬 Hybrid Reasoning Model: One model that defaults to quick answers for easy tasks and uses chain-of-thought for hard ones.

  • How: Mix reasoning and non-reasoning data; tag samples with <think> or <nothink>.
  • Why: Saves time and tokens while keeping accuracy high on tough tasks.

🍞 Anchor: It captions a receipt quickly, but shows steps for a geometry proof with a figure.
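A hypothetical helper showing how a user might apply the mode override; the exact tag placement in the released chat template may differ:

```python
def format_prompt(question: str, mode: str = "auto") -> str:
    """Hypothetical prompt builder for the <think>/<nothink> override.
    'think' forces chain-of-thought, 'nothink' forces a direct answer,
    'auto' lets the model pick its own mode."""
    if mode == "think":
        return question + "\n<think>"    # force step-by-step reasoning
    if mode == "nothink":
        return "<nothink>" + question    # force a direct, concise answer
    return question                      # learned default behavior
```

For latency-sensitive apps you might default to `"nothink"` for OCR-style queries and `"auto"` everywhere else.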

03Methodology

At a high level: Image(s) + Text → Vision Encoder + Projector (make visual tokens) → Mid-fusion with Language Model (reason or not) → Output (direct answer or chain-of-thought + answer).
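The whole pipeline above can be caricatured in a few lines; every argument here is an illustrative stand-in, not the real API:

```python
def forward(image_features, text_tokens, project, llm):
    """Minimal mid-fusion forward pass: project each visual feature into
    the LLM's embedding space, splice it into the text stream, decode.
    'project' and 'llm' are toy callables standing in for the MLP
    projector and the Phi-4-Reasoning backbone."""
    visual_tokens = [project(f) for f in image_features]
    sequence = visual_tokens + text_tokens  # simplest interleaving: image first
    return llm(sequence)

# Toy stand-ins: the "projector" scales features, the "LLM" echoes input.
out = forward([1, 2], [30], project=lambda f: f * 10, llm=lambda s: s)
```

The point is only the data flow: pixels never reach the language model, projected tokens do.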

Stage-by-stage recipe:

• Stage 1: MLP Pretraining (Alignment)

  • What happens: Freeze vision encoder and language model. Train only the cross-modality projector (an MLP) so SigLIP-2 features align with the Phi-4-Reasoning text space.
  • Why: Without this warm-up, visual tokens and word tokens speak different “languages,” making learning unstable.
  • Example: Feed clean image–caption pairs (e.g., a photo of a bike → “A red bicycle parked by a fence”). The MLP learns to map visual features so the LLM can understand them.

• Stage 2: Instruction Tuning (Core Skills)

  • What happens: Unfreeze everything. Train on single-image instruction data spanning captioning, VQA, OCR, grounding, charts, documents, math/science reasoning, and computer-use (GUI) tasks. Mix reasoning samples (<think>…</think>) and direct-answer samples (<nothink> at start).
  • Why: This teaches broad multimodal competence and when to reason vs. when to answer directly. Without it, the model would either be verbose on everything or too shallow on hard problems.
  • Example: “Look at this chart and tell me the highest bar.” For simple perception, the model answers directly. For “Given this physics diagram, compute the spring constant,” it writes steps inside <think>.

• Stage 3: Long Context, Multi-Image, and Safety (RAI)

  • What happens: Continue training on long documents, multi-image sequences (e.g., time-lapse screens), and safety data (e.g., harmful content refusal). Increase max sequence length for longer inputs/outputs.
  • Why: Without long-context training, the model loses track in extended tasks. Without safety, it might respond inappropriately to risky prompts or images.
  • Example: A PDF with many pages: the model scans multiple figures and text to answer a summary question; for risky content, it refuses politely.
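The three-stage freeze/unfreeze schedule can be summarized as a small lookup (a simplification of the recipe above; Stage 3 also raises the max sequence length):

```python
def trainable_parts(stage: int) -> dict:
    """Which components receive gradient updates in each training stage,
    per the three-stage recipe described above (simplified)."""
    if stage == 1:        # alignment: only the MLP projector trains
        return {"vision_encoder": False, "projector": True,
                "language_model": False}
    if stage in (2, 3):   # instruction tuning, then long-context/safety
        return {"vision_encoder": True, "projector": True,
                "language_model": True}
    raise ValueError("stage must be 1, 2, or 3")
```

In a torch-style training loop, these flags would drive `requires_grad` on each component’s parameters.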

Inside the model pipeline:

  1. Input preparation
  • Convert images to patches; choose dynamic resolution settings (e.g., up to ~3600 visual tokens for 720p) based on encoder limits.
  • Tokenize text prompt; prepend special mode tokens (<nothink> or let the model decide, and allow <think> in training data).
  • Why: Missing resolution leads to lost details; missing tags makes the model unsure when to reason.
  • Example data: A Windows desktop screenshot (1920×1080) may be downscaled or tiled; dynamic encoders try to keep enough tokens to preserve small UI elements.
  2. Vision encoding and projection
  • SigLIP-2 extracts dense visual features; MLP projects them into the LLM’s embedding space, producing “soft” visual tokens.
  • Why: The LLM can’t work directly with raw pixels. Projection makes image tokens compatible with text embeddings.
  • Example: A chart image becomes a series of tokens representing axes, bars, labels, and legend cues.
  3. Mid-fusion in the language model
  • Interleave visual tokens with text tokens; feed them through the transformer layers of Phi-4-Reasoning.
  • Why: This allows contextual grounding: the question “Which month had the highest sales?” can attend to chart tokens referencing months and bars.
  • Example: The attention heads link “highest” with tall-bar tokens and read the label tokens to produce the correct month.
  4. Decoding strategies and modes
  • Default: mixed behavior learned from data. For perception-like tasks (captioning, OCR, simple VQA), the model tends to reply concisely. For complex math/science, it tends to produce a <think> block, then the final answer.
  • Override: Users can force <nothink> for speed or <think> for transparency.
  • Why: Without control, you get either slowness (over-reasoning) or brittleness (under-reasoning).
  • Example: “Read the total from this receipt” → short answer; “Solve for x from this geometry diagram” → step-by-step derivation.
  5. Data curation and augmentation (the secret sauce)
  • Systematic filtering: Remove low-quality or unanswerable Q&A, fix broken formats (e.g., mis-placed final answers, wrong tags), and drop flawed images.
  • Error correction: Regenerate wrong answers/captions with strong models and verification pipelines; keep only high-confidence results.
  • Synthetic augmentation: Use good images to generate better captions, math-diagram descriptions, multi-image matching, and “what’s changed?” sequences.
  • Balanced mixes: Experiment with math vs. computer-use proportions; add specialized GUI datasets (e.g., Phi-Ground) to boost grounding.
  • Why: Garbage in, garbage out. High-quality, diverse, and well-formatted data teaches reliable perception and reasoning with fewer tokens.
  • Example: For every math diagram, also generate a clean, detailed description so the model learns both to see and to reason.
  6. Resolution handling ablations
  • Tried dynamic resolution (NaFlex-like SigLIP-2 up to ~2048 or ~3600 tokens), multi-crop, and multi-crop with S-tiling.
  • Findings: More visual tokens and dynamic resolution consistently help on high-res tasks; 3600-token settings notably improved ScreenSpot-Pro.
  • Trade-off: Higher context length increases compute quadratically; use text-task-aware strategies in future to allocate tokens where the question points.
  7. Optimization details
  • Optimizer: AdamW; precision: bf16; ZeRO-1 for memory efficiency; cosine LR schedules with warmup; one epoch per stage with large curated datasets.
  • Why: Stable, efficient training lets the small model benefit maximally from the curated data.
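As a sketch of the learning-rate schedule family named above (cosine with warmup); the hyperparameter values are placeholders, not the report’s actual settings:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int) -> float:
    """Cosine decay with linear warmup: ramp linearly to peak_lr over
    warmup_steps, then follow a half-cosine down to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps            # linear ramp-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

The schedule peaks exactly at the end of warmup and decays smoothly, which helps keep bf16 training stable.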

Secret sauce summary:

  • Marrying mid-fusion efficiency, dynamic vision detail, and mixed-mode reasoning, all powered by high-quality, fixed-and-augmented data. Each part protects the others: accurate perception feeds correct reasoning; mixed modes cut latency; curated data keeps learning on track.

04Experiments & Results

The tests asked: Can a compact model see small details, reason when needed, stay fast, and match or beat similarly sized open-weight models? The team used two open frameworks (Eureka ML Insights and VLMEvalKit) to fairly measure accuracy, latency, and output token counts.

Benchmarks covered a wide range:

  • AI2D: Diagram understanding.
  • ChartQA: Quantitative reasoning from charts.
  • MathVista, MathVision, MathVerse: Mathematical reasoning in visual contexts.
  • MMMU and MMStar: Broad multimodal understanding across disciplines.
  • OCRBench: Text reading from images.
  • ScreenSpot v2 (and Pro in ablations): GUI grounding on high-res screens.

Competition included widely used open-weight families (Qwen3-VL, Kimi-VL, Gemma 3, and prior Phi variants), both in default non-thinking and thinking modes. To compare user-like latency, the team sampled subsets from ChartQA, MathVista, MMMU, and ScreenSpot, and measured wall-clock time and tokens on the same hardware (single H100, batch=1, temperature=0).
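A rough sketch of that kind of batch=1 wall-clock measurement; `generate` here is any stand-in callable returning output tokens, not the model’s real interface:

```python
import time

def measure_latency(generate, prompt, runs: int = 3):
    """Return (mean_seconds, output_token_count) for repeated
    single-prompt calls, in the spirit of the setup described above."""
    seconds = []
    out = []
    for _ in range(runs):
        start = time.perf_counter()
        out = generate(prompt)
        seconds.append(time.perf_counter() - start)
    return sum(seconds) / runs, len(out)
```

Fixing temperature to 0 (greedy decoding) in such a harness makes token counts deterministic, so accuracy, latency, and verbosity can be compared on equal footing.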

Scoreboard with context:

  • AI2D: 84.8%—like scoring an A when many peers are scoring A-/B+.
  • ChartQA: 83.3%—strong chart reading and number reasoning.
  • MathVista (MINI): 75.2%—solid performance on visual math.
  • MMMU (VAL): 54.3%—competitive across diverse subjects for a compact model.
  • OCRBench: 76%—reads text reliably.
  • ScreenSpot v2: 88.2%—excellent GUI element grounding on screens.

Across the board, the model’s default mixed-reasoning behavior outperformed forcing always-thinking or always-non-thinking, with only a few exceptions (e.g., forcing thinking helped on some math-heavy sets; forcing non-thinking helped on ScreenSpot v2). That means the learned switch generally chooses the right mode.

Compute trade-offs:

  • When plotting accuracy versus latency and output tokens, Phi-4-reasoning-vision-15B pushed the Pareto frontier: similar or better accuracy than models with far more time and tokens, and better accuracy than equally fast peers. In simple terms: it gets high scores without making you wait.
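The Pareto-frontier idea can be made concrete with toy numbers (these are invented for illustration, not the paper’s measurements):

```python
def pareto_frontier(models):
    """models: list of (name, accuracy, latency_seconds) tuples.
    A model sits on the frontier if no other model is at least as
    accurate AND at least as fast, with strict improvement in one."""
    names = []
    for name, acc, lat in models:
        dominated = any(
            a >= acc and l <= lat and (a > acc or l < lat)
            for other, a, l in models if other != name
        )
        if not dominated:
            names.append(name)
    return names

# "C" is both less accurate and slower than "A", so it drops out.
frontier = pareto_frontier([("A", 80.0, 1.0),
                            ("B", 85.0, 3.0),
                            ("C", 78.0, 2.0)])
```

“Pushing the frontier” means landing points that knock previously non-dominated models off this list.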

Surprising findings and ablations:

  • Dynamic resolution matters: Allowing up to ~3600 visual tokens (about 720p) improved results on high-resolution tasks, especially ScreenSpot-Pro, compared to ~2048-token caps.
  • Multi-crop with S-tiling beat standard multi-crop even with fewer tokens, reinforcing that how you allocate vision tokens (not just how many) matters.
  • Data ratio experiments: Increasing math data 3× (keeping computer-use fixed) improved both math and GUI tasks. Adding specialized GUI grounding data (Phi-Ground) gave big gains on ScreenSpot-V2. This suggests shared underlying skills and the value of targeted domain data.
  • Mixed-mode advantage: Training with both <think> and <nothink> examples yielded the best default behavior—concise on easy tasks, step-by-step on hard ones—reducing unnecessary verbosity without sacrificing tough-problem accuracy.

Interpretation:

  • Accurate perception is a prerequisite for good reasoning. If you can’t see the tiny details, no amount of thinking will help. Dynamic resolution and careful image tokenization consistently paid off.
  • Data quality is king. Systematic filtering, error correction, and smart synthetic augmentation produced larger gains than just piling on more raw data.
  • A single compact model can excel across domains with the right mix: it doesn’t have to choose between math/science and computer use.

Bottom line: Relative to similarly sized, open-weight models, Phi-4-reasoning-vision-15B delivers a rare combo—competitive accuracy, fast responses, and fewer tokens—especially on scientific/mathematical problems and on computer-use tasks that demand high-resolution grounding.

05Discussion & Limitations

Limitations:

  • Larger proprietary models still win on broad, open-ended tasks and niche, ultra-detailed perception. This model aims for the best accuracy-per-compute, not absolute maximum accuracy.
  • Mode switching isn’t perfect. Sometimes it reasons when a short answer would do, or answers briefly when step-by-step would help. Users can override with <think> or <nothink>.
  • Ultra-fine visual nuances can be missed, especially beyond the chosen token budget/resolution. Critical use cases should verify outputs.

Required resources:

  • A single modern GPU (e.g., A100/H100 or strong consumer GPUs) runs the 15B model with responsive latency, especially when using efficient inference stacks.
  • Storage for the checkpoint and enough RAM/VRAM for mid-fusion context lengths (remember, vision tokens increase sequence length).
  • For finetuning or domain adaptation: curated, well-formatted datasets and standard training frameworks (DeepSpeed/ZeRO, bf16, AdamW).

When not to use:

  • If you need the absolute best generalist performance across arbitrary domains and don’t care about speed or cost, very large proprietary models may outperform this.
  • If your task demands ultra-precise reading of tiny, low-contrast features beyond the tested resolutions, consider higher-resolution pipelines or specialized OCR/vision models.
  • If you require guaranteed, always-correct chain-of-thought transparency (e.g., for strict auditing), remember the model may sometimes choose not to reason unless forced.

Open questions:

  • Optimal vision token allocation: Can prompts steer the encoder to focus on question-relevant regions and save tokens elsewhere (true question-conditioned tokenization)?
  • Best reasoning/non-reasoning split: Is ~20% reasoning data ideal, or does it depend on domain and latency targets? Can adaptive policies learn per-query mode selection more reliably?
  • Scaling laws for data composition: At much larger scales, do math and computer-use skills still reinforce each other, or do trade-offs appear?
  • Efficient high-res: Can we get ScreenSpot-Pro-level gains with smarter token placement rather than simply raising token budgets, to avoid quadratic attention costs?
  • Agentic robustness: How to ensure stable multi-step GUI interaction across diverse apps and layouts with minimal hallucination and safe refusals?

06Conclusion & Future Work

In three sentences: Phi-4-reasoning-vision-15B shows that a compact, open-weight multimodal model can be both sharp and speedy by combining mid-fusion architecture, dynamic-resolution vision, and carefully curated data. It learns to switch between quick direct answers and chain-of-thought reasoning, delivering strong performance on math, science, and computer-use tasks with far less compute and latency than many peers. The work pushes the accuracy-vs-cost Pareto frontier and provides open artifacts so others can replicate and build on these ideas. The main achievement is proving that data quality plus smart architectural choices let a small model see tiny details, ground precisely, and reason effectively—without flooding every task with long chains-of-thought. Looking ahead, research on question-conditioned visual tokenization, optimal reasoning mode selection, and larger-scale data composition could further raise accuracy while keeping latency low. Remember this paper as a roadmap for practical, accessible multimodal reasoning: see clearly, think when needed, answer fast when possible—and do it all in a compact, open package.

Practical Applications

  • Build a desktop helper that highlights the exact UI element to click during software tutorials.
  • Create an education tool that explains math and physics diagram problems step-by-step when requested.
  • Deploy a mobile app that quickly reads receipts, menus, and signs with concise OCR outputs.
  • Develop a data dashboard assistant that answers chart questions and spots trends from screenshots.
  • Automate QA for web pages by grounding buttons, links, and forms and verifying accessibility labels.
  • Construct a document understanding bot that summarizes multi-page PDFs and extracts key fields.
  • Integrate into robotics or RPA systems to detect on-screen state and decide next actions reliably.
  • Assist customer support by interpreting user-submitted screenshots and proposing precise fixes.
  • Enable low-latency classroom devices to run a helpful tutor locally with open weights.
  • Fine-tune on domain images (e.g., medical forms, lab equipment panels) to improve specialized workflows.
#multimodal reasoning#vision-language model#mid-fusion#dynamic resolution#GUI grounding#chain-of-thought#data curation#SigLIP-2#Phi-4-Reasoning#instruction tuning#Pareto frontier#open-weight model#OCR#ablation study#responsible AI