When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains
Key Summary
- •This paper asks a simple question: does reinforcement learning (RL) truly make medical vision-language models (VLMs) smarter, or does it just help them choose better among answers they already know?
- •Using a controlled testbed (MedMNIST), the authors separate three ingredients: the vision part, supervised fine-tuning (SFT), and RL.
- •They find that SFT expands what answers the model can produce at all (its support), while RL mostly sharpens choices so the first try is right more often.
- •If a model already has many correct answers hidden in its possibilities (high Pass@K), RL boosts Accuracy@1; if not, RL helps little and can even shrink support.
- •They propose a boundary-aware recipe: first measure support (Pass@K vs Accuracy@1), bridge low support with SFT, then apply RL to sharpen.
- •They show via linear probing that the vision towers are decent on many tasks, so many failures come from language-side alignment and decoding, not raw sight.
- •With a small, balanced slice of PMC-VQA and consistency-aware GRPO, their RL-tuned OctoMed model achieves strong average results across six medical VQA benchmarks.
- •Takeaway: RL is not magic; it’s best used after SFT has raised support, mainly to convert hidden knowledge into reliable first-try answers.
- •This approach makes medical AI more predictable and efficient, which matters for clinical trust and safety.
Why This Research Matters
This work gives a practical, reliable way to improve medical AI without guesswork about training choices. By measuring both first-try accuracy and hidden support, teams can decide when to collect more data (SFT) and when to use RL to make answers more dependable. That leads to AI assistants that are not just sometimes right, but right on the first try—crucial in clinics where time and trust matter. The approach also avoids over-sharpening that can make models brittle, helping maintain performance across closely related tasks. Finally, the recipe works with small, balanced RL datasets, making it more feasible for hospitals and researchers with limited labeled data.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how a good doctor looks at an X-ray and also listens to your story to decide what’s going on? They use eyes and words together.
🥬 Filling (The Actual Concept): Vision-Language Models (VLMs) are AIs that look at pictures and read text together to answer questions or explain what they see. How it works:
- A vision part turns the picture into features (like notes about shapes and textures).
- A language part reads the question and the vision features.
- The model generates an answer in text. Why it matters: Without VLMs, we’d need separate systems for vision and text, and they’d miss how images and words help each other. 🍞 Bottom Bread (Anchor): A VLM can read, “Does this chest image show pneumonia?” while looking at the X-ray, and answer yes/no with a short explanation.
🍞 Top Bread (Hook): Imagine a hospital with many departments—radiology, dermatology, pathology—each with its own kind of image.
🥬 Filling (The Actual Concept): A modality is a type of medical image (like X-ray, MRI, skin photo, or microscope slide). How it works:
- Each modality has different patterns and noise.
- Models must learn features that work for each modality.
- Some features transfer across related modalities, others don’t. Why it matters: Without understanding modality differences, one model may fail when the image type changes. 🍞 Bottom Bread (Anchor): A model trained on retinal photos may not work on pathology slides without extra help.
🍞 Top Bread (Hook): Think of studying with an answer key before a test.
🥬 Filling (The Actual Concept): Supervised Fine-Tuning (SFT) is training with labeled examples so the model learns correct input→output pairs in the target domain. How it works:
- Gather image–question–answer examples.
- Train the model to reproduce the answers.
- Repeat until it does well on validation data. Why it matters: Without SFT, a general model may talk well but miss domain-specific facts. 🍞 Bottom Bread (Anchor): After SFT on medical Q&A, a model stops giving general health tips and starts giving imaging-specific answers.
🍞 Top Bread (Hook): Picture a puppy learning tricks with treats.
🥬 Filling (The Actual Concept): Reinforcement Learning (RL) teaches a model by giving rewards for good answers, guiding it to pick better actions over time. How it works:
- The model proposes an answer (sometimes with reasoning steps).
- A reward function checks if it’s right (verifiable rewards for multiple choice).
- The model updates to favor answers that earned higher rewards. Why it matters: Without RL, the model may know the answer somewhere in its brain but not choose it reliably on the first try. 🍞 Bottom Bread (Anchor): For a multiple-choice question, RL nudges the model to pick the correct option letter more consistently.
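To make the verifiable-reward idea concrete, here is a minimal Python sketch. The helper name and the letter-parsing rule are illustrative assumptions, not the paper's actual implementation:

```python
import re

def multiple_choice_reward(model_output: str, gold_letter: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the option letter the model
    commits to matches the gold answer, else 0.0. Looks for the first
    standalone capital letter A-E, optionally followed by ')' or '.'."""
    match = re.search(r"\b([A-E])\b[).]?", model_output.strip())
    if match is None:
        return 0.0  # no parseable option letter: no reward
    return 1.0 if match.group(1) == gold_letter.upper() else 0.0
```

Because the reward is checked mechanically against the answer key, there is no need for a learned reward model, which is what makes multiple-choice tasks so convenient for RL.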
The world before this research: Medical VLMs were getting better thanks to large-scale pretraining and SFT. But results after RL were inconsistent across image types and tasks. Some papers reported nice gains; others saw gains in one area but drops in another. People didn’t know whether RL was improving vision, improving reasoning, or just changing how the model samples answers.
The problem: Clinicians need reliability. It’s not enough that a model can be right sometimes; it should be right on the first try. Researchers suspected RL helps with “reasoning,” but it wasn’t clear if RL truly creates new abilities beyond what the model could already do.
🍞 Top Bread (Hook): Imagine a student who has the right answer somewhere on a scratch sheet but marks the wrong bubble on the test.
🥬 Filling (The Actual Concept): Accuracy@1 is whether the model’s top answer is correct on the first try; Pass@K checks if any of K tries includes the correct answer. How it works:
- Accuracy@1: generate once; check if correct.
- Pass@K: generate K times; success if any is correct.
- Compare the two to see hidden potential. Why it matters: Without both metrics, we can’t tell if errors are lack of knowledge or just poor first-try choices. 🍞 Bottom Bread (Anchor): If Accuracy@1 is 30% but Pass@16 is 70%, the model “knows” the answers but struggles to pick them first.
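A minimal sketch of how these two metrics could be computed from model outputs; the helper names and toy data are illustrative, not the paper's evaluation code:

```python
def accuracy_at_1(greedy_answers, gold):
    """Fraction of questions whose single greedy answer is correct."""
    return sum(a == g for a, g in zip(greedy_answers, gold)) / len(gold)

def pass_at_k(sampled_answers, gold):
    """Fraction of questions where ANY of the K sampled answers is correct."""
    return sum(g in samples for samples, g in zip(sampled_answers, gold)) / len(gold)

# Toy example: 4 questions, 1 greedy answer and K=2 samples each.
greedy  = ["B", "C", "A", "D"]
samples = [["B", "A"], ["A", "B"], ["C", "C"], ["D", "A"]]
gold    = ["B", "B", "A", "A"]
# accuracy_at_1 -> 0.5, pass_at_k -> 0.75: hidden ability worth 0.25
```

The gap between the two numbers is exactly the "hidden potential" the paper talks about: answers the model can produce but does not pick first.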
Failed attempts: Many pipelines did SFT then RL, but because data mixtures and rewards differed, nobody could easily tell what improved what. Some RL models even got worse when moved to new modalities. It was like sharpening a pencil that didn’t have enough graphite—the point got finer, but there wasn’t much to write with.
🍞 Top Bread (Hook): Think of a vending machine stocked with snacks. Some snacks are there, others are missing.
🥬 Filling (The Actual Concept): Support is the set of answers the model can realistically produce; high support means the correct answer appears among samples. How it works:
- Sample multiple answers.
- If the right one appears often enough, support is non-trivial.
- If it rarely appears, support is weak. Why it matters: Without support, no amount of sharpening can make the right answer show up. 🍞 Bottom Bread (Anchor): If the vending machine never stocks pretzels, pressing buttons differently won’t ever get pretzels.
The gap this paper fills: It cleanly separates three axes—vision quality, SFT, and RL—and shows when each matters. It demonstrates that SFT tends to expand support, while RL mostly sharpens choices, improving first-try accuracy when support already exists.
Real stakes: In clinics, trust comes from being right consistently, not just sometimes. If RL is used at the wrong time, it might make the model overconfident but not more capable. Used at the right time, it turns hidden know-how into reliable answers, saving time and reducing risk.
02 Core Idea
The “Aha!” moment in one sentence: RL helps most after SFT has already expanded the model’s support; then RL mainly sharpens the model’s output so the first try is right more often.
🍞 Top Bread (Hook): Imagine sorting marbles: first you collect enough good marbles, then you polish them so they shine.
🥬 Filling (The Actual Concept): Sharpening means reweighting the model’s existing answer choices so it picks correct ones more consistently on the first try. How it works:
- Measure how often correct answers appear across multiple samples (support).
- If support is decent, apply RL with verifiable rewards.
- RL increases the probability of correct answers and decreases distractors. Why it matters: Without sharpening, the model may keep hemming and hawing; with over-sharpening and low support, it can become confidently wrong. 🍞 Bottom Bread (Anchor): With Pass@16 already high, RL boosts Accuracy@1 like turning a bag of mostly-good marbles into a handful of reliably excellent ones.
Three analogies:
- Library analogy: SFT is adding the right medical books to the shelf (expanding support). RL is adding sticky notes to the best chapters so you open the right page first (sharpening).
- Sports analogy: SFT teaches all the plays in the team’s playbook (coverage). RL practices which play to call first under pressure (first-try accuracy).
- Vending machine analogy: SFT stocks the correct snack (support). RL rearranges the buttons so the most popular snack is picked first (sampling efficiency).
Before vs After:
- Before: People assumed RL injects new reasoning ability by itself.
- After: This paper shows RL usually reshapes the output distribution—great when correct answers already live there, limited when they don’t.
Why it works (intuition):
- If correct answers already appear among multiple samples (high Pass@K), then the model’s representation and knowledge are sufficient. The problem is selection, not creation. RL with verifiable rewards increases the chance the correct option becomes the top pick, lifting Accuracy@1 and reducing the gap between Pass@K and Accuracy@1.
- If correct answers rarely appear (low Pass@K), the issue is missing support—insufficient visual understanding, domain knowledge, or alignment. RL can overfit to narrow cues, sometimes reducing Pass@K by pruning the space too aggressively.
🍞 Top Bread (Hook): Think of a road with a guardrail showing how far you can safely drive.
🥬 Filling (The Actual Concept): A support boundary is the line between what the model can already produce with some sampling and what it can’t. How it works:
- Measure Accuracy@1 and Pass@K.
- The larger the gap, the more hidden ability exists.
- Use this to decide whether to bridge (expand) or sharpen (reweight). Why it matters: Without seeing the boundary, you might sharpen too early or keep training blindly. 🍞 Bottom Bread (Anchor): If Pass@16 is high but Accuracy@1 is low, you’re inside the boundary—time to sharpen. If both are low, you’re outside—time to bridge.
Building blocks (each introduced with a mini-sandwich):
- 🍞 Hook: You know how a picture book has images and captions. 🥬 Concept: Vision tower is the part that turns images into features the language model can read. How: Patch the image → encode with a transformer → produce embeddings. Why: Without it, the text model can’t see. 🍞 Anchor: The tower notes “bright spot here, texture there” so the text model can discuss them.
- 🍞 Hook: Testing if pictures are easily separable with a simple rule. 🥬 Concept: Linear probing fits a simple classifier on frozen image features to test how good they are. How: Freeze the vision tower → train a linear layer → measure accuracy. Why: Without probing, you can’t tell if failures come from vision or language. 🍞 Anchor: If a linear probe already classifies OCT images well, the issue is likely not sight but answering.
- 🍞 Hook: Imagine measuring how often your first guess is right versus how often you get it in three tries. 🥬 Concept: Accuracy@1 and Pass@K are paired metrics to separate first-try reliability from hidden support. How: One sample for Accuracy@1; several for Pass@K. Why: Without both, you mistake selection problems for knowledge gaps. 🍞 Anchor: 30% vs 70% tells you the model knows it but doesn’t pick it.
- 🍞 Hook: Deciding whether to study more or to practice test-taking. 🥬 Concept: Boundary-aware recipe decides between SFT (bridge) or RL (sharpen) based on measured support. How: Measure Pass@K and Accuracy@1 → if low support, do SFT; if good support, do RL. Why: Without the recipe, you waste effort and may harm generalization. 🍞 Anchor: It’s like adding missing chapters before drilling multiple-choice strategies.
03 Methodology
At a high level: Input (medical images + questions) → Measure support (Accuracy@1, Pass@K) and vision quality (linear probe) → Bridge support with SFT if needed → Apply RL to sharpen → Output (higher Accuracy@1 with maintained Pass@K).
🍞 Top Bread (Hook): Imagine a science fair where you test each part of your robot car separately.
🥬 Filling (The Actual Concept): MedMNIST is a compact, multi-modality testbed to fairly test vision, SFT, and RL effects. How it works:
- Use standardized 224×224 images across 12 tasks spanning radiology, microscopy, and visible-light photos.
- Create multiple-choice prompts from class labels.
- Run fast, controlled experiments. Why it matters: Without a controlled testbed, you can’t isolate what helps and what doesn’t. 🍞 Bottom Bread (Anchor): If a model does well on PathMNIST after SFT but not after RL, you can spot that cleanly.
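A sketch of how class labels could be turned into multiple-choice prompts; the exact prompt wording and helper name are assumptions, not the authors' template:

```python
def make_mc_prompt(question: str, class_names: list[str]) -> tuple[str, dict]:
    """Turn a classification task into a multiple-choice prompt by
    mapping each class label to an option letter."""
    letters = "ABCDEFGHIJ"
    options = {letters[i]: name for i, name in enumerate(class_names)}
    lines = [question] + [f"{k}) {v}" for k, v in options.items()]
    lines.append("Answer with the option letter only.")
    return "\n".join(lines), options

prompt, options = make_mc_prompt(
    "What tissue type does this pathology patch show?",
    ["adipose", "background", "debris"],
)
```

Framing every task as multiple choice is what makes the rewards verifiable later: checking an option letter is unambiguous in a way that grading free text is not.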
Step A: Probe visual perception (vision tower + linear probing)
- What happens: Freeze each model’s vision encoder; train a simple linear classifier on MedMNIST tasks.
- Why this step exists: If the vision features are weak (low probe accuracy), language-side tricks won’t fix missing sight. If they’re strong, problems likely lie in alignment/decoding.
- Example: On several tasks, probes are strong (near MedViT-v2 on some) for the base and improve with medical SFT, but RL doesn’t improve probe accuracy—pointing to sampling/alignment effects.
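The probing step can be illustrated on synthetic "frozen features" with a simple least-squares linear classifier. This toy stands in for the paper's actual setup, which trains a linear layer on real frozen vision-tower embeddings and MedMNIST labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend frozen vision-tower features: 2 classes, made linearly separable
# by shifting class-1 features. Real probes use actual encoder embeddings.
n, d = 200, 16
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, d)) + y[:, None] * 2.0

# "Linear probe": fit a least-squares linear classifier on frozen features.
Xb = np.hstack([X, np.ones((n, 1))])               # add a bias column
w, *_ = np.linalg.lstsq(Xb, 2.0 * y - 1.0, rcond=None)
probe_acc = np.mean((Xb @ w > 0) == (y == 1))      # near 1.0 on this toy
```

A high probe accuracy means the features already separate the classes, so any downstream failure must come from the language side, which is exactly the diagnostic the authors run.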
🍞 Top Bread (Hook): Think of measuring both how often you nail the answer on the first try and whether you can get it within a few tries.
🥬 Filling (The Actual Concept): Accuracy@1 (first-try) and Pass@K (K-tries) diagnose hidden capability vs selection. How it works:
- Turn each classification task into multiple-choice.
- Greedy decoding gives Accuracy@1.
- Temperature sampling K times gives Pass@K. Why it matters: Without this pair, you don’t know if RL is making new knowledge or just better picks. 🍞 Bottom Bread (Anchor): A big gap (low Accuracy@1, high Pass@K) means there’s support to sharpen.
Step B: Characterize the support boundary
- What happens: Plot Pass@K curves and compare to Accuracy@1 across modalities; see how SFT and RL shift them.
- Why this step exists: To see if SFT expands coverage (raises Pass@K) and if RL sharpens (raises Accuracy@1, possibly narrowing the gap).
- Example: SFT boosts both metrics broadly; an off-the-shelf medical RL baseline often doesn’t lift Accuracy@1 on MedMNIST and sometimes reduces Pass@K.
🍞 Top Bread (Hook): If you already have the right answer in your top guesses, practicing which one to pick first helps a lot.
🥬 Filling (The Actual Concept): GRPO (Group Relative Policy Optimization) is an RL method that compares answers in a group and nudges the model toward higher-reward ones; a consistency-aware variant stabilizes training on small, medical datasets. How it works:
- Generate candidate answers.
- Score them with verifiable rewards (e.g., correct option letter).
- Update the policy to favor better-than-average samples within the batch. Why it matters: Without stable RL, small data can cause overfitting or support collapse. 🍞 Bottom Bread (Anchor): On OrganAMNIST, GRPO can raise first-try picks without wrecking the variety of plausible answers.
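At GRPO's core is a group-relative advantage: each sampled answer is scored against its group's average. A minimal sketch of that computation follows; the consistency-aware stabilization the authors add on top is not shown here:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: each sample's reward relative to the group
    mean, normalized by the group's standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:                       # all rewards equal: no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Two correct and two wrong answers in one group:
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# correct answers get positive advantage, wrong ones negative
```

Note that when every sample in a group is right (or every one is wrong), the advantage is zero, so GRPO only learns from questions where the model's samples disagree, which is precisely the "inside the boundary" regime.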
Step C: When and how to apply RL
- What happens: Train RL in three regimes: in-domain (same task), within-modality (related tasks), and cross-modality (different image types). Initialize RL either from the base model or from the SFT-bridged model.
- Why this step exists: To test whether RL gains depend on starting support (Pass@K) and how well they transfer.
- Example: From SFT, RL raises Accuracy@1 in-domain and in small shifts; from base with weak support, RL may reduce Pass@K.
🍞 Top Bread (Hook): Choosing whether to study more content or to practice test skills first.
🥬 Filling (The Actual Concept): Boundary-aware recipe: measure support; bridge with SFT if low; sharpen with RL if support is sufficient. How it works:
- Estimate Accuracy@1 and Pass@K on a small validation set.
- If Pass@K is below a target threshold, add proximal data and do SFT to raise support.
- If Pass@K is good, apply RL to increase Accuracy@1 and monitor that Pass@K doesn’t collapse. Why it matters: Without this rule, teams often apply RL too early or too late. 🍞 Bottom Bread (Anchor): Like learning chapters you’ve never read before (SFT) before drilling multiple-choice timing (RL).
Step D: From analysis to practice (PMC-VQA instantiation)
🍞 Top Bread (Hook): Balancing your study set so you don’t over-practice just one subject.
🥬 Filling (The Actual Concept): PMC-VQA is a large set of medical image Q&As; the authors pick a small, balanced multiple-choice subset across modalities to train RL. How it works:
- Use a larger model to auto-label the modality (e.g., MRI, CT, derm, OCT, microscopy) or none for ambiguous.
- Sample 8,000 balanced questions.
- Apply consistency-aware GRPO with verifiable multiple-choice rewards. Why it matters: Without balancing, RL might sharpen only one modality and hurt transfer. 🍞 Bottom Bread (Anchor): After balanced RL on top of OctoMed (already SFT-bridged), the model improves across six benchmarks.
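Balanced subset construction could look like the sketch below. The field names and toy pool are illustrative, and the paper samples 8,000 questions rather than this toy's 30:

```python
import random

def balanced_sample(questions, n_total, seed=0):
    """Sample an (at most) equal number of questions per modality label,
    sketching the paper's balanced RL subset construction."""
    rng = random.Random(seed)
    by_modality = {}
    for q in questions:
        by_modality.setdefault(q["modality"], []).append(q)
    per_mod = n_total // len(by_modality)
    subset = []
    for items in by_modality.values():
        subset.extend(rng.sample(items, min(per_mod, len(items))))
    return subset

pool = ([{"modality": "MRI", "id": i} for i in range(100)]
        + [{"modality": "CT", "id": i} for i in range(100)]
        + [{"modality": "derm", "id": i} for i in range(10)])
subset = balanced_sample(pool, 30)   # 10 MRI, 10 CT, 10 derm
```

Without this balancing, a modality that dominates the pool would also dominate the RL gradient, sharpening one image type at the expense of the rest.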
Data flow summary: Images + question → vision tower features + text → measure Accuracy@1 & Pass@K; probe vision tower; if Pass@K low, do SFT; if Pass@K good, do RL (GRPO) → higher Accuracy@1 with maintained support.
04 Experiments & Results
The tests: The authors evaluate three things on MedMNIST-v2 across radiology, microscopy, and visible-light photos: (1) vision quality via linear probing on frozen vision towers; (2) first-try accuracy (Accuracy@1) and hidden support (Pass@K) with multiple-choice prompting; (3) RL effectiveness when starting from base vs SFT-bridged models, tested in-domain, within-modality, and cross-modality.
The competition: They compare three families—Base VLM (Qwen2.5-VL-7B-Instruct), SFT-bridged (OctoMed-7B), and an RL-post-trained baseline (QoQ-Med-7B). For the final recipe, they further compare their RL-on-top-of-OctoMed model to strong public baselines on six medical VQA benchmarks.
The scoreboard with context:
- Linear probing: Many tasks show decent separability in the base vision tower; SFT improves it further, while RL does not consistently boost probe accuracy. Think: the camera (vision) is mostly fine; the issue is picking answers (language/decoding) in many cases. Some tasks remain below a strong vision-only model (MedViT-v2), signaling real perception bottlenecks that cap downstream gains.
- Accuracy@1 vs Pass@K: Across tasks, Accuracy@1 often trails far behind Pass@K. That’s like scoring a C on first tries while getting an A range if allowed multiple attempts—there’s hidden knowledge not being selected reliably. SFT lifts both metrics, implying broader coverage and better alignment. The RL baseline often fails to raise Accuracy@1 on MedMNIST and can reduce Pass@K, consistent with over-sharpening without added support.
- Targeted RL (GRPO) runs: When starting from SFT (non-trivial support), RL raises Accuracy@1 in-domain and on small within-modality shifts, narrowing the gap to Pass@K—like going from B to A on first-try tests because the right answers were already in the model’s shortlist. When starting from the base (weak support), RL gains are small and Pass@K can drop, especially under bigger shifts, meaning the model gets more confident but not more capable.
Surprising findings:
- RL can lower Pass@K even when it helps Accuracy@1—evidence of sharpening that prunes diversity. This is fine when there’s robust support but risky when support is thin.
- Vision towers in general-purpose VLMs already do a decent job on some medical tasks; failures there are often about alignment and selection rather than pure perception.
From analysis to practice: Applying the boundary-aware recipe, the authors start with OctoMed (already SFT-bridged), then do small-scale, modality-balanced RL on PMC-VQA multiple choice using consistency-aware GRPO. On six benchmarks (PMC, MMMU-medical slice, MedX-M, PathVQA, SLAKE, VQA-Rad), their 7B model attains the top average among comparable Qwen2.5-VL-based medical VLMs under the same decoding—akin to getting the best overall GPA across different subjects, not just excelling in one.
Big picture: The numbers tell a coherent story—SFT expands the reachable answers (support), RL converts that into better first-try performance (sampling efficiency), and the combination, used in the right order, yields stronger, more reliable medical VLMs.
05 Discussion & Limitations
Limitations:
- RL usually doesn’t expand support; it sharpens. If support is weak (low Pass@K), RL may make the model confidently wrong and reduce Pass@K.
- Transfer is uneven: in-domain and small within-modality shifts benefit most; cross-modality gains are limited without extra bridging data.
- Linear probing shows some perception bottlenecks remain; no amount of RL fixes a weak vision tower on hard modalities.
- Reward design depends on verifiable tasks (e.g., multiple choice). Open-ended reasoning and free-form report writing are harder to reward reliably.
Required resources:
- A reasonably strong base VLM with a competent vision tower.
- Domain SFT data to bridge support where needed.
- Small but balanced RL data with verifiable rewards, plus stability tricks (e.g., consistency-aware GRPO) to avoid collapse.
- Validation sets to measure Accuracy@1 and Pass@K and to monitor support during RL.
When NOT to use this approach:
- When Pass@K is low: prioritize collecting proximal data and SFT first; applying RL early can hurt.
- When tasks lack reliable automatic rewards; noisy rewards can misguide the policy.
- When cross-modality transfer is required without any bridging data; expect limited gains from RL alone.
Open questions:
- How to safely sharpen without shrinking support—can we design RL that explicitly preserves Pass@K while lifting Accuracy@1?
- Can we automatically detect modality gaps and suggest targeted bridging data collection?
- How to extend verifiable rewards beyond multiple choice to structured reasoning and report generation?
- What curriculum (order of modalities, difficulty) best raises support before RL?
- Can joint training with vision improvements (e.g., vision-side adaptation) reduce perception bottlenecks so that subsequent RL has more to sharpen?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that in medical VLMs, SFT tends to expand what answers the model can produce (support), while RL mainly sharpens choices so the first try is right more often. Measuring both Accuracy@1 and Pass@K reveals when there’s hidden capability to convert into reliability. Used in the right order—bridge then sharpen—RL delivers consistent gains without collapsing diversity.
Main achievement: A boundary-aware recipe that diagnoses support, bridges it with SFT when low, and applies RL to improve sampling efficiency only when support is sufficient—validated by strong average performance across six medical VQA benchmarks with a small, balanced RL dataset.
Future directions: Develop RL methods that explicitly protect Pass@K while raising Accuracy@1; integrate vision-side adaptation to reduce perception bottlenecks; design better verifiable rewards for free-form reasoning; and automate modality-aware data balancing.
Why remember this: RL isn’t a magic wand—it’s a spotlight. First stock the library (SFT), then use RL to quickly open the right page. This simple order-of-operations makes medical AI more reliable, efficient, and ready for real-world clinical support.
Practical Applications
- •Before any RL, compute Accuracy@1 and Pass@K on a small validation set to diagnose support.
- •If Pass@K is low, prioritize collecting proximal, modality-balanced data and run SFT to expand support.
- •If Pass@K is decent, apply RL with verifiable multiple-choice rewards to raise first-try accuracy.
- •Use consistency-aware GRPO (or similar) to stabilize RL on small medical datasets.
- •Balance RL training data across modalities to avoid overfitting sharpening to a single image type.
- •Monitor Pass@K during RL; stop or adjust if support begins to collapse.
- •Run linear probing on the vision tower to identify perception bottlenecks before language-side tuning.
- •For deployment, use the boundary-aware recipe as a checklist: measure → bridge (SFT) → sharpen (RL) → re-measure.
- •Prefer small within-modality transfers for RL benefits; plan extra SFT for larger or cross-modality shifts.