Specificity-aware reinforcement learning for fine-grained open-world classification
Key Summary
- This paper teaches AI to name things in pictures very specifically (like "golden retriever" instead of just "dog") without making more mistakes.
- The trick is a new training method called SpeciaRL that uses reinforcement learning with a smart, changing reward.
- A separate judge AI checks each answer and sorts it into six categories from worst (Wrong) to best (More Specific).
- The model tries many answers per image; the best one in that group sets the bar for what counts as "good enough" specificity for that image.
- If the model can truly be specific on that image, it gets rewarded only for being that specific; if it can't, it isn't punished for staying safely general.
- This dynamic reward plugs into a popular RL method (GRPO) and runs efficiently during training.
- Trained only on birds, the method still improved other domains like flowers, food, pets, cars, and airplanes (out-of-domain generalization).
- It outperforms prompting, supervised fine-tuning, and standard RL in balancing correctness and specificity.
- It reaches a better overall score (harmonic mean) than competing methods on multiple datasets.
- Bottom line: encourage the model to be as specific as it can be, but not more specific than it actually knows.
Why This Research Matters
Better, more specific naming makes real systems more helpful. Shopping search improves when the AI says the exact product model, not just the category. Wildlife apps become trusted field guides by naming precise species when possible, and staying safely general when not. Safety systems avoid over-confident guesses because the reward scheme doesn't force impossible specificity. Support teams and accessibility tools can give clearer, more accurate descriptions for photos and videos. Finally, by training on one domain and helping many, SpeciaRL reduces data needs and supports broader, fairer deployment.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how a friend might say "That's a bird," but your bird-expert friend says, "That's a golden-winged warbler"? The second friend is more specific, and that's super helpful when you care about details.
🥬 Filling (The Actual Concept)
- What it is: This paper is about teaching vision-language AIs to give names that are both correct and as specific as possible in an open-world, fine-grained setting.
- How it works (story of the world before):
- Closed-world vs. open-world: Old image classifiers chose from a fixed list (closed-world). Real life is open-world: no fixed list, and the AI can answer with any word.
- Fine-grained is hard: Saying "bird" is easy; saying "golden-winged warbler" is hard. Many AIs play it safe and stay generic.
- New LMMs (Large Multimodal Models) can see pictures and use language to reason, but they still often answer too generically.
- Why it matters: If we only reward being more specific, models may guess wrong more often. If we only reward being correct, models may stay too generic. We need both.
🍞 Bottom Bread (Anchor) Imagine a shopping app that sees a shoe and says "shoe." Helpful? A little. But if it confidently says "Converse Chuck Taylor high-top," that's much better for search, pricing, and recommendations, provided it's correct.
🍞 Top Bread (Hook) Think of quizzes where you can write any answer. That's like open-world classification: no choices given, just your best words.
🥬 Filling (The Actual Concept)
- What it is: Open-world fine-grained classification means naming the main object with no preset label list and aiming for the most precise correct label.
- How it works (what was tried):
- Prompting: Telling the model "Be specific" increases specificity but also increases mistakes.
- Supervised fine-tuning: Training on "gold answers" can push specificity but often reduces correctness in new domains.
- Standard RL (reward only exact match): Encourages super specificity, but when the fine name isn't really known, the model guesses and gets more things wrong.
- Why it matters: A brittle push for specificity hurts reliability; a timid push for correctness keeps answers bland.
🍞 Bottom Bread (Anchor) If you ask for a dog breed and the model confidently says the wrong breed, that's worse than correctly saying "dog."
🍞 Top Bread (Hook) You know how students sometimes know the exact answer but don't always recall it on the first try? If they try a few times, one of their attempts might nail it.
🥬 Filling (The Actual Concept)
- What it is: The authors noticed the model actually does know many fine-grained names; it just doesn't always sample them on a single try.
- How it works:
- Try multiple generations per image (like multiple guesses).
- The best attempt of those gives a clue about the model's real capability on that sample.
- Use that best attempt as a moving target: reward answers that are at least that specific, without pushing beyond what the model currently can do.
- Why it matters: This avoids forcing unrealistic specificity that would cause wrong answers.
🍞 Bottom Bread (Anchor) For a flower the model can only name as "iris" (not the exact species), we don't punish it for not saying "Iris germanica." But when it can say "Iris germanica," we reward it for doing so.
🍞 Top Bread (Hook) Imagine a referee who doesn't just say "right" or "wrong," but also tells you if your answer is too general or nicely specific.
🥬 Filling (The Actual Concept)
- What it is: A separate language model acts as a judge, sorting predictions into six categories: Wrong, Abstain, Generic, Less Specific, Specific, and More Specific.
- How it works:
- The judge compares the model's answer with the ground-truth fine label.
- It assigns a category that reflects both correctness and how detailed the answer is.
- These categories power the training rewards and the evaluation metrics.
- Why it matters: Without a fine-grained judge, we can't teach or measure "specific but correct" well in an open world.
🍞 Bottom Bread (Anchor) If the label is "samoyed," then "dog" is Generic (correct but broad), "spitz" might be Less Specific (close parent), "samoyed" is Specific, and "poodle" is Wrong.
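The samoyed example can be made runnable with a toy stand-in for the judge. In the paper the judge is a real LLM; the hard-coded label hierarchy below is invented purely for illustration, and the More Specific category (an answer finer than the ground truth) is omitted because this toy truth has no finer level:

```python
# Toy judge for ONE ground-truth label ("samoyed"), just to illustrate the
# categories.  The real judge is an LLM; this lookup hierarchy is invented.
ANCESTORS = ["entity", "animal", "dog", "spitz", "samoyed"]  # general -> specific

def toy_judge(prediction, truth="samoyed"):
    pred = prediction.lower()
    if pred == "":
        return "Abstain"                     # no answer given
    if pred not in ANCESTORS:
        return "Wrong"                       # e.g., "poodle"
    if pred == truth:
        return "Specific"                    # exact fine label
    if pred == ANCESTORS[ANCESTORS.index(truth) - 1]:
        return "Less Specific"               # closest parent, e.g., "spitz"
    return "Generic"                         # broader ancestor, e.g., "dog"

print(toy_judge("dog"))      # Generic
print(toy_judge("spitz"))    # Less Specific
print(toy_judge("samoyed"))  # Specific
print(toy_judge("poodle"))   # Wrong
```

The point of the sketch is only that the judge's output is a rung on a ladder, not a yes/no verdict; the actual categorization logic lives inside a large language model.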
🍞 Top Bread (Hook) Think of combining two goals on a report card: be right and be detailed.
🥬 Filling (The Actual Concept)
- What it is: The paper evaluates two quantities, correctness and specificity, and combines them with the harmonic mean so a model must do well on both.
- How it works (with math you can feel):
- Correctness counts non-wrong answers.
- Specificity scores how informative non-wrong answers are.
- The harmonic mean punishes imbalance (you can't ace one and flunk the other).
- Why it matters: This aligns training and evaluation with what we actually want.
🍞 Bottom Bread (Anchor) A model that always says "animal" might be correct a lot, but it's not helpful. Another that guesses fancy breed names and is often wrong isn't helpful either. The best model is both right and as specific as it can be.
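The three report-card numbers can be sketched in a few lines of Python (an illustrative reading of the metrics, not the paper's evaluation code; the 1-4 category scores follow the A:1, G:2, S-:3, S/S+:4 scheme used in this article):

```python
# Illustrative sketch of the three evaluation scores.
# Judge categories: W=Wrong, A=Abstain, G=Generic, S-=Less Specific,
# S=Specific, S+=More Specific.
SPEC_SCORE = {"A": 1, "G": 2, "S-": 3, "S": 4, "S+": 4}

def correctness(categories):
    """Fraction of answers the judge did NOT mark Wrong."""
    return sum(c != "W" for c in categories) / len(categories)

def specificity(categories):
    """Average specificity score over the non-wrong answers."""
    non_wrong = [c for c in categories if c != "W"]
    return sum(SPEC_SCORE[c] for c in non_wrong) / len(non_wrong)

def harmonic_mean(spec, corr):
    """High only when BOTH inputs are high; punishes imbalance."""
    return 2 * spec * corr / (spec + corr)

cats = ["S", "G", "W", "S-"]   # four judged answers
print(correctness(cats))       # 3 of 4 are non-wrong -> 0.75
print(specificity(cats))       # (4 + 2 + 3) / 3 -> 3.0
```

Because specificity lives on a 1-4 scale while correctness lives on 0-1, the two are put on a common scale before combining; dividing specificity by its maximum of 4 is one simple choice (the paper's exact normalization may differ), giving `harmonic_mean(3.0 / 4, 0.75) = 0.75` here.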
02 Core Idea
🍞 Top Bread (Hook) Imagine playing basketball where your coach sets your challenge based on your best moves that day: if your layup is solid today, you're asked to nail layups; if your three-pointer is on fire, you're encouraged to shoot threes. You get pushed, but not past what you can really do.
🥬 Filling (The Actual Concept)
-
What it is (one sentence): SpeciaRL is a reinforcement learning method that gives a dynamic, sample-specific reward based on the best specificity the model can currently achieve, so it becomes as specific as it can be without becoming wrong.
-
Multiple Analogies:
- Schoolwork: If you can solve 2-step equations today, your homework targets 2-step problems, not calculus; tomorrow, if you're ready for 3 steps, your homework upgrades.
- Cooking: If you can reliably make pasta and sometimes perfect lasagna, your practice focuses on best-so-far lasagna quality; if not, you're rewarded for making pasta perfectly rather than burning the lasagna.
- Hiking: Trails are chosen to match your current peak performance so you're always improving but not overexerting to the point of injury.
-
Before vs After: • Before: Prompting "be specific" often increased mistakes; supervised fine-tuning and static RL rewards over-pushed specificity and hurt correctness. • After: A moving target per image (based on the model's own best attempt) rewards true specificity the model can handle and avoids punishing safe, correct answers when finer detail isn't yet in reach.
-
Why It Works (intuition):
- Best-of-N attempts reveal the model's hidden knowledge for that exact image.
- Rewarding answers that meet or exceed that discovered specificity nudges the model toward reliably producing its best-known detail.
- If the model can't yet be specific on a case, it isn't forced to guess wrongly, protecting correctness.
-
Building Blocks (each with a Sandwich): • 🍞 You know how a librarian files books from general to very specific topics? 🥬 The Category Ladder: Wrong ≺ Abstain ≺ Generic ≺ Less Specific ≺ Specific ≺ More Specific. The judge puts each answer on this ladder; higher is more informative. Without this ladder, the model can't tell whether "dog" vs. "samoyed" is better. 🍞 Example: For truth "golden retriever," "dog" (Generic), "retriever" (Less Specific), "golden retriever" (Specific).
• 🍞 Imagine you take 10 photos and pick your favorite. 🥬 Best-of-N (BoN): The model tries N answers for one image; the best category found becomes the reference for that image's reward. Without BoN, we can't know what the model is truly capable of on that sample. 🍞 Example: If among 10 tries the best is "retriever" (Less Specific), then "dog" (Generic) won't earn full credit; "golden retriever" (Specific) would.
• 🍞 Picture a friendly referee who knows your current top speed and judges you by it. 🥬 Dynamic Reward Rule: We set a per-image minimum category equal to the best attempt in the current group (with sensible edge cases). Predictions meeting or exceeding this bar get rewarded; wrong ones don't. Without a dynamic rule, rewards either push too hard or too little. 🍞 Example: If best attempt is Generic, Generic gets rewarded; if best is Specific, only Specific (or More Specific) gets rewarded.
• 🍞 Think of team practice where the best players in a drill set the standard and everyone learns from that round. 🥬 GRPO (Group Relative Policy Optimization): A popular RL method that compares multiple outputs in a group and updates the model toward higher-reward ones. Without GRPO, training would be slower or less stable. 🍞 Example: For one image, 10 answers compete; those matching the dynamic bar pull the model's policy in their direction.
-
Math Without Fear: • Correctness is the fraction of non-wrong answers: correctness = (# non-wrong) / (# total). Example: If 100 answers include 20 wrong ones, correctness = 80/100 = 0.8. • Specificity scores categories (A: 1, G: 2, S-: 3, S/S+: 4) and averages over non-wrong answers: specificity = (sum of scores) / (# non-wrong). Example: If 60 non-wrong answers have scores summing to 180, specificity = 180/60 = 3.0. • Balance both with the harmonic mean: HM = 2 × spec × corr / (spec + corr). Example: If spec = 0.6 (after scaling to [0, 1]) and corr = 0.9, HM = 2 × 0.6 × 0.9 / 1.5 = 0.72.
🍞 Bottom Bread (Anchor) After training with SpeciaRL, the model that used to say "airplane" now often says "Boeing 737," and the one that used to say "flower" now often says "Iris germanica," but it doesn't over-guess when it doesn't know, so correctness stays high.
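The Category Ladder, Best-of-N, and Dynamic Reward Rule fit together in a few lines of Python. This is a minimal sketch of my reading of the rule, not the authors' code; the S+ and W edge cases follow the Methodology section of this article:

```python
# Sketch of the Best-of-N reference category and the dynamic reward rule.
# Category codes: W=Wrong, A=Abstain, G=Generic, S-=Less Specific,
# S=Specific, S+=More Specific.
LADDER = ["W", "A", "G", "S-", "S", "S+"]   # worst -> best
RANK = {c: i for i, c in enumerate(LADDER)}

def reference_category(categories):
    """Per-image bar c*: the best category among the N sampled answers,
    with the stated edge cases (best S+ caps to S; best W falls back to A)."""
    best = max(categories, key=RANK.get)
    if best == "S+":
        return "S"
    if best == "W":
        return "A"
    return best

def reward(category, c_star):
    """1 if the answer is not Wrong and meets or exceeds the bar, else 0."""
    if category == "W":
        return 0
    return 1 if RANK[category] >= RANK[c_star] else 0

group = ["G", "S-", "G", "A", "S"]          # judge categories for 5 rollouts
c_star = reference_category(group)          # -> "S"
print([reward(c, c_star) for c in group])   # only the "S" answer earns 1
```

Notice the safety property: if the best the group produced was only Generic, a Generic answer already earns the reward, so the model is never punished for staying safely general on an image it cannot yet name finely.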
03 Methodology
🍞 Top Bread (Hook) Imagine a science fair project where you run several tests, have a judge grade each test from basic to brilliant, then you study the best result and practice to consistently reach that level.
🥬 Filling (The Actual Concept)
- High-level overview: Input image → Generate N answers (policy model) → Judge sorts each answer (specificity ladder) → Set per-image bar based on the best answer → Give rewards to answers that meet/exceed that bar → Update the model with GRPO → Output a more specific yet correct classifier over time.
Step-by-Step (with Sandwich mini-explanations):
-
🍞 You know how you might brainstorm multiple titles for a story before choosing one? 🥬 Generate multiple predictions: The policy model (e.g., Qwen2.5VL-7B) produces N open-ended answers for each image. Without multiple tries, we can't discover the current best specificity for that image. 🍞 Anchor: For a cat image with truth "Birman," the model might output: "cat," "long-haired cat," "Birman," "ragdoll," etc.
-
🍞 Think of a teacher who grades on both correctness and detail. 🥬 LLM-as-a-judge categorization: A strong language model compares each prediction to the fine-grained ground truth and assigns a category: Wrong (W), Abstain (A), Generic (G), Less Specific (S-), Specific (S), More Specific (S+). Without this sorting, rewards can't reflect informative detail. 🍞 Anchor: If truth is "golden retriever," then "dog"→G, "retriever"→S-, "golden retriever"→S, "poodle"→W.
-
🍞 Imagine picking your best throw in a set to set your personal challenge. 🥬 Best-of-N (BoN): Choose the prediction with the highest category for that image among the N outputs; this is the per-image capability signal. Without BoN, we can't set a sensible, fair target. 🍞 Anchor: If the best among 10 answers is "retriever" (S-), the bar shouldn't be higher than S- for this image.
-
🍞 Think of a video game that rewards you for matching or beating your previous high score on that level. 🥬 Dynamic reference category c*: Set c* to the best category found; fix edge cases (if best is S+, use S; if best is W, use A). Predictions with a category at or above c* get reward 1; wrong ones get 0. Without dynamics, the model could be over- or under-challenged. 🍞 Anchor: If best is G, then G/S-/S/S+ get rewarded; if best is S, only S/S+ do.
-
🍞 Team practice time: better plays guide the team. 🥬 GRPO update: For each image's group of N answers, we update the policy to increase the likelihood of rewarded answers. Without GRPO, we wouldn't effectively learn from group comparisons. 🍞 Anchor: Answers that meet the bar pull the model toward reliably reproducing them next time.
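How rewarded answers "pull" the policy can be felt through GRPO's group-relative advantage. The sketch below is a simplification: real GRPO also uses token-level likelihood ratios, clipping, and a KL penalty, all omitted here.

```python
# Simplified sketch of GRPO's group-relative advantage signal.
# Each of the N rollouts for one image gets reward 0 or 1 from the dynamic
# rule; GRPO standardizes these within the group to form advantages.
def group_advantages(rewards, eps=1e-8):
    """Reward of each rollout, standardized against its own group.
    Positive advantage -> the update pushes the policy toward that answer;
    negative -> away from it."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One image, five rollouts; only the third met the per-image bar:
adv = group_advantages([0, 0, 1, 0, 0])
print([round(a, 2) for a in adv])   # the third is positive, the rest negative
```

One design consequence worth noticing: if every rollout gets the same reward (all met the bar, or none did), every advantage is zero and that image contributes no gradient. The dynamic bar helps keep groups informative by placing the target where the group's answers actually split.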
Key Math (with simple numbers):
- Correctness: correctness = (# non-wrong) / (# total). Example: For 200 images with 30 wrong, correctness = 170/200 = 0.85.
- Specificity score setup: Assign A = 1, G = 2, S- = 3, S/S+ = 4. Example: If you got 10 A's, 15 G's, 5 S-'s, and 20 S's (no wrong), total score = 10×1 + 15×2 + 5×3 + 20×4 = 135.
- Specificity averaging: specificity = (total score) / (# non-wrong). Example: With 50 non-wrong and total score 135, specificity = 135/50 = 2.7.
- Harmonic mean: HM = 2 × spec × corr / (spec + corr). Example: If spec = 0.6 (scaled to [0, 1]) and corr = 0.9, HM = 1.08/1.5 = 0.72.
- BoN definition: If the N predictions for image i are p_1, ..., p_N and the judge gives categories c_1, ..., c_N, then the BoN pick is the one with the highest category. Example: If categories are [G, S-, G, A, S], then BoN is S (the best).
- Dynamic reward: For an image with best category c*, reward r = 1 if the prediction's category is at or above c*; otherwise r = 0. Example: If c* = S-, then S-/S/S+ get 1; G/A/W get 0.
Concrete Data Walkthrough:
- Suppose the truth is "Cessna 172." The model outputs N=5 answers: ["aircraft" (G), "Cessna" (S-), "Cessna 172" (S), "Boeing 737" (W), "airplane" (G)]. The judge categories: [G, S-, S, W, G].
- BoN is S ("Cessna 172"). So c* = S. Reward answers with S or S+. Here, only "Cessna 172" gets 1; others get 0.
- GRPO uses this signal to push the model toward reliably saying "Cessna 172."
Secret Sauce (why this is clever):
- It personalizes the challenge per image based on what the model can actually do right now.
- It maximizes specificity without tipping into wrong guesses.
- It's efficient: BoN uses the same N outputs GRPO already generates, so no extra runs.
🍞 Bottom Bread (Anchor) After cycles of this training, for a food image with truth "Greek salad," the model more often says "Greek salad" (Specific) rather than just "salad" (Generic), but won't over-guess a wrong dish name when it's unsure.
04 Experiments & Results
🍞 Top Bread (Hook) Think of a spelling bee team trained only on bird names, then tested on flowers, foods, pets, cars, and planes. Can they still do great? Surprisingly, yes, if they learned to be confidently specific without guessing.
🥬 Filling (The Actual Concept)
-
The Test (what and why):
- Datasets: Fine-grained (Flowers102, Food101, OxfordPets) and very fine-grained (StanfordCars, FGVCAircraft).
- Out-of-domain training: The model is trained on a small bird dataset (CUB, 3k samples) and then tested on other domains to check generalization.
- Metrics: Correctness, specificity, and their harmonic mean (HM) to reflect the balance.
- Judge: A large LLM categorizes answers for fair, fine-grained scoring.
-
The Competition (baselines): • Zero-shot LMMs (e.g., Qwen2.5VL, InternVL2.5), a retrieval approach (CaSED), prompting "Be specific," supervised fine-tuning (SFT), and standard reinforcement fine-tuning (RFT) with static rewards.
-
The Scoreboard (simple context): • Fine-grained average: SpeciaRL achieved the top HM where others trade one goal against the other, like getting an A when others hover between B and B+. • Very fine-grained: SpeciaRL keeps a strong HM, staying competitive or better than training-based baselines. • Prompting "Be specific" raises specificity but also increases wrong answers, like speeding up and missing turns. • Static-reward RFT pushes hard toward exact labels and can hurt correctness out-of-domain; SpeciaRL keeps correctness healthier while gaining specificity.
-
A Few Numbers (intuition, not full tables): • Harmonic Mean (HM) example: Suppose SpeciaRL has specificity 0.6 and correctness 0.9 (both on a 0-1 scale). Then HM = 2 × 0.6 × 0.9 / 1.5 = 0.72. If a baseline has specificity 0.8 but correctness only 0.5, HM = 0.8/1.3 ≈ 0.62. So SpeciaRL's balance is clearly stronger. • Best-of-64 (upper bound): When trying 64 times per image at inference (expensive), the base model gets much higher specificity and correctness, proving it knows more than it shows. SpeciaRL moves the trained model closer to this potential without 64-try inference.
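The Best-of-64 upper bound is easy to feel with a toy calculation. The numbers below are made up, and the formula assumes the N samples are independent, which real sampling only approximates: if a single try surfaces the exact name with probability p, at least one of N tries does so with probability 1 − (1 − p)^N.

```python
# Toy illustration of why Best-of-N reveals hidden knowledge.
# Assumes independent samples; p_single is a made-up per-try probability.
def p_best_of_n(p_single, n):
    """Chance that at least one of n samples is the fine-grained name."""
    return 1 - (1 - p_single) ** n

# A model that says the exact species on only 5% of single tries
# still produces it in over 96% of 64-sample groups:
print(round(p_best_of_n(0.05, 1), 3))   # -> 0.05
print(round(p_best_of_n(0.05, 64), 3))  # -> 0.962
```

This is exactly the gap SpeciaRL aims to close: instead of paying for 64 tries at inference, training pulls the model's occasional best answer into its single try.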
-
Surprising Findings:
- The base reasoning model actually knows many fine-grained labels but tends to default to generic answers on a single try.
- When trained with dynamic rewards, the model's thinking becomes more goal-oriented: the final answer better reflects the fine clues already noted in its own reasoning.
- Training on a single domain (birds) still improved other domains (flowers/food/pets/cars/planes), meaning the method boosts general reasoning habits, not just memorized names.
-
Why It Matters (in results terms): • SpeciaRL often increases both specificity and correctness vs. the base model in fine-grained tests. • Compared to SFT and static-reward RFT, SpeciaRL achieves a better correctness-specificity balance (higher HM), which is exactly the practical goal in open-world naming. • On broader benchmarks that use inclusion and similarity measures, SpeciaRL matches or exceeds state-of-the-art reasoning LMMs across multiple metrics, confirming its robustness.
🍞 Bottom Bread (Anchor) Imagine image search: after SpeciaRL, searching for a "Cessna 172" pulls the right plane more often than just any "airplane," while still avoiding over-confident wrong guesses. That's better for users and systems alike.
05 Discussion & Limitations
🍞 Top Bread (Hook) If you always try to guess the exact bird species, you might be wrong a lot. If you always say just "bird," you're not very helpful. The trick is knowing how far you can go for each picture.
🥬 Filling (The Actual Concept)
-
Limitations: • The judge isn't perfect: An LLM-judge can sometimes mis-categorize. Training is fairly robust to small errors, but very noisy judges would hurt. • Compute needs: RL training with multiple rollouts and a judge model requires GPUs and time, though the method is optimized to reuse rollouts. • Ceiling set by current ability: The dynamic bar matches what the model can do now. If the base knowledge is missing, SpeciaRL won't invent it; it nudges expression, not encyclopedic memory. • Very fine-grained edge cases: In ultra-subtle categories (cars by trim/year), specificity pushes sometimes lead to occasional wrong specifics; careful tuning helps.
-
Required Resources: • A capable multimodal base model, a reliable LLM-judge, and an RL training loop (e.g., GRPO framework). • A modest, well-labeled training set (here, 3k birds) is enough to shift behavior.
-
When NOT to Use: • If you already have a strict closed set of labels and exact matching is required, simpler closed-set methods or static verifiable rewards might suffice. • If you lack any ground-truth labels, even small ones, reward shaping becomes hard. • If compute is extremely constrained, multi-rollout RL may be impractical.
-
Open Questions: • Can we combine dynamic rewards with knowledge expansion (e.g., retrieval) to raise the model's true ceiling? • How to auto-tune N (rollouts) and the RL optimizer for best stability and speed? • Can we improve the judge via self-consistency or multi-judge voting to further reduce noise? • How does this extend to multi-object scenes and attributes (color, part, material) simultaneously?
🍞 Bottom Bread (Anchor) Think of SpeciaRL as a careful coach: it won't teach new facts out of thin air, but it gets the most precise, correct answers out of the knowledge your model already has, one image at a time.
06 Conclusion & Future Work
🍞 Top Bread (Hook) You know how the best teachers don't just say "be more detailed"? They help you be as detailed as you can be today and a little more tomorrow.
🥬 Filling (The Actual Concept)
-
3-Sentence Summary:
- This paper introduces SpeciaRL, a reinforcement learning method that trains vision-language models to be as specific as they can be while staying correct in open-world naming.
- A judge model categorizes each prediction by correctness and specificity, and the best answer among several tries sets a dynamic per-image bar for rewards.
- This improves the balance of specificity and correctness across multiple fine- and very fine-grained datasets, even when training only on a different domain.
-
Main Achievement: Showing that a dynamic, verifier-based reward anchored to the model's own best attempt per image can reliably steer models toward truly achievable specificity without sacrificing correctness.
-
Future Directions: • Mix in retrieval or external knowledge to lift the model's ceiling so the dynamic bar can rise over time. • Explore multi-object and attribute-rich scenes; extend the judge to handle multiple labels and relationships. • Automate hyperparameter choices (like rollouts N) and test other on-policy RL variants for even better stability.
-
Why Remember This: SpeciaRL changes the question from "Be specific at all costs?" to "Be as specific as you can, correctly." That simple shift leads to smarter, safer, and more useful open-world classifiers.
Practical Applications
- E-commerce image search that recognizes exact models or styles without over-guessing.
- Wildlife identification apps that name species when confident and stay general when not.
- Content moderation that specifies object types precisely to apply the right policy.
- Robotics perception that tags parts and tools at the right level of detail for safer actions.
- Medical pre-screening tools that describe findings precisely while avoiding risky guesses.
- Photo library organization that auto-labels albums with specific, correct categories.
- Inventory and warehouse vision that identifies item variants and SKUs reliably.
- News and sports media tagging that distinguishes fine-grained entities (teams, models).
- Accessibility descriptions that provide detailed yet dependable captions.
- Quality control in manufacturing, naming defect types or component variants correctly.