Large Multimodal Models as General In-Context Classifiers
Key Summary
- •People often pick CLIP-like models for image labeling, but this paper shows that large multimodal models (LMMs) can be just as good—or even better—when you give them a few examples in the prompt (in-context learning).
- •In closed-world tasks (pick a label from a known list), LMMs start weaker than CLIP, but with 16 example pairs they can catch up and sometimes win.
- •In open-world tasks (no fixed list of labels), LMMs are naturally a better fit, but they can stumble if the examples you feed them have messy or wrong labels.
- •The paper introduces CIRCLE, a simple, training-free loop that lets the model clean up its own example labels step by step, using the context as feedback.
- •CIRCLE turns noisy, unlabeled examples into helpful guidance and consistently beats CLIP-style open-world baselines across many datasets.
- •LMMs with in-context learning gain accuracy faster per extra example than CLIP adapters, showing better sample efficiency.
- •Naive context can hurt LMMs (random or poorly labeled examples can confuse them), which is why careful refinement like CIRCLE matters.
- •Across 10 diverse datasets, LMMs plus CIRCLE set a strong baseline for open-world classification without any extra training.
- •This work reframes LMMs as general-purpose, unified classifiers, reducing the need for separate specialized models.
- •The big idea: don’t just pick a model—shape its context; with the right examples and refinement, LMMs can shine at classification.
Why This Research Matters
This work shows that with the right examples, one flexible model can handle many labeling jobs without retraining. That means faster setup for new products (like a store adding a new category) and easier personalization (like adapting to a user’s photography style). In open-world situations—like social media uploads or robotics in the wild—CIRCLE helps models settle on the right level of detail and avoid generic or off-target labels. Because it’s training-free, teams can deploy improvements immediately by just curating or refining context, saving time and compute. It also reduces the need to maintain many separate, specialized models. Overall, it democratizes strong classification by shifting the power from heavy training to smart context design.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how you can figure out a new board game faster when someone shows you a few example turns? A couple of good examples make everything click.
🥬 Filling (The Actual Concept)
- What it is: This paper studies how large multimodal models (LMMs) can learn to classify images very well just from examples placed in their prompt (in-context learning), and introduces CIRCLE, a way to clean up those examples when labels aren’t given.
- How it works: The authors compare two families of models. First are Vision-Language Models (VLMs) like CLIP, which link images to short texts and excel at “pick a label from this list” tasks. Second are Large Multimodal Models (LMMs), chatty models that can look at an image and generate free-form answers. They test both on many datasets with and without a fixed set of labels, and they study what happens when you add a few examples to the prompt. Finally, they design CIRCLE, a loop that lets an LMM improve the example labels by using the context itself.
- Why it matters: If LMMs can classify well by just seeing a few examples at test time, people won’t always need custom-trained classifiers. That’s faster, cheaper, and more flexible in the real world.
🍞 Bottom Bread (Anchor) Imagine asking, “What dog breed is this?” If you show the model 16 mini flashcards (image + breed), LMMs jump from shaky guesses to very strong answers—often matching or beating CLIP.
🍞 Top Bread (Hook) Imagine two sports: one is darts (you must hit exact spots you’re told about), the other is treasure hunting (you must discover what matters as you go). Models face both kinds of tasks.
🥬 Filling (New Concepts in Sandwich Style)
- Closed-World Classification
- What it is: You pick one label from a known list (like 100 flower names).
- How it works: The model compares the image with each label and chooses the best match.
- Why it matters: It’s simple and fast, but only works if you already know all possible labels.
- Anchor: Picking the best of [“cat, dog, rabbit”] for a pet photo.
- Open-World Classification
- What it is: The model freely names what’s in the image without a fixed list (like “bird of paradise flower”).
- How it works: It generates text and must choose the right specificity (not too generic, not too detailed).
- Why it matters: This is how the real world looks—no one hands you a tidy label list.
- Anchor: Saying “red sports car, Ferrari 488” instead of just “car.”
- Vision-Language Models (VLMs)
- What it is: Models (like CLIP) that measure how well images match short texts.
- How it works: They put images and texts in the same space and pick the closest match.
- Why it matters: They’re great at fast, zero-shot closed-world picks.
- Anchor: Matching a food photo with “a photo of pancakes.”
- Large Multimodal Models (LMMs)
- What it is: Chatty models that can read images and write answers.
- How it works: They turn image patches into tokens, attend over them with text, and generate responses.
- Why it matters: They’re flexible and natural for open-world answers but need good guidance.
- Anchor: Explaining “This looks like a yellow raft used for river rafting.”
- In-Context Learning (ICL)
- What it is: Teaching by example right in the prompt, no training.
- How it works: You show a few (image, label) pairs, then ask for a new label; the model imitates the pattern.
- Why it matters: Saves time and adapts the model to your task on the fly.
- Anchor: Showing 8 dog-breed flashcards before asking about a new dog photo.
🍞 Bottom Bread (Anchor) Before this paper, most people said “Use CLIP for classification.” After, it’s more like “Use LMMs with a few good examples—and use CIRCLE for messy open-world cases.”
02 Core Idea
🍞 Top Bread (Hook) You know how a teacher sometimes says, “Let’s look at a few examples, then you try”? That tiny bit of context can turn a tough problem into an easy one.
🥬 Filling (The Actual Concept)
- The Aha! Moment in one sentence: LMMs become strong, general classifiers when you feed them a few smart examples in the prompt—and in open-world settings, a simple self-cleaning loop (CIRCLE) makes those examples reliable without any extra training.
- Multiple Analogies (3 ways):
- Recipe card: If you hand a chef two sample dishes, they can copy the style for a new dish; CIRCLE is the chef tasting and adjusting the samples until they’re just right.
- Study guide: A few solved problems unlock the pattern; CIRCLE is the student checking answers with friends and fixing mistakes.
- Treasure map: The examples are map clues; CIRCLE is re-drawing the map after spotting wrong clues, so you don’t get lost.
- Before vs After
- Before: LMMs looked worse than CLIP for closed-world picks, and they got confused in open-world because they didn’t know the right level of detail.
- After: With just 8–16 examples, LMMs catch up in closed-world and, with CIRCLE, they shine in open-world—no extra training needed.
- Why It Works (intuition, no equations here):
- Attention latches onto patterns you show it. Examples tell the LMM which visual details and words matter for your task (fine-grained vs general).
- In open-world, the hard part is deciding “how specific.” CIRCLE aligns the examples with each other, so the model settles on a consistent granularity.
- Iteration helps: if a pseudo-label is too vague (“bird”), the presence of other similar examples pushes it toward the needed detail (“sparrow”) across rounds.
- Building Blocks
- Closed-world: Compare CLIP (with cache adapters) vs LMM (with example pairs in prompt). LMMs leap in accuracy as examples grow.
- Open-world: Start with unlabeled images, let the LMM guess labels (pseudo-labels), then iteratively refine those labels by using the rest of the context as guidance.
- CIRCLE loop: Leave-one-out context → relabel → repeat a few rounds → use the final, cleaned context to answer the user’s query.
- Metrics that check both “Did it name the right thing?” and “Does the wording make sense?” ensure we reward both correctness and clarity.
🍞 Bottom Bread (Anchor) Show 16 car photos with their models, and suddenly the LMM stops saying “car” and starts saying “Aston Martin Virage Coupe.” In open-world, CIRCLE makes those example labels crisp and consistent first, so the final answer is trustworthy.
03 Methodology
🍞 Top Bread (Hook) Imagine you’re organizing a photo contest. If you show the judges a few perfect examples of each category, they judge better. If the examples have wrong captions, they get confused—unless they meet and fix the captions together.
🥬 Filling (The Actual Concept)
- High-level pipeline: Input image → Build or refine a context (few examples) → Ask the model to classify → Output label(s).
Step A: Closed-world with CLIP-like VLMs (Zero-shot and Tip-Adapter)
- What happens: For a known label list S, CLIP picks the label whose text is most similar to the image in a shared space.
- Why this step exists: It’s fast and strong for zero-shot when classes are known.
- Simple formula: ŷ = argmax_{y ∈ S} sim(f_img(x), f_txt(y)), where f_img and f_txt embed the image and each label's text into the shared space. Example: Suppose scores for three labels are 0.2 (cat), 0.8 (dog), 0.5 (car). The argmax picks dog (0.8).
- Few-shot refinement (Tip-Adapter): add a cache of k labeled examples per class; boost the score by image-to-image similarity. Formula: score(y) = sim(f_img(x), f_txt(y)) + α · Σ_{i: y_i = y} sim(f_img(x), f_img(x_i)). Example: Two classes A, B. Zero-shot scores: A=0.7, B=0.6. Let α = 0.1. Sum of similarities to A-examples is 2.0; to B-examples is 1.0. New scores: A = 0.7 + 0.1·2.0 = 0.9, B = 0.6 + 0.1·1.0 = 0.7. Pick A.
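The score blending in Step A can be sketched numerically. This is a minimal illustration only: `tip_adapter_scores`, the toy features, and `alpha = 0.1` are invented for the worked example above (the real Tip-Adapter applies an exponential affinity over cached CLIP features rather than raw similarity sums):

```python
import numpy as np

def tip_adapter_scores(zero_shot, query_feat, cache_feats, cache_labels, n_classes, alpha=0.1):
    """Blend zero-shot label scores with image-to-image cache similarity (Tip-Adapter style)."""
    sims = cache_feats @ query_feat          # similarity of the query to each cached example
    cache_boost = np.zeros(n_classes)
    for s, y in zip(sims, cache_labels):     # sum similarities per class
        cache_boost[y] += s
    return zero_shot + alpha * cache_boost

# Worked example from the text: zero-shot A=0.7, B=0.6; similarity sums 2.0 (A) vs 1.0 (B).
zero_shot = np.array([0.7, 0.6])
query = np.array([1.0, 0.0])                               # toy query "feature"
cache_feats = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])  # three cached examples
cache_labels = [0, 0, 1]                                   # two belong to A, one to B
scores = tip_adapter_scores(zero_shot, query, cache_feats, cache_labels, n_classes=2)
print(scores, scores.argmax())  # scores ≈ [0.9, 0.7]; argmax 0 -> class A wins
```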
Step B: Closed-world with LMMs (Vanilla ICL)
- What happens: Put k example (image, label) pairs into the prompt, then ask a multiple-choice question (MCQ) with the class list; the LMM generates the choice.
- Why this step exists: It teaches the LMM the labeling style and what visual details matter.
- Example with actual data: If you show 8 (image, “daisy”) and (image, “sunflower”) pairs, the LMM learns that round yellow centers map to “sunflower” and answers “sunflower” for a new similar photo.
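Step B's prompt assembly might look like the following sketch; `build_icl_prompt` and the `<image:...>` placeholder convention are invented here for illustration (a real LMM API would interleave actual image inputs with the text):

```python
def build_icl_prompt(examples, query_image, class_list):
    """Assemble a vanilla in-context prompt: k (image, label) pairs, then an MCQ for the query.
    Images are represented by placeholder tokens for readability."""
    parts = []
    for i, (img, label) in enumerate(examples, 1):
        parts.append(f"Example {i}: <image:{img}> Label: {label}")
    options = ", ".join(class_list)
    parts.append(f"Question: <image:{query_image}> Which label fits best? Options: {options}")
    return "\n".join(parts)

prompt = build_icl_prompt(
    examples=[("daisy_01.jpg", "daisy"), ("sunflower_07.jpg", "sunflower")],
    query_image="query.jpg",
    class_list=["daisy", "sunflower", "rose"],
)
print(prompt)
```

The point of the structure is that the example pairs come first, so the model sees the labeling style before the multiple-choice question.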
Step C: Open-world with LMMs (Pseudo-labeling and CIRCLE)
- What happens: Now there’s no fixed label list. We only have m unlabeled images as context. First, the LMM guesses a label (pseudo-label) for each image. Then CIRCLE refines those guesses by using the other examples as guidance, iteratively.
- Why this step exists: Naive pseudo-labels can be vague (“bird”) or list-like (“bird, plane, kite”). Refinement synchronizes specificity across examples.
- Leave-one-out refinement (CIRCLE): For each image x_j, use all the other (image, pseudo-label) pairs as context and ask again. Do this for all j in parallel, then repeat for a few rounds. Formula: y_j^(t+1) = LMM(x_j | C_{-j}^(t)), where C_{-j}^(t) is the context of all other images with their latest pseudo-labels. Example: Three images {A,B,C}. Round 0 guesses: A=“car (0.6), truck (0.4)”, B=“car (0.3), truck (0.7)”, C=“car (0.8), truck (0.2)”. Round 1, relabel B using {A,C}: both favor “car,” so B’s probabilities shift toward car (say, car 0.75, truck 0.25). Do the same for A and C. After 3 rounds, labels stabilize on “car.”
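The leave-one-out loop above can be sketched as below. This is illustrative, not the paper's implementation: `circle_refine`, the `toy_lmm` stand-in, and the early-stopping check are all assumptions here (in particular, a real LMM returns free-form text, not a majority vote over context labels):

```python
def circle_refine(images, lmm_label, rounds=3):
    """Leave-one-out pseudo-label refinement (CIRCLE sketch).
    `lmm_label(image, context)` stands in for an LMM call that labels `image`
    given `context`, a list of (image, pseudo_label) pairs."""
    labels = [lmm_label(img, context=[]) for img in images]   # round 0: zero-shot pseudo-labels
    for _ in range(rounds):
        new_labels = []
        for j, img in enumerate(images):
            # Context = everyone except image j, with their latest labels (parallel update).
            ctx = [(images[k], labels[k]) for k in range(len(images)) if k != j]
            new_labels.append(lmm_label(img, context=ctx))
        if new_labels == labels:                               # stop once labels stabilize
            break
        labels = new_labels
    return labels

# Toy stand-in "LMM": one image starts with a vague guess; with context,
# it adopts the majority label of its neighbors.
def toy_lmm(image, context):
    if not context:
        return "bird" if image == "img_d" else "sparrow"
    votes = [lbl for _, lbl in context]
    return max(set(votes), key=votes.count)

print(circle_refine(["img_a", "img_b", "img_c", "img_d"], toy_lmm))
# -> ['sparrow', 'sparrow', 'sparrow', 'sparrow']: the outlier is pulled to the consensus granularity
```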
Step D: Ask the final question
- What happens: After refinement, you present the query image plus the cleaned context to the LMM and ask, “What is in this image?”
- Why this step exists: The final, coherent context steers the LMM to the right specificity and format.
- Example: After CIRCLE, context labels become “Aston Martin Virage Coupe” instead of just “car,” so the model answers at the proper detail level.
Secret Sauce: Why this method is clever
- In-context signals are powerful but fragile; CIRCLE turns raw guesses into a consistent, self-agreed vocabulary before using them as teaching examples.
- No training needed: All improvements come from better context, not parameter updates.
- Sample efficiency: LMMs gain more per example than CLIP adapters in many datasets, making them strong when labeled examples are scarce.
🍞 Bottom Bread (Anchor) Think of a study group that first writes rough answers, then swaps papers to fix each other’s wording. After a few rounds, everyone’s answers match in style and detail—so the final test (your query) goes great.
04 Experiments & Results
🍞 Top Bread (Hook) If a soccer team practices with smart drills, they play better on game day. These experiments are the drills that showed LMMs how to win at classification.
🥬 Filling (The Actual Concept)
- The Test: The authors measured how well models name what’s in a photo across 10 datasets (like Caltech101, Flowers102, Stanford Cars). They tested closed-world (pick from a list) and open-world (free answer). They used metrics for correctness (Llama Inclusion, LI) and for how semantically close the wording is (SS, bCS, mCS).
- The Competition: CLIP variants (ViT-B/32, B/16, L/14) with Tip-Adapter vs LMMs (Qwen2-VL 7B, Qwen2.5-VL 7B, LLaVA OneVision 7B, Phi-3.5-Vision, Phi-4-Multimodal). All used the same number of examples when applicable (4, 8, 16 shots).
Closed-world highlights (pick from a list):
- LMMs start behind CLIP in zero-shot, but with more examples they surge. With 16 shots, strong LMMs match strong CLIP (e.g., Qwen2-VL 7B matches CLIP ViT-L/14).
- Biggest jumps: Phi-3.5-Vision improved by about +29.2% over its own zero-shot, and Qwen2-VL 7B by +17.7% with 16 shots.
- Sample efficiency: LMMs’ relative gains can exceed +50% on some datasets with 16 shots, while CLIP-B/32 peaked around +25%, roughly double the yield per example.
Open-world highlights (no list):
- Zero-shot: LMMs produced richer language but sometimes missed exact labels; CLIP-based retrieval baselines did well on similarity, but worse on strict inclusion.
- Naive ICL and naive pseudo-labeling can hurt (both correctness and semantic scores) because inconsistent or noisy context confuses the model.
- CIRCLE flips the script: After iterative refinement, LMMs consistently beat both naive ICL and CLIP baselines across correctness (LI) and semantic metrics (SS, mCS). For example, Qwen2-VL on prototypical tasks scored clearly above all alternatives with CIRCLE.
- Fine-grained wins: On very fine-grained sets (like aircraft and cars), CIRCLE pushed models from generic outputs (“airplane,” “car”) to precise categories (e.g., “MD-80,” “Aston Martin Virage Coupe”). Some LI scores nearly doubled with CIRCLE in the default reported setting (e.g., Phi-3.5-Vision on the very fine-grained sets).
Surprising findings:
- Random context can be harmful. For LMMs, irrelevant examples reduce accuracy a lot—context quality really matters.
- More examples help semantics: As shots increase from 4 to 16, SS and mCS rise notably; LI stays strong and stable, showing you gain detail without losing correctness.
- Iteration helps but saturates: More CIRCLE rounds improve results over plain pseudo-labeling, with diminishing returns after a few rounds.
Scoreboard in plain speak:
- Closed-world: With enough examples (≈16), LMMs can play in CLIP’s league and sometimes win.
- Open-world: LMMs are naturally suited here; with CIRCLE, they set a robust training-free baseline, beating CLIP-style retrieval methods on both naming the right thing and wording it well.
🍞 Bottom Bread (Anchor) Think of CIRCLE as a team huddle: everyone corrects their captions first. Then, when the real photo comes in, the team nails both the exact label and the right level of detail.
05 Discussion & Limitations
🍞 Top Bread (Hook) Even the best recipe can flop if you use the wrong ingredients or rush the steps. Let’s be honest about where this method may stumble and what’s needed to use it well.
🥬 Filling (The Actual Concept)
- Limitations
- Noisy or off-topic examples can mislead LMMs; naive ICL may even do worse than zero-shot.
- CIRCLE is training-free, so it can converge to labels that are consistent but not the ones you actually want (task misalignment), especially if the unlabeled pool is ambiguous.
- Iterative refinement and large contexts cost compute and memory; streaming setups can be heavy if you update often.
- Required Resources
- An LMM capable of image understanding, a prompt window large enough to pack in up to 16 examples, and enough GPU memory/time to run several refinement rounds.
- Optional: a small, cheap model (like CLIP-B/32) to help retrieve similar examples for better context.
- When NOT to Use
- If you have a perfectly labeled, fixed label set and need ultra-fast inference, a CLIP-style classifier may be simpler.
- If your unlabeled examples are from a totally different domain than your queries, CIRCLE may refine toward the wrong vocabulary.
- If you can’t afford multiple passes (extreme latency constraints), iterative refinement may be too slow.
- Open Questions
- Can lightweight tuning (e.g., adapters or LoRA) stabilize refinement without full training?
- How to auto-detect and remove bad examples during refinement?
- Can we compress or summarize context to cut cost while keeping quality?
- How to pick the “right specificity” automatically across wildly mixed datasets?
🍞 Bottom Bread (Anchor) If your examples are relevant and you can spare a few refinement rounds, CIRCLE turns a shaky study guide into a polished cheat sheet the model can reliably follow.
06 Conclusion & Future Work
🍞 Top Bread (Hook) You’ve seen how a few great examples can turn a tough puzzle into a clear picture. That’s the heart of this paper.
🥬 Filling (The Actual Concept)
- 3-sentence summary: The authors show that Large Multimodal Models, when fed a handful of in-context examples, become strong image classifiers, rivaling CLIP-like models in closed-world tasks. In open-world tasks, naive examples can mislead LMMs, so they introduce CIRCLE, a training-free loop that cleans up example labels by using the context itself. Across 10 datasets, CIRCLE delivers consistent gains in both correctness and semantic quality, setting a robust baseline for open-world classification.
- Main achievement: Reframing LMMs as unified, general-purpose classifiers and demonstrating that context engineering—especially via CIRCLE—can unlock their full discriminative power without any extra training.
- Future directions: Add light parameter-efficient tuning to stabilize refinement; smarter example selection and filtering; faster, memory-savvy context building for streaming.
- Why remember this: Don’t just pick a model—shape its context. With the right examples and a simple self-correcting loop, LMMs can handle both “pick from a list” and “name anything” problems with surprising strength.
🍞 Bottom Bread (Anchor) In practice, that means fewer custom models and more flexible systems: show the LMM a few good flashcards, let CIRCLE tidy them up, and you’re ready to classify the real world.
Practical Applications
- •E-commerce onboarding: Classify new products by showing a few example listings; use CIRCLE to refine noisy seller tags.
- •Digital asset management: Auto-label large photo libraries with a few hand-picked exemplars per category.
- •Wildlife monitoring: From a handful of species photos, recognize related animals in trail-cam images with open-world naming.
- •Quality control in manufacturing: Show a few defect examples; let CIRCLE refine defect names and catch subtle variants.
- •Medical imaging triage (with caution): Use a small set of curated cases to guide open-world descriptions for preliminary tagging.
- •News and social media moderation: Identify content types (e.g., protest, sports) via refined context without fixed taxonomies.
- •Robotics perception: Provide a few on-site examples so robots label tools and parts accurately in new environments.
- •Cultural heritage archiving: Refine granular labels for artworks or artifacts when standard vocabularies aren’t available.
- •Personal photo assistants: Learn a household’s own label style (e.g., “grandma’s garden flowers”) from a few examples.
- •Field research: Rapidly tag images in expeditions (flora, fauna, geology) using only unlabeled shots and CIRCLE to stabilize names.