Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
Key Summary
- This paper builds a new test called Ref-Adv to check whether AI can truly match tricky sentences to the right thing in a picture.
- Older tests were too easy because sentences were very short, pictures had few lookalike objects, and extra clues let models take shortcuts.
- Ref-Adv pairs real photos with careful sentences that include just enough clues (often with negation like "not the one…") and many hard distractors that look similar to the target.
- A two-stage, LLM-assisted pipeline first finds what makes objects similar and different, then writes short but sufficient expressions; humans triple-check each example.
- Modern multimodal AIs that score over 90% on classic benchmarks drop to roughly 50-60% on Ref-Adv, showing real gaps in visual reasoning.
- Bag-of-words tests (scrambling word order) and descriptor-deletion tests both hurt accuracy more on Ref-Adv, showing that it really requires careful language understanding.
- More distractors mean lower scores, and Chain-of-Thought helps more on Ref-Adv than on older benchmarks, highlighting its heavier reasoning demand.
- A smaller public slice, Ref-Adv-s (1,142 cases), enables fair and reproducible comparisons across models.
- This benchmark aims to guide future AI toward honest, step-by-step grounding instead of shallow shortcut tricks.
Why This Research Matters
In everyday life, instructions are often short but precise, and they depend on checking each word; Ref-Adv tests whether AI can handle exactly that. For accessibility, more trustworthy visual grounding means safer navigation and clearer scene descriptions for visually impaired users. In homes and factories, robots that can pick "the blue screw, not the black one" reduce costly mistakes. In education and search, better grounding helps organize photos and videos by the exact item you mean, not rough guesses. In medicine and science, precise language-to-region grounding supports finer annotations and fewer labeling errors. And for safety, a benchmark that discourages shortcuts pushes the field toward AIs that explain and verify their choices instead of guessing.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you and a friend are looking at a busy photo of a classroom. Your friend says, "The kid in the front row wearing a red hoodie, not the one with glasses." You don't just look for "kid"; you check the front row, hoodie color, and whether they have glasses. That's real, step-by-step thinking.
Referring Expression Comprehension (REC):
- What it is: REC is the job of finding exactly which region in an image a sentence is talking about.
- How it works: (1) Read the sentence. (2) List the clues (like color, position, action). (3) Scan the image for candidates. (4) Match each clue to each candidate. (5) Pick the one that fits all clues.
- Why it matters: Without REC, an AI might know there's a "dog" but not which dog you mean, so it can't follow precise instructions. Anchor: In a photo with three dogs, the sentence "the small tan dog under the table" should point to the right one, not any dog.
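The five-step matching loop above can be sketched in a few lines of Python. This is a toy illustration with hypothetical attribute sets, not the paper's method: real systems score soft visual features rather than matching exact strings.

```python
# Toy REC matcher: a candidate wins only if it satisfies EVERY clue,
# and only if exactly one candidate does (otherwise the expression is ambiguous).
def ground(clues, candidates):
    """Return the unique candidate that satisfies all clues, or None."""
    matches = [c for c in candidates if all(clue in c["attributes"] for clue in clues)]
    return matches[0] if len(matches) == 1 else None

dogs = [
    {"id": 1, "attributes": {"dog", "large", "black", "on sofa"}},
    {"id": 2, "attributes": {"dog", "small", "tan", "under table"}},
    {"id": 3, "attributes": {"dog", "small", "white", "under table"}},
]

print(ground({"dog", "small", "tan", "under table"}, dogs)["id"])  # -> 2
print(ground({"dog"}, dogs))  # -> None (three dogs match, so "dog" alone is ambiguous)
```

The second call shows why category-only clues fail once distractors exist, which is exactly the pressure Ref-Adv applies.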
Hook: You know how some people are good at reading stories and others are good at reading maps? Some AIs can do both at once.
Multimodal Large Language Models (MLLMs):
- What it is: MLLMs are AIs that understand and use both pictures and words together.
- How it works: (1) A vision encoder turns images into features. (2) A language model reads text. (3) A bridge aligns both so clues from words can be checked against image parts.
- Why it matters: Without MLLMs, we'd need separate tools for words and pictures, and they wouldn't coordinate well. Anchor: When you ask an AI "Which slice is the burnt toast on the left?", an MLLM can use both the photo and the text to box the right toast.
Hook: Solving a Where's Waldo page means carefully comparing small details, not just spotting "a person."
Visual Reasoning:
- What it is: Visual reasoning is thinking with what you see: comparing shapes, positions, and attributes to make a correct choice.
- How it works: (1) Propose likely objects. (2) Check visual clues (color, size, pose). (3) Compare similar candidates. (4) Eliminate mismatches. (5) Select the best match.
- Why it matters: Without visual reasoning, an AI grabs the first lookalike and gets fooled by similar objects. Anchor: Picking "the cup that's empty in the center of the pizzas" requires noticing that it's empty and centered, not just spotting "a cup."
Hook: Reading a recipe in order matters: "add eggs then flour" is different from "flour then eggs."
Textual Reasoning:
- What it is: Textual reasoning is understanding word meanings and their order so the full sentence makes sense.
- How it works: (1) Parse who/what/where/not. (2) Track relationships (e.g., "next to," "behind"). (3) Respect order and negations ("not the one with…"). (4) Combine all clues logically.
- Why it matters: Without textual reasoning, the AI may miss "not," mix up locations, or treat all words as equal. Anchor: "The man in the background jumping to return the tennis ball" isn't just "man jumping"; the background location and the action matter too.
Hook: If a class has only one cat drawing on the wall, you don't need many clues to find it, even if the sentence gives extra details.
Grounding Shortcuts:
- What it is: Grounding shortcuts are easy tricks that let a model guess the right object using only part of the clues.
- How it works: (1) Use the object category alone. (2) Ignore word order. (3) Match a single standout attribute. (4) Win by luck because there aren't many distractors.
- Why it matters: Shortcuts inflate scores but don't prove real understanding. Anchor: If there's only one "mouse" in the image, "the DELL mouse connected to the laptop" can be solved by just spotting the only mouse, not by reading "DELL" or "connected."
Hook: Playing spot-the-difference is harder when the pictures are almost the same.
Distractors and Hard Distractors:
- What it is: Distractors are same-category objects that are not the target; hard distractors are those that match many, but not all, of the clues.
- How it works: (1) Find all objects of the same type. (2) Identify lookalikes that share several features with the target. (3) Use just-enough clues to separate the true target from these lookalikes.
- Why it matters: Without hard distractors, models don't need to check all clues; they can guess. Anchor: Two nearly identical glasses, where only one is "less full and closer to the corner," force careful checking of fullness and location.
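One simple way to make the "hard" in hard distractor concrete is to count how many target attributes a same-category distractor shares. This is a sketch with hypothetical attribute sets, not the paper's similarity model:

```python
# Rank same-category distractors by how many target attributes they share:
# the more shared attributes, the harder the distractor is to rule out.
def hardness(target_attrs, distractor_attrs):
    """Number of attributes shared with the target."""
    return len(target_attrs & distractor_attrs)

target = {"glass", "half full", "near corner"}
distractors = [
    {"glass", "full", "center"},                # shares only "glass": easy
    {"glass", "half full", "far from corner"},  # shares 2 clues: hard
]
ranked = sorted(distractors, key=lambda d: hardness(target, d), reverse=True)
print([hardness(target, d) for d in ranked])  # -> [2, 1]
```

The hardest distractor differs from the target in just one clue ("near corner" vs. "far from corner"), so only that clue can separate them.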
The world before: Classic REC datasets like RefCOCO/+/g gave models many easy wins because expressions were often very short (about 3 words) and images had few distractors, so models could succeed with category guesses or single-attribute matches. The problem: These settings didn't require solid textual reasoning or strong visual comparisons, so high scores didn't mean deep understanding. Failed attempts: Some newer datasets made expressions extremely long (90+ words), which felt unnatural and still allowed shortcuts because the number of clues far exceeded the number of similar objects. Others switched away from the classic "one image, one target" setup or used rigid templates that felt less like real language. The gap: We needed a benchmark with natural language, the classic setup, enough similar distractors, and just-enough clues to demand true reasoning (including negation). Real stakes: Better REC means safer robots ("hand me the blue cup, not the mug"), clearer photo assistants ("tag the dog behind the couch, not the one on it"), and more reliable accessibility tools (correctly describing which person is doing what) in everyday life.
02 Core Idea
Hook: Think of a treasure hunt where the clues are short but perfectly chosen so only one spot fits, even though many spots look similar.
The Aha! Moment:
- What it is: Ref-Adv pairs real photos with concise, minimally sufficient expressions and deliberately includes hard distractors, so every clue must be checked to find the one true target.
- How it works: (1) Ensure multiple same-category objects. (2) Identify the most similar pair. (3) Elicit just-enough distinguishing attributes. (4) Write a short, natural expression (often using negation). (5) Human-verify that it can't be solved by shortcuts.
- Why it matters: Without this design, models can score high by using partial clues; with it, they must truly read and reason. Anchor: "The less full glass closer to the corner rather than further from it" works only if the model checks both fullness and location among multiple similar glasses.
Three analogies:
- Detective lens: Older datasets were like a case with one main suspect. Ref-Adv invites identical twins and cousins into the lineup, and the clues are just enough to pick the right one.
- Lock-and-key: Each expression is a key that fits only one lock; hard distractors are near-miss locks that make quick jiggling fail.
- Recipe precision: Short but exact steps ("stir 10 seconds, not 30; add salt, not sugar") force you to follow the whole recipe; skipping any step ruins the dish.
Before vs. After:
- Before: Average 3-word expressions, few distractors, and word order that often didn't matter; models aced benchmarks but may not have truly understood.
- After: About 11.5-word expressions, ~4 distractors on average, explicit negation (~21% of cases), and provable sensitivity to word order and descriptor completeness; models now drop to roughly 50-60% Acc@0.5, revealing real gaps.
Why it works (intuition, no equations):
- If many lookalikes exist, the model can't rely on category-only guesses. If expressions include just-enough clues (sometimes via negation, like "not the one with a metal frame"), the model must align each textual clue with the image. Removing word order or deleting a clue now breaks the match more often, which is evidence that full reasoning is required.
Building blocks (the idea in pieces):
- Hook: Finding your friend in a crowd is easier if you know just the right clues.
- Discriminators (group vs. in-pair):
- What it is: Group discriminators separate the similar pair from the rest; in-pair discriminators separate the two lookalikes.
- How it works: (1) Split candidates into A (the most similar two) and B (the others). (2) List attributes that distinguish A from B (group-level). (3) List attributes that distinguish the two objects inside A (in-pair). (4) Compose a sentence using one of each.
- Why it matters: Without this split, expressions get bloated or ambiguous, enabling shortcuts. Anchor: "The person without a necklace (A vs. B), wearing sunglasses (in-pair)."
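The composition step, one group-level plus one in-pair discriminator, can be sketched as a tiny template. The function name and phrasing templates here are hypothetical illustrations; the paper generates free-form natural language with an LLM rather than filling a fixed template:

```python
# Compose a minimal expression from one group-level and one in-pair
# discriminator, optionally negating the in-pair clue to exclude the lookalike.
def compose(category, group_disc, in_pair_disc, negate_in_pair=False):
    tail = f"not the one {in_pair_disc}" if negate_in_pair else in_pair_disc
    return f"the {category} {group_disc}, {tail}"

print(compose("person", "without a necklace", "wearing sunglasses"))
# -> the person without a necklace, wearing sunglasses
print(compose("sofa", "in a modern style", "with a metal frame", negate_in_pair=True))
# -> the sofa in a modern style, not the one with a metal frame
```

Using exactly one discriminator of each kind keeps the expression short while still uniquely identifying the target among both the crowd and its near-twin.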
- Hook: Saying "not the one with stripes" can be clearer than listing every other detail.
- Negation:
- What it is: A way to rule out a near lookalike by saying what the target is not.
- How it works: (1) Spot an attribute that only the hard distractor has. (2) Say "not the one with [that]." (3) Combine it with a positive clue.
- Why it matters: Without negation, you might need too many words; with it, short expressions stay precise. Anchor: "The modern sofa, not the one with a metal frame."
- Hook: If you scramble the words in a sentence, it's harder to understand.
- Sensitivity checks (anti-shortcut tests):
- What it is: Bag-of-words (scrambled order) and descriptor-deletion tests probe whether models truly need all the clues.
- How it works: (1) Shuffle the word order and see whether accuracy drops more. (2) Delete one descriptor and see whether accuracy drops more.
- Why it matters: Bigger drops mean the dataset genuinely requires reading and reasoning. Anchor: Changing "the small tan dog under the table" to "under tan table dog small the" should hurt performance if order matters, as it does in Ref-Adv.
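The bag-of-words perturbation is easy to reproduce: keep the same multiset of words, destroy only their order. A minimal sketch (the seeding choice is an assumption for reproducibility, not from the paper):

```python
import random

# Bag-of-words perturbation: scramble word order while keeping every word,
# so any accuracy drop is attributable to lost sentence structure alone.
def bag_of_words(expression, seed=0):
    words = expression.split()
    rng = random.Random(seed)  # fixed seed so each expression scrambles the same way
    rng.shuffle(words)
    return " ".join(words)

original = "the small tan dog under the table"
scrambled = bag_of_words(original)
# Same words, different order: a model that truly parses structure
# should do worse on the scrambled version.
print(sorted(scrambled.split()) == sorted(original.split()))  # -> True
```

Running every benchmark expression through this function and re-evaluating gives the order-sensitivity gap the paper reports.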
03 Methodology
At a high level: Image + instances → Filter for many same-category objects → Tag candidates → Find the most similar pair (group A) vs. the others (group B) → Elicit group and in-pair discriminators → Compose short, sufficient sentences (with optional negation) → Human verification → Final Ref-Adv pair.
Hook: Think of setting up a fair puzzle: enough pieces to be challenging, but not a chaotic mess.
Input Preparation:
- What it is: Choosing images and preparing candidates.
- How it works: (1) Sample real images from COCO and OpenImages with panoptic instance annotations. (2) Keep images with at least three same-category candidates to ensure real distractor pressure. (3) Overlay unobtrusive number tags on candidates (similar to Set-of-Marks) for internal reference during curation.
- Why it matters: Without enough same-category objects, the puzzle is too easy: the model could guess from the category alone. Anchor: Pick a street photo with 5 similar cars instead of one lonely car.
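The "at least three same-category candidates" filter amounts to a one-line check over an image's instance labels. A sketch assuming a flat list of category names per image (the real annotations are richer panoptic records):

```python
from collections import Counter

# Keep an image only if some category appears at least `min_count` times,
# so category-only guessing cannot single out the target.
def has_distractor_pressure(instance_categories, min_count=3):
    return any(n >= min_count for n in Counter(instance_categories).values())

print(has_distractor_pressure(["car", "car", "car", "person"]))  # -> True
print(has_distractor_pressure(["car", "person", "dog"]))         # -> False
```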
Hook: When two people look almost the same, you first say how they differ from the crowd, then how they differ from each other.
Similarity Judgment (Group A vs. Group B):
- What it is: A step where an LLM (e.g., GPT-4o) identifies the most similar pair and gathers distinguishing attributes.
- How it works: (1) Partition candidates into group A (the most similar two: target + hard distractor) and group B (the rest). (2) Generate exactly two group-level discriminators (A vs. B). (3) Generate four in-pair discriminators (two noticeable, two unnoticeable) that separate the two objects in A.
- Why it matters: Without this structure, expressions become overspecified or ambiguous, enabling shortcuts. Anchor: For three people, you might note A vs. B: "short hair" vs. "long hair," then in-pair: "sunglasses" vs. "no sunglasses."
Hook: Clear, short directions beat long rambles, as long as they still uniquely point to one answer.
Referring Expression Generation (Minimal but Sufficient):
- What it is: Composing natural sentences that use one group-level and one in-pair discriminator.
- How it works: (1) Apply one of two strategies: use the target's positive attributes, or negate the distractor's attributes. (2) Encourage variety in phrasing while banning mentions of the number tags. (3) Provide the image again at this stage to improve accuracy and naturalness.
- Why it matters: Without minimal sufficiency, expressions become too long and let models ignore many clues; with it, every clue counts. Anchor: "The cup that is empty (in-pair) and positioned at the center of the pizzas (group)."
Hook: Even great assistants make mistakes; that's why editors exist.
Human Verification (3-annotator agreement):
- What it is: A final correctness and clarity check.
- How it works: (1) Annotators try grounding on the original image (no number tags). (2) They see the ground truth afterward to reflect and finalize. (3) A pair is kept only if all three agree it is correct, unambiguous, and includes hard distractors. (4) The keep rate for LLM-authored examples is about 18.7% (highly selective).
- Why it matters: Without this step, subtle errors or ambiguities could slip in and weaken the benchmark. Anchor: If two annotators find the wrong chair, the example is rejected.
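The unanimous-agreement keep rule is simple to state in code. This sketch uses a hypothetical record format (boxes as tuples, exact-match agreement); in practice annotators draw boxes and agreement would be judged by overlap:

```python
# Keep rule: an example survives only if all three annotators
# independently ground it to the correct target.
def keep(annotator_boxes, ground_truth, agree=lambda a, b: a == b):
    return len(annotator_boxes) == 3 and all(agree(b, ground_truth) for b in annotator_boxes)

gt = (10, 20, 50, 60)
print(keep([gt, gt, gt], gt))            # -> True (unanimous: keep)
print(keep([gt, gt, (0, 0, 5, 5)], gt))  # -> False (one disagreement: reject)
```

Requiring all three to agree, rather than a majority, is what makes the pipeline so selective (about 18.7% of LLM-authored candidates survive).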
Hook: "Not the one with…" can be a lifesaver when two items look the same.
Negation and Diversity:
- What it is: Purposeful use of "not" to exclude a near-twin, plus multiple paraphrases.
- How it works: (1) Identify a feature that only the hard distractor has (e.g., a metal frame). (2) Say "not the one with a metal frame." (3) Mix it with a positive clue to keep the expression short and decisive.
- Why it matters: Without negation, expressions might bloat or fail to cleanly separate lookalikes. Anchor: "The modern sofa, not the one with a metal frame."
Hook: Good tests check that guessing tricks don't work.
Anti-Shortcut Quality Checks:
- Bag-of-Words Test:
- What it is: Scramble the word order and see whether accuracy drops more on Ref-Adv.
- How it works: Replace "a red ball with yellow stripes" with "with yellow red ball stripes a" and evaluate the models.
- Why it matters: If order doesn't matter, the model isn't really reading. Anchor: On Ref-Adv, models drop more than on older datasets, showing that order matters here.
- Descriptor-Deletion Test:
- What it is: Remove one descriptor and re-evaluate.
- How it works: Extract descriptors with a strong model, delete one, rewrite the sentence, and see whether accuracy falls.
- Why it matters: If accuracy doesn't drop, the clue was unnecessary, which is a shortcut warning. Anchor: On Ref-Adv, accuracy falls more when a single clue is removed, so each clue counts.
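The deletion step can be sketched as a tiny helper. In the paper a strong model extracts the descriptors and rewrites the sentence; here the descriptor list is given by hand as a hypothetical example:

```python
# Build every one-descriptor-removed ablation of an expression's clue list.
# Re-evaluating a model on each ablation shows whether every clue was needed.
def delete_descriptor(descriptors, index):
    """Return the descriptor list with the clue at `index` removed."""
    return descriptors[:index] + descriptors[index + 1:]

descriptors = ["small", "tan", "under the table"]
ablations = [delete_descriptor(descriptors, i) for i in range(len(descriptors))]
print(ablations)
# -> [['tan', 'under the table'], ['small', 'under the table'], ['small', 'tan']]
```

If accuracy drops on every ablation, each clue was doing real work, which is what Ref-Adv's minimally sufficient expressions are designed to guarantee.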
- Fixed-Prompt Bias Test:
- What it is: Replace expressions with the generic "the one" to detect training-set bias.
- How it works: Keep the image, ask for a box with almost no text guidance, and measure accuracy.
- Why it matters: Lower scores on Ref-Adv suggest it resists dataset-specific biases. Anchor: Models do better with "the one" on older datasets than on Ref-Adv, so Ref-Adv is harder to game.
Hook: A measuring stick should scale from easy to hard without breaking.
Evaluation Setup and Metrics:
- What it is: Testing many modern MLLMs with fair prompts and clear metrics.
- How it works: (1) Evaluate 13+ MLLMs (Qwen, InternVL, Gemini, GPT-4o, Claude, GLM, CogVLM). (2) Use Acc@IoU at thresholds 0.5/0.75/0.9 and mean accuracy. (3) Include Chain-of-Thought (CoT) where supported; use Set-of-Marks (SoM) for models with limited grounding ability.
- Why it matters: Without common metrics and careful prompting, comparisons aren't meaningful. Anchor: A model that gets 54% Acc@0.5 on Ref-Adv may score over 90% on RefCOCO, but the new test uncovers real reasoning gaps.
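Acc@IoU is the standard grounding metric: a prediction counts as correct when its box overlaps the ground truth with intersection-over-union above a threshold. A minimal self-contained sketch:

```python
# Accuracy at an IoU threshold: the fraction of predictions whose box
# overlaps the ground-truth box with IoU >= thresh.
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def acc_at(preds, gts, thresh=0.5):
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts   = [(0, 0, 10, 10), (5, 5, 15, 15)]  # second prediction only partially overlaps
print(acc_at(preds, gts, 0.5))  # -> 0.5
```

Raising the threshold to 0.75 or 0.9, as Ref-Adv also reports, demands progressively tighter boxes from the same predictions.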
Secret sauce (what's clever):
- Intentionally pairing just-enough language with hard distractors forces models to check every word. The two-stage LLM pipeline prevents overspecification, and the triple human check keeps only cases that truly demand both textual and visual reasoning.
04 Experiments & Results
Hook: It's like graduating from a spelling bee (easy words) to a debate (hard thinking). Scores naturally drop, but now they mean more.
The Test (What and why?):
- What it is: Measure how well models ground concise, demanding expressions in real images with many hard distractors.
- How it works: Evaluate multiple MLLMs on Ref-Adv using accuracy at IoU thresholds (0.5/0.75/0.9) and mean accuracy. Also run ablations: bag-of-words (scrambled order), one-descriptor deletion, and the fixed "the one" prompt.
- Why it matters: If models only succeed when text is simple and images are uncluttered, we haven't measured true reasoning. Anchor: A model that struggles when words are scrambled or a single clue is removed is showing it needed those details, which is exactly the point.
The Competition (Compared to what?):
- What it is: The classic benchmarks RefCOCO, RefCOCO+, and RefCOCOg.
- How it works: Compare model accuracy on those vs. Ref-Adv under the same IoU rules. Analyze by distractor count and with/without Chain-of-Thought.
- Why it matters: A big drop on Ref-Adv shows the old tests were too easy or shortcut-prone. Anchor: Qwen2.5-VL-72B scores over 90% on the classic sets but about 54% Acc@0.5 on Ref-Adv.
The Scoreboard (with context):
- Headline: Despite strong classic scores, state-of-the-art models fall to roughly 50-60% Acc@0.5 on Ref-Adv. For instance, Qwen2.5-VL-72B reaches about 54.1% at IoU 0.5.
- CoT helps more here: Chain-of-Thought generally improves performance on Ref-Adv (where reasoning is heavier) but is less helpful, or even harmful, on older datasets with short expressions and few distractors.
- Distractors bite: Accuracy drops as the similar-object count rises, especially at 7+ distractors, confirming the benchmark's pressure.
- Bias resistance: With the fixed prompt "the one," models score higher on the old sets than on Ref-Adv, suggesting Ref-Adv is less exploitable by dataset biases.
- Order and sufficiency matter: Bag-of-words (orderless text) reduces accuracy more on Ref-Adv than on the classics, and removing a single descriptor hurts more too, which is evidence that models must read carefully and use every clue. Anchor: On Ref-Adv-s (1,142 public cases), smaller models lag, and "thinking" variants beat "instruct" variants at the same sizes, consistent with the benchmark's reasoning demand.
Surprising Findings:
- Even very large MLLMs often pick the hard distractor instead of the true target, showing perception or language-integration slips mid-reasoning.
- Some models gain from CoT on Ref-Adv but not on older datasets, aligning with the idea that thinking steps help only when the task truly needs them.
- Normalized vs. absolute coordinate outputs and specialized prompts matter less than the core difficulty injected by hard distractors and minimal expressions; every model is challenged. Anchor: In qualitative examples, models lay out a sensible plan ("find all glasses," "compare fullness," "check corner proximity") yet still box the wrong glass; tiny visual differences now matter a lot.
05 Discussion & Limitations
Hook: A good obstacle course reveals strengths and weaknesses, but it's not the whole world.
Limitations:
- Data source scope: Ref-Adv uses COCO and OpenImages photos; diversity is strong but still bounded by those datasets.
- LLM-authored expressions: Despite the strict two-stage pipeline and triple human checks, a small fraction could still contain subtle redundancy or bias.
- Bounding box focus: The benchmark evaluates box grounding, not full segmentation or 3D understanding.
- Naturalness vs. minimality: Short, sufficient expressions are natural, but some users may prefer even richer narratives; the balance is a design choice.
Required Resources:
- For evaluation: Access to MLLMs (open or proprietary), the evaluation scripts, and the Ref-Adv/Ref-Adv-s data. A GPU is helpful but not mandatory for API-based models.
- For curation replication: Panoptic annotations, LLM APIs, and human annotators for verification.
When NOT to Use:
- If you need long storytelling or dialogue-heavy, multi-turn grounding, Ref-Adv's short, minimal expressions may not match your use case.
- If you need pixel-exact segmentation or 3D pose reasoning, a segmentation or 3D benchmark is a better fit.
- If you want category detection without instance-level disambiguation, object detection benchmarks suffice.
Open Questions:
- How can models be trained to remain robust as distractors scale from 3 to 10+ while staying efficient?
- What architectures best fuse negation and relational cues with fine-grained visual details?
- Can we design training that directly penalizes shortcut behavior and rewards all-clue reasoning?
- How can Ref-Adv be extended to video, 3D scenes, or interactive robotics without re-introducing shortcuts? Anchor: Think of the next version as adding moving scenes, hands-on interaction, and even more lookalikes, without losing the "just-enough clues" spirit.
06 Conclusion & Future Work
Three-sentence summary: Ref-Adv is a new benchmark that pairs real photos with short but sufficient sentences and many hard distractors, so AI must truly use every clue. Compared to classic datasets where models score above 90%, the same models drop to about 50-60% on Ref-Adv, showing that older tests allowed shortcuts. Careful ablations (word-order scrambling, descriptor deletion, and bias checks) confirm that Ref-Adv genuinely demands both textual and visual reasoning.
Main achievement: The paper delivers a rigorously curated, bias-resistant REC benchmark, with the public Ref-Adv-s split and unified evaluation code, that reliably exposes reasoning gaps in modern MLLMs.
Future directions: Train models with anti-shortcut objectives, expand to video/3D/robotics settings, and explore richer negation and relational language. Investigate architectures and training recipes (e.g., thinking modes, grounding-specific rewards) that improve fine-grained perception together with language logic.
Why remember this: Ref-Adv raises the bar from "can you guess?" to "did you truly understand?", giving the community a clearer compass for building trustworthy multimodal AI that follows instructions precisely in the messy, lookalike-heavy real world.
Practical Applications
- Evaluate new multimodal models on Ref-Adv-s to detect shortcut reliance before deployment.
- Train with negation-rich, hard-distractor data to improve robustness against lookalikes.
- Use a descriptor-completeness loss (penalizing missing or ignored clues) to reduce shortcut behavior.
- Add Chain-of-Thought prompts during inference on hard grounding tasks to boost accuracy.
- In robotics, test grasping policies on Ref-Adv-like scenes to ensure precise object selection.
- For accessibility apps, validate that phrases with negation ("not the one…") are grounded correctly.
- In retail image search, require models to disambiguate among many similar products using minimal cues.
- Adopt the two-stage discriminator pipeline for creating internal QA datasets with minimal but sufficient expressions.
- Integrate descriptor-deletion checks into evaluation dashboards to monitor reliance on every clue.
- Use fixed-prompt bias tests ("the one") to detect dataset bias before claiming state-of-the-art results.