Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
Key Summary
- This paper builds a new test called Ref-Adv to check whether AI can truly match tricky sentences to the right thing in a picture.
- Older tests were too easy because sentences were very short, pictures had few lookalike objects, and extra clues let models take shortcuts.
- Ref-Adv pairs real photos with careful sentences that include just enough clues (often with negation like "not the one…") and many hard distractors that look similar to the target.
- A two-stage, LLM-assisted pipeline first finds what makes objects similar and different, then writes short but sufficient expressions; humans triple-check each example.
- Modern multimodal AIs that score over 90% on classic benchmarks drop to roughly 50-60% on Ref-Adv, showing real gaps in visual reasoning.
- Bag-of-words tests (scrambling word order) and descriptor-deletion tests both hurt accuracy more on Ref-Adv, showing that it really requires careful language understanding.
- More distractors mean lower scores, and Chain-of-Thought helps more on Ref-Adv than on older benchmarks, highlighting its heavier reasoning demand.
- A smaller public slice, Ref-Adv-s (1,142 cases), enables fair and reproducible comparisons across models.
- This benchmark aims to guide future AI toward honest, step-by-step grounding instead of shallow shortcut tricks.
Why This Research Matters
In everyday life, instructions are often short but precise, and they depend on checking each word; Ref-Adv tests whether AI can handle exactly that. For accessibility, more trustworthy visual grounding means safer navigation and clearer scene descriptions for visually impaired users. In homes and factories, robots that can pick "the blue screw, not the black one" reduce costly mistakes. In education and search, better grounding helps organize photos and videos by the exact item you mean, not rough guesses. In medicine and science, precise language-to-region grounding supports finer annotations and fewer labeling errors. And for safety, a benchmark that discourages shortcuts pushes the field toward AIs that explain and verify their choices instead of guessing.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you and a friend are looking at a busy photo of a classroom. Your friend says, "The kid in the front row wearing a red hoodie, not the one with glasses." You don't just look for "kid"; you check the front row, hoodie color, and whether they have glasses. That's real, step-by-step thinking.
Referring Expression Comprehension (REC):
- What it is: REC is the job of finding exactly which region in an image a sentence is talking about.
- How it works: (1) Read the sentence. (2) List the clues (like color, position, action). (3) Scan the image for candidates. (4) Match each clue to each candidate. (5) Pick the one that fits all clues.
- Why it matters: Without REC, an AI might know there's a "dog" but not which dog you mean, so it can't follow precise instructions. Anchor: In a photo with three dogs, the sentence "the small tan dog under the table" should point to the right one, not any dog.
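The five-step matching loop above can be sketched in a few lines of Python. This is a toy illustration with hypothetical attribute sets, not the paper's method: real systems score soft visual features rather than matching exact strings.

```python
# Toy REC matcher: a candidate wins only if it satisfies EVERY clue,
# and only if exactly one candidate does (otherwise the expression is ambiguous).
def ground(clues, candidates):
    """Return the unique candidate that satisfies all clues, or None."""
    matches = [c for c in candidates if all(clue in c["attributes"] for clue in clues)]
    return matches[0] if len(matches) == 1 else None

dogs = [
    {"id": 1, "attributes": {"dog", "large", "black", "on sofa"}},
    {"id": 2, "attributes": {"dog", "small", "tan", "under table"}},
    {"id": 3, "attributes": {"dog", "small", "white", "under table"}},
]

print(ground({"dog", "small", "tan", "under table"}, dogs)["id"])  # -> 2
print(ground({"dog"}, dogs))  # -> None (three dogs match, so "dog" alone is ambiguous)
```

The second call shows why category-only clues fail once distractors exist, which is exactly the pressure Ref-Adv applies.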
Hook: You know how some people are good at reading stories and others are good at reading maps? Some AIs can do both at once.
Multimodal Large Language Models (MLLMs):
- What it is: MLLMs are AIs that understand and use both pictures and words together.
- How it works: (1) A vision encoder turns images into features. (2) A language model reads text. (3) A bridge aligns both so clues from words can be checked against image parts.
- Why it matters: Without MLLMs, we'd need separate tools for words and pictures, and they wouldn't coordinate well. Anchor: When you ask an AI "Which slice is the burnt toast on the left?", an MLLM can use both the photo and the text to box the right toast.
Hook: Solving a Where's Waldo page means carefully comparing small details, not just spotting "a person."
Visual Reasoning:
- What it is: Visual reasoning is thinking with what you see: comparing shapes, positions, and attributes to make a correct choice.
- How it works: (1) Propose likely objects. (2) Check visual clues (color, size, pose). (3) Compare similar candidates. (4) Eliminate mismatches. (5) Select the best match.
- Why it matters: Without visual reasoning, an AI grabs the first lookalike and gets fooled by similar objects. Anchor: Picking "the cup that's empty in the center of the pizzas" requires noticing that it's empty and centered, not just spotting "a cup."
Hook: Reading a recipe in order matters: "add eggs then flour" is different from "flour then eggs."
Textual Reasoning:
- What it is: Textual reasoning is understanding word meanings and their order so the full sentence makes sense.
- How it works: (1) Parse who/what/where/not. (2) Track relationships (e.g., "next to," "behind"). (3) Respect order and negations ("not the one with…"). (4) Combine all clues logically.
- Why it matters: Without textual reasoning, the AI may miss "not," mix up locations, or treat all words as equal. Anchor: "The man in the background jumping to return the tennis ball" isn't just "man jumping"; the background location and the action matter too.
Hook: If a class has only one cat drawing on the wall, you don't need many clues to find it, even if the sentence gives extra details.
Grounding Shortcuts:
- What it is: Grounding shortcuts are easy tricks that let a model guess the right object using only part of the clues.
- How it works: (1) Use the object category alone. (2) Ignore word order. (3) Match a single standout attribute. (4) Win by luck because there aren't many distractors.
- Why it matters: Shortcuts inflate scores but don't prove real understanding. Anchor: If there's only one "mouse" in the image, "the DELL mouse connected to the laptop" can be solved by just spotting the only mouse, not by reading "DELL" or "connected."
Hook: Playing spot-the-difference is harder when the pictures are almost the same.
Distractors and Hard Distractors:
- What it is: Distractors are same-category objects that are not the target; hard distractors are those that match many, but not all, of the clues.
- How it works: (1) Find all objects of the same type. (2) Identify lookalikes that share several features with the target. (3) Use just-enough clues to separate the true target from these lookalikes.
- Why it matters: Without hard distractors, models don't need to check all clues; they can guess. Anchor: Two nearly identical glasses, where only one is "less full and closer to the corner," force careful checking of fullness and location.
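One simple way to make the "hard" in hard distractor concrete is to count how many target attributes a same-category distractor shares. This is a sketch with hypothetical attribute sets, not the paper's similarity model:

```python
# Rank same-category distractors by how many target attributes they share:
# the more shared attributes, the harder the distractor is to rule out.
def hardness(target_attrs, distractor_attrs):
    """Number of attributes shared with the target."""
    return len(target_attrs & distractor_attrs)

target = {"glass", "half full", "near corner"}
distractors = [
    {"glass", "full", "center"},                # shares only "glass": easy
    {"glass", "half full", "far from corner"},  # shares 2 clues: hard
]
ranked = sorted(distractors, key=lambda d: hardness(target, d), reverse=True)
print([hardness(target, d) for d in ranked])  # -> [2, 1]
```

The hardest distractor differs from the target in just one clue ("near corner" vs. "far from corner"), so only that clue can separate them.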
The world before: Classic REC datasets like RefCOCO/+/g gave models many easy wins because expressions were often very short (about 3 words) and images had few distractors, so models could succeed with category guesses or single-attribute matches. The problem: These settings didn't require solid textual reasoning or strong visual comparisons, so high scores didn't mean deep understanding. Failed attempts: Some newer datasets made expressions extremely long (90+ words), which felt unnatural and still allowed shortcuts because the number of clues far exceeded the number of similar objects. Others switched away from the classic "one image, one target" setup or used rigid templates that felt less like real language. The gap: We needed a benchmark with natural language, the classic setup, enough similar distractors, and just-enough clues to demand true reasoning (including negation). Real stakes: Better REC means safer robots ("hand me the blue cup, not the mug"), clearer photo assistants ("tag the dog behind the couch, not the one on it"), and more reliable accessibility tools (correctly describing which person is doing what) in everyday life.
02 Core Idea
Hook: Think of a treasure hunt where the clues are short but perfectly chosen so only one spot fits, even though many spots look similar.
The Aha! Moment:
- What it is: Ref-Adv pairs real photos with concise, minimally sufficient expressions and deliberately includes hard distractors, so every clue must be checked to find the one true target.
- How it works: (1) Ensure multiple same-category objects. (2) Identify the most similar pair. (3) Elicit just-enough distinguishing attributes. (4) Write a short, natural expression (often using negation). (5) Human-verify that it can't be solved by shortcuts.
- Why it matters: Without this design, models can score high by using partial clues; with it, they must truly read and reason. Anchor: "The less full glass closer to the corner rather than further from it" works only if the model checks both fullness and location among multiple similar glasses.
Three analogies:
- Detective lens: Older datasets were like a case with one main suspect. Ref-Adv invites identical twins and cousins into the lineup, and the clues are just enough to pick the right one.
- Lock-and-key: Each expression is a key that fits only one lock; hard distractors are near-miss locks that make quick jiggling fail.
- Recipe precision: Short but exact steps ("stir 10 seconds, not 30; add salt, not sugar") force you to follow the whole recipe; skipping any step ruins the dish.
Before vs. After:
- Before: Average 3-word expressions, few distractors, and word order that often didn't matter; models aced benchmarks but may not have truly understood.
- After: About 11.5-word expressions, ~4 distractors on average, explicit negation (~21% of cases), and provable sensitivity to word order and descriptor completeness; models now drop to roughly 50-60% Acc@0.5, revealing real gaps.
Why it works (intuition, no equations):
- If many lookalikes exist, the model can't rely on category-only guesses. If expressions include just-enough clues (sometimes via negation, like "not the one with a metal frame"), the model must align each textual clue with the image. Removing word order or deleting a clue now breaks the match more often, which is evidence that full reasoning is required.
Building blocks (the idea in pieces):
- Hook: Finding your friend in a crowd is easier if you know just the right clues.
- Discriminators (group vs. in-pair):
- What it is: Group discriminators separate the similar pair from the rest; in-pair discriminators separate the two lookalikes.
- How it works: (1) Split candidates into A (the most similar two) and B (the others). (2) List attributes that distinguish A from B (group-level). (3) List attributes that distinguish the two objects inside A (in-pair). (4) Compose a sentence using one of each.
- Why it matters: Without this split, expressions get bloated or ambiguous, enabling shortcuts. Anchor: "The person without a necklace (A vs. B), wearing sunglasses (in-pair)."
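The composition step, one group-level plus one in-pair discriminator, can be sketched as a tiny template. The function name and phrasing templates here are hypothetical illustrations; the paper generates free-form natural language with an LLM rather than filling a fixed template:

```python
# Compose a minimal expression from one group-level and one in-pair
# discriminator, optionally negating the in-pair clue to exclude the lookalike.
def compose(category, group_disc, in_pair_disc, negate_in_pair=False):
    tail = f"not the one {in_pair_disc}" if negate_in_pair else in_pair_disc
    return f"the {category} {group_disc}, {tail}"

print(compose("person", "without a necklace", "wearing sunglasses"))
# -> the person without a necklace, wearing sunglasses
print(compose("sofa", "in a modern style", "with a metal frame", negate_in_pair=True))
# -> the sofa in a modern style, not the one with a metal frame
```

Using exactly one discriminator of each kind keeps the expression short while still uniquely identifying the target among both the crowd and its near-twin.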
- Hook: Saying "not the one with stripes" can be clearer than listing every other detail.
- Negation:
- What it is: A way to rule out a near lookalike by saying what the target is not.
- How it works: (1) Spot an attribute that only the hard distractor has. (2) Say "not the one with [that]." (3) Combine it with a positive clue.
- Why it matters: Without negation, you might need too many words; with it, short expressions stay precise. Anchor: "The modern sofa, not the one with a metal frame."
- Hook: If you scramble the words in a sentence, it's harder to understand.
- Sensitivity checks (anti-shortcut tests):
- What it is: Bag-of-words (scrambled order) and descriptor-deletion tests probe whether models truly need all the clues.
- How it works: (1) Shuffle the word order and see whether accuracy drops more. (2) Delete one descriptor and see whether accuracy drops more.
- Why it matters: Bigger drops mean the dataset genuinely requires reading and reasoning. Anchor: Changing "the small tan dog under the table" to "under tan table dog small the" should hurt performance if order matters, as it does in Ref-Adv.
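The bag-of-words perturbation is easy to reproduce: keep the same multiset of words, destroy only their order. A minimal sketch (the seeding choice is an assumption for reproducibility, not from the paper):

```python
import random

# Bag-of-words perturbation: scramble word order while keeping every word,
# so any accuracy drop is attributable to lost sentence structure alone.
def bag_of_words(expression, seed=0):
    words = expression.split()
    rng = random.Random(seed)  # fixed seed so each expression scrambles the same way
    rng.shuffle(words)
    return " ".join(words)

original = "the small tan dog under the table"
scrambled = bag_of_words(original)
# Same words, different order: a model that truly parses structure
# should do worse on the scrambled version.
print(sorted(scrambled.split()) == sorted(original.split()))  # -> True
```

Running every benchmark expression through this function and re-evaluating gives the order-sensitivity gap the paper reports.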
03 Methodology
At a high level: Image + instances → Filter for many same-category objects → Tag candidates → Find the most similar pair (group A) vs. the others (group B) → Elicit group and in-pair discriminators → Compose short, sufficient sentences (with optional negation) → Human verification → Final Ref-Adv pair.
Hook: Think of setting up a fair puzzle: enough pieces to be challenging, but not a chaotic mess.
Input Preparation:
- What it is: Choosing images and preparing candidates.
- How it works: (1) Sample real images from COCO and OpenImages with panoptic instance annotations. (2) Keep images with at least three same-category candidates to ensure real distractor pressure. (3) Overlay unobtrusive number tags on candidates (similar to Set-of-Marks) for internal reference during curation.
- Why it matters: Without enough same-category objects, the puzzle is too easy: the model could guess from the category alone. Anchor: Pick a street photo with 5 similar cars instead of one lonely car.
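The "at least three same-category candidates" filter amounts to a one-line check over an image's instance labels. A sketch assuming a flat list of category names per image (the real annotations are richer panoptic records):

```python
from collections import Counter

# Keep an image only if some category appears at least `min_count` times,
# so category-only guessing cannot single out the target.
def has_distractor_pressure(instance_categories, min_count=3):
    return any(n >= min_count for n in Counter(instance_categories).values())

print(has_distractor_pressure(["car", "car", "car", "person"]))  # -> True
print(has_distractor_pressure(["car", "person", "dog"]))         # -> False
```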
Hook: When two people look almost the same, you first say how they differ from the crowd, then how they differ from each other.
Similarity Judgment (Group A vs. Group B):
- What it is: A step where an LLM (e.g., GPT-4o) identifies the most similar pair and gathers distinguishing attributes.
- How it works: (1) Partition candidates into group A (the most similar two: target + hard distractor) and group B (the rest). (2) Generate exactly two group-level discriminators (A vs. B). (3) Generate four in-pair discriminators (two noticeable, two unnoticeable) that separate the two objects in A.
- Why it matters: Without this structure, expressions become overspecified or ambiguous, enabling shortcuts. Anchor: For three people, you might note A vs. B: "short hair" vs. "long hair," then in-pair: "sunglasses" vs. "no sunglasses."
Hook: Clear, short directions beat long rambles, as long as they still uniquely point to one answer.
Referring Expression Generation (Minimal but Sufficient):
- What it is: Composing natural sentences that use one group-level and one in-pair discriminator.
- How it works: (1) Apply one of two strategies: use the target's positive attributes, or negate the distractor's attributes. (2) Encourage variety in phrasing while banning mentions of the number tags. (3) Provide the image again at this stage to improve accuracy and naturalness.
- Why it matters: Without minimal sufficiency, expressions become too long and let models ignore many clues; with it, every clue counts. Anchor: "The cup that is empty (in-pair) and positioned at the center of the pizzas (group)."
Hook: Even great assistants make mistakes; that's why editors exist.
Human Verification (3-annotator agreement):
- What it is: A final correctness and clarity check.
- How it works: (1) Annotators try grounding on the original image (no number tags). (2) They see the ground truth afterward to reflect and finalize. (3) A pair is kept only if all three agree it is correct, unambiguous, and includes hard distractors. (4) The keep rate for LLM-authored examples is about 18.7% (highly selective).
- Why it matters: Without this step, subtle errors or ambiguities could slip in and weaken the benchmark. Anchor: If two annotators find the wrong chair, the example is rejected.
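The unanimous-agreement keep rule is simple to state in code. This sketch uses a hypothetical record format (boxes as tuples, exact-match agreement); in practice annotators draw boxes and agreement would be judged by overlap:

```python
# Keep rule: an example survives only if all three annotators
# independently ground it to the correct target.
def keep(annotator_boxes, ground_truth, agree=lambda a, b: a == b):
    return len(annotator_boxes) == 3 and all(agree(b, ground_truth) for b in annotator_boxes)

gt = (10, 20, 50, 60)
print(keep([gt, gt, gt], gt))            # -> True (unanimous: keep)
print(keep([gt, gt, (0, 0, 5, 5)], gt))  # -> False (one disagreement: reject)
```

Requiring all three to agree, rather than a majority, is what makes the pipeline so selective (about 18.7% of LLM-authored candidates survive).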
Hook: "Not the one with…" can be a lifesaver when two items look the same.
Negation and Diversity:
- What it is: Purposeful use of "not" to exclude a near-twin, plus multiple paraphrases.
- How it works: (1) Identify a feature that only the hard distractor has (e.g., a metal frame). (2) Say "not the one with a metal frame." (3) Mix it with a positive clue to keep the expression short and decisive.
- Why it matters: Without negation, expressions might bloat or fail to cleanly separate lookalikes. Anchor: "The modern sofa, not the one with a metal frame."
Hook: Good tests check that guessing tricks don't work.
Anti-Shortcut Quality Checks:
- Bag-of-Words Test:
- What it is: Scramble the word order and see whether accuracy drops more on Ref-Adv.
- How it works: Replace "a red ball with yellow stripes" with "with yellow red ball stripes a" and evaluate the models.
- Why it matters: If order doesn't matter, the model isn't really reading. Anchor: On Ref-Adv, models drop more than on older datasets, showing that order matters here.
- Descriptor-Deletion Test:
- What it is: Remove one descriptor and re-evaluate.
- How it works: Extract descriptors with a strong model, delete one, rewrite the sentence, and see whether accuracy falls.
- Why it matters: If accuracy doesn't drop, the clue was unnecessary, which is a shortcut warning. Anchor: On Ref-Adv, accuracy falls more when a single clue is removed, so each clue counts.
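The deletion step can be sketched as a tiny helper. In the paper a strong model extracts the descriptors and rewrites the sentence; here the descriptor list is given by hand as a hypothetical example:

```python
# Build every one-descriptor-removed ablation of an expression's clue list.
# Re-evaluating a model on each ablation shows whether every clue was needed.
def delete_descriptor(descriptors, index):
    """Return the descriptor list with the clue at `index` removed."""
    return descriptors[:index] + descriptors[index + 1:]

descriptors = ["small", "tan", "under the table"]
ablations = [delete_descriptor(descriptors, i) for i in range(len(descriptors))]
print(ablations)
# -> [['tan', 'under the table'], ['small', 'under the table'], ['small', 'tan']]
```

If accuracy drops on every ablation, each clue was doing real work, which is what Ref-Adv's minimally sufficient expressions are designed to guarantee.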
- Fixed-Prompt Bias Test:
- What it is: Replace expressions with the generic "the one" to detect training-set bias.
- How it works: Keep the image, ask for a box with almost no text guidance, and measure accuracy.
- Why it matters: Lower scores on Ref-Adv suggest it resists dataset-specific biases. Anchor: Models do better with "the one" on older datasets than on Ref-Adv, so Ref-Adv is harder to game.
Hook: A measuring stick should scale from easy to hard without breaking.
Evaluation Setup and Metrics:
- What it is: Testing many modern MLLMs with fair prompts and clear metrics.
- How it works: (1) Evaluate 13+ MLLMs (Qwen, InternVL, Gemini, GPT-4o, Claude, GLM, CogVLM). (2) Use Acc@IoU at thresholds 0.5/0.75/0.9 and mean accuracy. (3) Include Chain-of-Thought (CoT) where supported; use Set-of-Marks (SoM) for models with limited grounding ability.
- Why it matters: Without common metrics and careful prompting, comparisons aren't meaningful. Anchor: A model that gets 54% Acc@0.5 on Ref-Adv may score over 90% on RefCOCO, but the new test uncovers real reasoning gaps.
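Acc@IoU is the standard grounding metric: a prediction counts as correct when its box overlaps the ground truth with intersection-over-union above a threshold. A minimal self-contained sketch:

```python
# Accuracy at an IoU threshold: the fraction of predictions whose box
# overlaps the ground-truth box with IoU >= thresh.
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def acc_at(preds, gts, thresh=0.5):
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts   = [(0, 0, 10, 10), (5, 5, 15, 15)]  # second prediction only partially overlaps
print(acc_at(preds, gts, 0.5))  # -> 0.5
```

Raising the threshold to 0.75 or 0.9, as Ref-Adv also reports, demands progressively tighter boxes from the same predictions.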
Secret sauce (what's clever):
- Intentionally pairing just-enough language with hard distractors forces models to check every word. The two-stage LLM pipeline prevents overspecification, and the triple human check keeps only cases that truly demand both textual and visual reasoning.
04 Experiments & Results
Hook: It's like graduating from a spelling bee (easy words) to a debate (hard thinking). Scores naturally drop, but now they mean more.
The Test (What and why?):
- What it is: Measure how well models ground concise, demanding expressions in real images with many hard distractors.
- How it works: Evaluate multiple MLLMs on Ref-Adv using accuracy at IoU thresholds (0.5/0.75/0.9) and mean accuracy. Also run ablations: bag-of-words (scrambled order), one-descriptor deletion, and the fixed "the one" prompt.
- Why it matters: If models only succeed when text is simple and images are uncluttered, we haven't measured true reasoning. Anchor: A model that struggles when words are scrambled or a single clue is removed is showing it needed those details, which is exactly the point.
The Competition (Compared to what?):
- What it is: The classic benchmarks RefCOCO, RefCOCO+, and RefCOCOg.
- How it works: Compare model accuracy on those vs. Ref-Adv under the same IoU rules. Analyze by distractor count and with/without Chain-of-Thought.
- Why it matters: A big drop on Ref-Adv shows the old tests were too easy or shortcut-prone. Anchor: Qwen2.5-VL-72B scores over 90% on the classic sets but about 54% Acc@0.5 on Ref-Adv.
The Scoreboard (with context):
- Headline: Despite strong classic scores, state-of-the-art models fall to roughly 50-60% Acc@0.5 on Ref-Adv. For instance, Qwen2.5-VL-72B reaches about 54.1% at IoU 0.5.
- CoT helps more here: Chain-of-Thought generally improves performance on Ref-Adv (where reasoning is heavier) but is less helpful, or even harmful, on older datasets with short expressions and few distractors.
- Distractors bite: Accuracy drops as the similar-object count rises, especially at 7+ distractors, confirming the benchmark's pressure.
- Bias resistance: With the fixed prompt "the one," models score higher on the old sets than on Ref-Adv, suggesting Ref-Adv is less exploitable by dataset biases.
- Order and sufficiency matter: Bag-of-words (orderless text) reduces accuracy more on Ref-Adv than on the classics, and removing a single descriptor hurts more too, which is evidence that models must read carefully and use every clue. Anchor: On Ref-Adv-s (1,142 public cases), smaller models lag, and "thinking" variants beat "instruct" variants at the same sizes, consistent with the benchmark's reasoning demand.
Surprising Findings:
- Even very large MLLMs often pick the hard distractor instead of the true target, showing perception or language-integration slips mid-reasoning.
- Some models gain from CoT on Ref-Adv but not on older datasets, aligning with the idea that thinking steps help only when the task truly needs them.
- Normalized vs. absolute coordinate outputs and specialized prompts matter less than the core difficulty injected by hard distractors and minimal expressions; every model is challenged. Anchor: In qualitative examples, models lay out a sensible plan ("find all glasses," "compare fullness," "check corner proximity") yet still box the wrong glass; tiny visual differences now matter a lot.
05 Discussion & Limitations
Hook: A good obstacle course reveals strengths and weaknesses, but it's not the whole world.
Limitations:
- Data source scope: Ref-Adv uses COCO and OpenImages photos; diversity is strong but still bounded by those datasets.
- LLM-authored expressions: Despite the strict two-stage pipeline and triple human checks, a small fraction could still contain subtle redundancy or bias.
- Bounding box focus: The benchmark evaluates box grounding, not full segmentation or 3D understanding.
- Naturalness vs. minimality: Short, sufficient expressions are natural, but some users may prefer even richer narratives; the balance is a design choice.
Required Resources:
- For evaluation: Access to MLLMs (open or proprietary), the evaluation scripts, and the Ref-Adv/Ref-Adv-s data. A GPU is helpful but not mandatory for API-based models.
- For curation replication: Panoptic annotations, LLM APIs, and human annotators for verification.
When NOT to Use:
- If you need long storytelling or dialogue-heavy, multi-turn grounding, Ref-Adv's short, minimal expressions may not match your use case.
- If you need pixel-exact segmentation or 3D pose reasoning, a segmentation or 3D benchmark is a better fit.
- If you want category detection without instance-level disambiguation, object detection benchmarks suffice.
Open Questions:
- How can models be trained to remain robust as distractors scale from 3 to 10+ while staying efficient?
- What architectures best fuse negation and relational cues with fine-grained visual details?
- Can we design training that directly penalizes shortcut behavior and rewards all-clue reasoning?
- How can Ref-Adv be extended to video, 3D scenes, or interactive robotics without re-introducing shortcuts? Anchor: Think of the next version as adding moving scenes, hands-on interaction, and even more lookalikes, without losing the "just-enough clues" spirit.
06 Conclusion & Future Work
Three-sentence summary: Ref-Adv is a new benchmark that pairs real photos with short but sufficient sentences and many hard distractors, so AI must truly use every clue. Compared to classic datasets where models score above 90%, the same models drop to about 50-60% on Ref-Adv, showing that older tests allowed shortcuts. Careful ablations (word-order scrambling, descriptor deletion, and bias checks) confirm that Ref-Adv genuinely demands both textual and visual reasoning.
Main achievement: The paper delivers a rigorously curated, bias-resistant REC benchmark, with the public Ref-Adv-s split and unified evaluation code, that reliably exposes reasoning gaps in modern MLLMs.
Future directions: Train models with anti-shortcut objectives, expand to video/3D/robotics settings, and explore richer negation and relational language. Investigate architectures and training recipes (e.g., thinking modes, grounding-specific rewards) that improve fine-grained perception together with language logic.
Why remember this: Ref-Adv raises the bar from "can you guess?" to "did you truly understand?", giving the community a clearer compass for building trustworthy multimodal AI that follows instructions precisely in the messy, lookalike-heavy real world.
Practical Applications
- Evaluate new multimodal models on Ref-Adv-s to detect shortcut reliance before deployment.
- Train with negation-rich, hard-distractor data to improve robustness against lookalikes.
- Use a descriptor-completeness loss (penalizing missing or ignored clues) to reduce shortcut behavior.
- Add Chain-of-Thought prompts during inference on hard grounding tasks to boost accuracy.
- In robotics, test grasping policies on Ref-Adv-like scenes to ensure precise object selection.
- For accessibility apps, validate that phrases with negation ("not the one…") are grounded correctly.
- In retail image search, require models to disambiguate among many similar products using minimal cues.
- Adopt the two-stage discriminator pipeline for creating internal QA datasets with minimal but sufficient expressions.
- Integrate descriptor-deletion checks into evaluation dashboards to monitor reliance on every clue.
- Use fixed-prompt bias tests ("the one") to detect dataset bias before claiming state-of-the-art results.