
DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

Intermediate
Chenlong Deng, Mengjie Deng, Junjie Wu et al. Ā· 2/11/2026
arXiv

Key Summary

  • Most image search systems judge each photo by itself, which fails when clues are split across many photos taken over time.
  • This paper turns image search into a treasure hunt: an AI agent explores your whole photo history, gathers clues, and reasons in steps to find the right set of photos.
  • They build DISBench, a new test made from real visual histories (109,467 photos, 57 users) that forces context-aware reasoning instead of single-shot matching.
  • A human–model pipeline mines hidden links (like recurring logos, people, or places) and forms a memory graph, then humans verify hard, context-heavy queries.
  • Their baseline agent (ImageSeeker) uses tools (search, metadata filters, photo viewing, web search) plus dual memory to plan long searches.
  • Even top models struggle: the best F1 is 55.0, and exact set match (EM) is only 28.7, far below scores on standard retrieval benchmarks.
  • Inter-event questions (linking multiple trips/events) are much harder than intra-event ones (searching within a single event).
  • Direct embedding retrieval performs poorly (Recall@3 ā‰ˆ 10–14%), showing that independent matching hits a hard ceiling here.
  • Ablations show metadata tools and explicit memory matter most for success; errors mostly come from reasoning breakdowns and fine-grained visual confusion.
  • Running several agent attempts in parallel helps a lot (Best@k jumps from 35.4 to 60.8 F1), revealing untapped potential if we can pick better reasoning paths.

Why This Research Matters

Most of us have huge photo libraries where a single picture rarely tells the full story. This work helps AI search like we remember—by connecting who, where, and when across many images. That unlocks better personal assistants for finding family moments, trip highlights, or school events that require context to identify. It also supports journalists, historians, and educators who need to trace events over time with careful verification. By exposing real weaknesses in current systems, the benchmark guides research toward agents that plan, track, and verify. With privacy protections, such context-aware retrieval could make our digital memories far more accessible and meaningful.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you’re trying to find ā€œthe photo where only the lead singer is on stage at the concert with the blue-and-white logo.ā€ You remember the logo from one picture and the singer from another. No single photo tells the whole story—you need to connect clues across many pictures.

🄬 The Concept (Image Retrieval Techniques): It’s about how computers find the right images from a big collection based on a query. Traditionally, they compare your words to each image separately and pick the closest matches.

  • How it works:
    1. Turn the query into a vector (a numeric summary of meaning)
    2. Turn every image (and its caption or features) into vectors
    3. Rank images by similarity to the query vector
    4. Return the top results
  • Why it matters: Without it, searching photos would be like rifling through boxes in a dark attic. But if it treats each photo alone, it can’t solve puzzles that need context from multiple photos.

šŸž Anchor: When you ask ā€œyellow cat,ā€ old systems show yellow cats—easy. But for ā€œthe concert with that blue-and-white logo where only the lead singer is on stage,ā€ independent matching often fails.

šŸž Hook: You know how a memory sometimes lives in pieces—one friend remembers the place, another recalls the date, and a third remembers who was there? You build the full story by combining everyone’s clues.

🄬 The Concept (Vision-Language Models): These are models that understand pictures and words together.

  • How it works:
    1. Learn to pair images and captions
    2. Build joint representations so words and images live in the same ā€œspaceā€
    3. Use this to match text queries to images
  • Why it matters: They’re great at ā€œwhat’s in the image,ā€ but if clues are split across different photos, they still struggle.

šŸž Anchor: A VLM can find ā€œa guitar on stage,ā€ but it won’t know which concert unless you add and use context from other images.

šŸž Hook: Solving a mystery requires thinking about how clues connect over time—what happened first, where it happened, and who was involved.

🄬 The Concept (Contextual Reasoning): It means using background, timing, and relationships between images to answer a question.

  • How it works:
    1. Find anchor clues (like a logo or landmark)
    2. Link them to events (like concerts or trips)
    3. Apply constraints (only the singer, or ā€œwithin six monthsā€)
    4. Verify candidates by cross-checking multiple images
  • Why it matters: Without contextual reasoning, the system can’t tell near-duplicates and distractors apart when looks alone aren’t enough.

šŸž Anchor: To find ā€œthe statue photographed in two different trips within half a year,ā€ you must compare across events and check dates—appearance alone won’t cut it.

The world before this paper: Most retrieval systems assumed each image could be judged on its own. Even newer ā€œreasoningā€ methods often used external facts (like a jersey number) but still picked images independently once they knew the target concept. This breaks down when the query depends on evidence split across the visual history.

The problem: Real photo collections have temporal sequences and repeated themes—similar concerts, multiple beach trips, recurring friends. The target might be visually ambiguous (many nearly identical photos), and deciding which ones count requires context (event, time, place, identity, or constraints like ā€œonly the lead singerā€).

Failed attempts: Researchers tried stronger embeddings, LLM-assisted query decomposition, and external knowledge lookups. These helped with semantics (ā€œwhat is O’Neal’s jersey number?ā€) but not with chaining clues scattered across the user’s own photos. Independent matching stayed a hard ceiling: it can’t use cross-image evidence.

The gap: We need systems that plan, explore, and connect clues across the corpus—treating photo search like a multi-step investigation, not a one-shot guess.

Real stakes: Everyday people take thousands of photos. Finding ā€œthat one momentā€ often needs context (ā€œtwo days after the fireworks,ā€ ā€œthe cafĆ© we went to before the museumā€). In families, education, travel, journalism, or safety, context-aware search can surface exactly the right set of photos—saving time and making digital memories truly useful.

šŸž Anchor: Think of your photo library like a comic book series: a single panel rarely explains the plot—you understand by linking panels across pages. That’s the kind of reasoning this paper tackles.

02Core Idea

šŸž Hook: Think of a treasure hunt where the map is not on one page—you must flip through several pages, collect fragments, and assemble them to find the final spot.

🄬 The Concept (DeepImageSearch Paradigm): It’s a new way to do image search where the AI acts like an agent, exploring your whole visual history, planning steps, and chaining clues to find the correct set of images.

  • How it works:
    1. Plan: Break the query into anchors and constraints
    2. Explore: Use tools to search, filter by time/place, and view candidates
    3. Connect: Build evidence chains across events (logos → event → people → targets)
    4. Verify: Re-check candidates and constraints before answering
  • Why it matters: Without treating retrieval as an exploration, the system misses context-dependent targets and confuses look-alike distractors.

šŸž Anchor: To answer ā€œphotos from the musical performance with the blue-and-white logo where only the lead singer is on stage,ā€ the agent first finds the logo (anchor), locates the event, then filters to photos with only the singer.

Three analogies for the same idea:

  • Detective story: Identify a key clue (logo), find the right crime scene (event), then confirm the suspect (only the singer) with alibis (time/place metadata).
  • Library hunt: Use the catalog to find the right series (event), then read only the chapters that match the constraints (targets).
  • Scavenger game: Start at a sign (anchor), follow arrows (associations) to reach the treasure (final photo set).

Before vs. After:

  • Before: Systems matched the query to each photo separately. Good for ā€œfind a yellow cat.ā€ Weak for ā€œfind this statue across two trips within six months.ā€
  • After: The agent explores, keeps notes, and reasons over the whole collection, enabling context-heavy tasks once thought out of reach.

Why it works (intuition):

  • Anchors shrink the haystack by localizing likely events.
  • Metadata (time, GPS) acts like guardrails that filter out look-alikes.
  • Cross-image associations (same logo, same person) create bridges between distant events.
  • A dual-memory setup helps the agent stay on track across many steps.

Building blocks (each with a mini sandwich):

  • šŸž Hook: You know how organizing a trip album helps you remember what happened when? 🄬 The Concept (Visual Histories/Events): A user’s photos over time naturally form events (concerts, trips).

    • How it works: Group by photosets; preserve timestamps/locations.
    • Why it matters: Events provide the stage where anchors and targets meet.
    • šŸž Anchor: ā€œAll the parade photos from July 4thā€ form an event.
  • šŸž Hook: Imagine a corkboard with strings connecting photos that share a logo or a person. 🄬 The Concept (Spatiotemporal Memory Graph): A graph that links photos, events, visual clues, and people across time and space.

    • How it works: Nodes for photos, photosets, visual clues, persons; edges for containment and verified reappearances; edges store natural-language rationales.
    • Why it matters: Without a map of connections, the agent can’t hop from clue to clue.
    • šŸž Anchor: A ā€œblue globe logoā€ node connects multiple photosets from the same festival.
  • šŸž Hook: Like having a toolbox for a school project: scissors, glue, ruler, each with a job. 🄬 The Concept (Multimodal Agent Tools): Search, get metadata, filter by time/place, view photos, web search.

    • How it works: Combine tools step by step; save intermediate subsets; search within subsets.
    • Why it matters: No single tool solves the task—coordination is key.
    • šŸž Anchor: Filter August photos → save as subset → search ā€œbeachā€ within it → inspect candidates.
  • šŸž Hook: Long stories need notes, or you forget what you figured out. 🄬 The Concept (Dual-Memory System): Keeps explicit photo subsets and compressed summaries of goals and findings.

    • How it works: Named subsets persist across steps; session + working memory compress long chats.
    • Why it matters: Without memory, the agent loses track in long searches.
    • šŸž Anchor: ā€œaug5_beach_candidatesā€ stays available while you verify the final three photos.
  • šŸž Hook: Practice tests help you learn better than random drills. 🄬 The Concept (DISBench): A benchmark that forces agents to use context by posing hard, realistic queries over real visual histories.

    • How it works: Human–model pipeline mines associations, builds a memory graph, samples subgraphs, then humans verify and refine queries.
    • Why it matters: Without the right test, we can’t see real weaknesses or measure progress.
    • šŸž Anchor: ā€œFind all photos of the same statue appearing in different trips within half a year.ā€

Together, these pieces turn retrieval into guided exploration that mirrors how people actually remember and search.

03Methodology

At a high level: Input (visual history + text query) → Plan (anchors + constraints) → Explore with tools (search, filter, inspect) + Track state (memory) → Connect clues via the memory graph → Verify and output the target photo set.

Key steps, with mini sandwiches for new concepts:

  1. Understand the query and find anchors
  • šŸž Hook: You know how you first look for a landmark when finding a meeting spot?
  • 🄬 The Concept (Anchors): Distinctive clues (logo, landmark, person) that help localize the right event.
    • How it works: Parse the query to separate anchors (logo) from target constraints (only the lead singer).
    • Why it matters: Without anchors, the search space is too big and noisy.
  • šŸž Anchor: ā€œBlue-and-white event logoā€ is used to find the correct concert before filtering to ā€œonly the lead singer.ā€
  2. Explore with tools
  • šŸž Hook: Like using a map app: search places, filter by date, and zoom in to check.
  • 🄬 The Concept (Toolset Coordination): Combining ImageSearch, GetMetadata, FilterMetadata, ViewPhotos, WebSearch.
    • How it works step-by-step:
      1. ImageSearch: retrieve similar photos using text and/or example photos; optionally within a saved subset
      2. GetMetadata: read times and places of candidate photos
      3. FilterMetadata: narrow candidates with rules (e.g., same day, within 10 km, within 6 months)
      4. ViewPhotos: visually verify fine details (is it the same singer/statue?)
      5. WebSearch: resolve external facts if needed (e.g., ā€œWhat number did O’Neal wear?ā€)
    • Why it matters: Missing any tool breaks the chain—no time filters means distractors flood the list; no viewing means subtle errors slip in.
  • šŸž Anchor: First filter to ā€œ2014-07-31,ā€ then search ā€œsea beachā€ within that subset, then view the top 5 to confirm.
  3. Keep track of progress with memory
  • šŸž Hook: When you do a long scavenger hunt, you keep a list of found items and a plan for the next steps.
  • 🄬 The Concept (Dual-Memory System): Explicit subsets + compressed context memory.
    • How it works:
      • Explicit memory: save subsets like ā€œaug5_candidatesā€; intersect or search within them later
      • Compressed memory: when the conversation gets long, keep a short summary of goals and key findings
    • Why it matters: Without it, multi-step searches unravel—you repeat steps or forget constraints.
  • šŸž Anchor: After multiple steps, the agent still knows it must find ā€œbeach photos two days after fireworks.ā€
  4. Connect distant clues with a graph
  • šŸž Hook: Think of threading beads: each bead is a clue; the string is the relationship.
  • 🄬 The Concept (Spatiotemporal Memory Graph): Nodes for photos, events (photosets), visual clues, and people; edges for containment and verified reappearances.
    • How it works:
      1. Visual Semantic Parsing: extract summaries, clues (logos, text), and person states via VLM + face clustering
      2. Latent Association Mining: retrieve candidate reappearances within/outside events; verify with a VLM
      3. Graph Construction: add nodes/edges; store edge rationales in natural language
    • Why it matters: It exposes paths like logo → event → person → target, which embeddings alone can’t see.
  • šŸž Anchor: The ā€œlokerse feestenā€ logo node links two photosets; following that edge localizes the correct concert.
  5. Build a robust benchmark (DISBench) via human–model collaboration
  • šŸž Hook: A good exam makes you show your work, not just guess.
  • 🄬 The Concept (Query Synthesis Pipeline): A semi-automated pipeline to create hard, context-heavy queries at scale.
    • How it works:
      • Parse images and metadata; list visual clues and person tracks
      • Mine associations (same clue across photosets) via retrieve→verify
      • Sample a subgraph with balanced edge types (avoid only within-event edges)
      • Ask a model to propose multi-step queries requiring anchors and context
      • Humans filter, annotate all targets exhaustively, refine wording, and cross-validate
    • Why it matters: Ensures queries need context, aren’t solvable by looks alone, and have complete, reliable answers.
  • šŸž Anchor: Retained 122 tough queries out of ~2,000 candidates (6.1%), with annotator IoU ā‰ˆ 0.91.
  6. Two query types the agent must handle
  • šŸž Hook: Sometimes you zoom into one party; sometimes you compare across different parties.
  • 🄬 The Concept (Intra-Event vs. Inter-Event Queries):
    • How it works:
      • Intra-Event: localize the event with an anchor; filter within-event to find targets
      • Inter-Event: scan multiple events to verify recurrence plus constraints (time/place windows)
    • Why it matters: Inter-event is harder and needs stronger memory and verification.
  • šŸž Anchor: ā€œOnly the lead singer on stageā€ (intra-event) vs. ā€œsame statue in two trips within 6 monthsā€ (inter-event).
  7. The secret sauce
  • Balanced subgraph sampling prevents only-easy, within-event questions
  • Strict query constraints force visual ambiguity + contextual identifiability
  • Fine-grained tools + dual memory keep long-horizon reasoning stable

Concrete example (full mini-run):

  • Query: ā€œFind sea photos taken at the beach two days after the fireworks show.ā€
    1. Anchor fireworks via ImageSearch; disambiguate multiple events with GetMetadata
    2. Compute dates +2; FilterMetadata for each date
    3. Search ā€œsea beachā€ within filtered subsets
    4. ViewPhotos to confirm final candidates
    5. Return the set of IDs
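The date arithmetic in step 2 is the crux of this mini-run. A toy version with made-up metadata (real photos would carry EXIF timestamps, and ā€œfireworksā€ detection would come from ImageSearch, not captions):

```python
from datetime import date, timedelta

# Toy photo metadata (illustrative only).
photos = [
    {"id": 1, "date": date(2014, 7, 29), "caption": "fireworks over the bay"},
    {"id": 2, "date": date(2014, 7, 31), "caption": "sea beach at noon"},
    {"id": 3, "date": date(2014, 7, 31), "caption": "hotel lobby"},
    {"id": 4, "date": date(2014, 8, 10), "caption": "sea beach sunset"},
]

# Step 1-2: anchor the fireworks event(s), then shift each date by +2 days.
fireworks_dates = [p["date"] for p in photos if "fireworks" in p["caption"]]
target_dates = {d + timedelta(days=2) for d in fireworks_dates}

# Step 3-5: search "beach" within the date-filtered subset and return IDs.
answer = [p["id"] for p in photos
          if p["date"] in target_dates and "beach" in p["caption"]]
print(answer)  # → [2]
```

Note how photo 4 (a perfectly good beach photo) is correctly excluded: only the temporal relation to the fireworks anchor identifies the target.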

04Experiments & Results

šŸž Hook: Grading a tough math test tells you where students really need help. This benchmark is that tough test for image search agents.

🄬 The Concept (Evaluation Setup): Test many multimodal agents with the same tools and memory to compare pure reasoning skill.

  • How it works:
    1. Agentic evaluation: models plan and call tools to return a set of photo IDs
    2. Metrics: Exact Match (EM) for perfect set; F1 for overlap quality
    3. Retrieval baselines: pure embedding search with MAP/Recall/NDCG@k
  • Why it matters: Without standard tools and metrics, we can’t fairly measure progress.

šŸž Anchor: Every model gets the same toolbox; only their thinking differs.

Main scoreboard (context):

  • Best model: EM 28.7, F1 55.0. That’s like getting a solid B for partial credit but rarely scoring 100% on the exact set.
  • Big gap from traditional retrieval tasks where top models often near the ceiling.
  • Inter-Event is much harder than Intra-Event for strong models—cross-event linking is the main headache.

Direct retrieval baseline (why embeddings hit a ceiling):

  • Qwen3-VL-Embedding (2B/8B) and Seed-1.6-Embedding show Recall@3 ā‰ˆ 10–14%, NDCG@5 ā‰ˆ 13–17%.
  • Interpretation: Imagine a classroom where students pick from look-alike answers. Even strong guessers often grab distractors because they can’t use context rules (time/place/person) to filter.

Ablations (what matters most):

  • Removing GetMetadata drops F1 by ~5.7 points—the time/place guardrails are crucial to separate near-duplicates.
  • Removing explicit memory hurts more on Inter-Event tasks—long hunts need saved subsets to keep track of partial finds.
  • ViewPhotos and WebSearch add complementary gains (fine-grained checks and external facts).
  • Memory compression has the smallest effect, meaning summaries still preserve enough state for most tasks.

Test-time scaling (hidden potential):

  • Run N parallel attempts and pick the best (Best@k) or majority vote.
  • Best@k jumps from 35.4 → 60.8 F1 as N increases—some runs stumble, but others find great reasoning paths.
  • Majority voting lags, showing the hard part is choosing the right path, not just averaging.
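Best@k and majority voting over parallel runs can be sketched like this (toy runs, not the paper's data):

```python
from collections import Counter

def f1(pred, gold):
    """Set F1 via the equivalent 2*tp / (|pred| + |gold|) form."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    return 2 * tp / (len(pred) + len(gold)) if tp else 0.0

def best_at_k(runs, gold):
    """Oracle Best@k: score every parallel attempt, keep the single best F1."""
    return max(f1(r, gold) for r in runs)

def majority_vote(runs, gold, threshold=0.5):
    """Keep photo IDs returned by more than `threshold` of the runs, then score."""
    counts = Counter(pid for r in runs for pid in set(r))
    voted = {pid for pid, c in counts.items() if c / len(runs) > threshold}
    return f1(voted, gold)

gold = {1, 2, 3}
runs = [{1, 2, 3}, {1, 4}, {4, 5}]       # one great run, two weaker ones
print(best_at_k(runs, gold))             # → 1.0
print(round(majority_vote(runs, gold), 3))  # → 0.4
```

The toy example mirrors the paper's finding: the oracle picks the one good run, while voting lets the two weak runs drag in a wrong photo and drop a right one.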

Error analysis (where agents stumble):

  • Reasoning breakdown (36–50%): agents reach the right area but lose the thread—stop early, forget constraints, or skip a needed check.
  • Visual discrimination: mix up similar statues/churches/singers; identity and attribute checks remain tough.
  • Episode misgrounding + clue mislocalization: fail to anchor to the correct event or place, echoing the Inter-Event difficulty.

šŸž Anchor: In a case study, the agent found one correct close-up of a church spire but missed another from a different angle—fine-grained visual judgement and systematic re-checking were the missing steps.

05Discussion & Limitations

Limitations:

  • Scale: DISBench has 122 high-quality, hard queries across 57 users—smaller than huge retrieval sets, but purposefully strict to ensure reliability.
  • Data source: Built from YFCC100M (Flickr). That consistency helps reproducibility but may carry demographic skew relative to smartphone-era libraries.
  • Metadata assumption: Tasks expect timestamps and GPS; many consumer photos have them, but not all. Systems need fallbacks when metadata is missing.
  • Current agent baseline: Modular and simple by design; not a final solution. There’s room for reflection, backtracking, or learned planning.
  • Visual fine-grain: Identity and attribute-level decisions still cause many errors.

Required resources:

  • A capable multimodal model with tool-use ability
  • Embedding index per user for ImageSearch
  • Metadata readers and filters, plus a lightweight web search API
  • Memory management to handle long reasoning traces (ā‰ˆ128K context helpful)

When NOT to use:

  • Purely visual, one-shot queries (ā€œa red apple on a tableā€) where embeddings already excel
  • Collections with almost no metadata and minimal cross-image recurrence
  • Hard privacy settings where cross-photo linking isn’t permitted

Open questions:

  • How to pick or fuse the best reasoning path among many (beyond Best@k)?
  • Can agents learn policies to plan, backtrack, and verify more reliably?
  • How to robustly resolve identities and fine-grained attributes across angles/lighting?
  • Can we scale query synthesis while keeping the same level of difficulty and annotation trustworthiness?
  • How to handle missing or noisy metadata via imputation and uncertainty-aware reasoning?

06Conclusion & Future Work

Three-sentence summary:

  • This paper reframes image retrieval as agentic exploration over visual histories, where answers require chaining clues across multiple photos.
  • It introduces DISBench, a carefully built benchmark with context-heavy queries, and a baseline agent with coordinated tools and dual memory.
  • Experiments show state-of-the-art models still struggle (EM 28.7, F1 55.0), proving that corpus-level reasoning is the next frontier for retrieval.

Main achievement:

  • Defining and operationalizing context-aware, multi-step image retrieval—complete with a scalable data construction pipeline and a rigorous benchmark that exposes real capability gaps.

Future directions:

  • Train agents to plan, reflect, and backtrack; learn to select promising reasoning paths; integrate stronger identity and attribute verification; and handle missing metadata.
  • Expand the benchmark with diverse sources and more queries while preserving annotation quality.

Why remember this:

  • It marks the shift from ā€œfind a lookā€ to ā€œfollow the story.ā€ By making retrieval context-aware and agentic, we move closer to assistants that can truly navigate our visual memories the way we do—by connecting who, where, and when into one coherent answer.

Practical Applications

  • Personal photo assistants that can answer context-heavy requests like ā€œthe picnic photos the day after graduation.ā€
  • Family archive builders that stitch together consistent stories across years (same people, places, and traditions).
  • Travel diaries that auto-group revisited landmarks and show how they changed over time.
  • Education tools that teach timelines by linking event photos with dates and locations.
  • News and OSINT workflows that verify recurring scenes or objects across different days and places.
  • Safety and compliance reviews that find all instances of a specific sign, badge, or uniform across events within certain dates.
  • Enterprise media libraries that locate campaign assets tied to particular venues or time windows.
  • Museum or gallery digitization search that links repeated artworks, labels, and rooms across exhibits.
  • Photo-cleanup tools that detect near-duplicates but keep the contextually ā€œcorrectā€ shots based on constraints.
  • Assistive tech for memory support, helping users recall sequences (ā€œbefore/after the hospital visitā€) through natural language.
#context-aware image retrieval #multimodal agents #visual history exploration #spatiotemporal memory graph #multi-step reasoning #vision-language models #DISBench benchmark #agentic retrieval #metadata filtering #event localization #anchor clues #long-horizon planning #dual-memory system #embedding retrieval limits #test-time scaling