Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Key Summary
- This paper shows that letting an AI search many places at the same time (in parallel) can beat making it think in long, slow chains.
- The new framework, called Search More, Think Less (SMTL), plans the task, runs several subtasks together, and keeps refining the plan as results come in.
- SMTL also includes a data-making pipeline that creates both exact-answer questions and open-ended research tasks, so one agent can handle many types of problems.
- Across tough benchmarks like BrowseComp, GAIA, Xbench, and DeepResearch Bench, SMTL is competitive or state of the art while using far fewer reasoning steps.
- On BrowseComp, SMTL-300 reached 48.6% accuracy with far fewer steps than strong baselines, cutting average steps by around 70.7% versus MiroThinker-v1.0.
- SMTL achieves efficiency by making more tool calls per step (about 3.5) to gather dense evidence instead of stretching out long chains of thought.
- A plan-centered context strategy avoids overflowing the model’s memory window by periodically refreshing the plan and dropping old clutter.
- Training mixes supervised fine-tuning and reinforcement learning so the agent learns both stable habits and reward-seeking improvements.
- Wider search breadth (higher top-k) helps more than simply taking more steps; packing more candidates into each search action boosts success.
- SMTL’s main idea: widen the funnel of evidence first, then think just enough to stitch it together correctly.
Why This Research Matters
SMTL helps AI assistants deliver trustworthy answers faster by widening the search first and thinking just enough to stitch evidence together. This reduces wait times for tasks like homework help, travel planning, shopping comparisons, or news verification. For professionals, it means more reliable briefings and reports that survey multiple sources without bogging down. By training on both exact-answer and open-ended tasks, one agent can switch gracefully between precise facts and nuanced synthesis. The plan-centered memory approach keeps long sessions on track within fixed context budgets. Overall, SMTL sets a practical path to scalable, efficient research agents that are both quick and careful.
Detailed Explanation
01Background & Problem Definition
🍞 Hook: Imagine you and your friends need to write a big school report. One way is to let one friend read one website at a time, think a lot, then read another, and so on. That’s slow. Another way is to split up and read many websites at once, then meet to share notes. That’s much faster and often better.
🥬 The World Before: AI “research agents” used to get better mostly by thinking longer—adding more steps of reasoning, more tool calls, and more back-and-forth with the web. This did help accuracy on hard tasks, like finding multi-step facts across different websites or writing reports that combine many sources. But there was a catch: it made answers slower and more expensive to produce. Latency went up because the agent walked in a single line—one action at a time—like a single-file parade that never speeds up.
🥬 The Problem: Two headaches showed up. First, efficiency: long, sequential chains meant lots of waiting for each tool call, and the model’s memory (context window) filled with clutter before the task finished. Second, generalization: some benchmarks ask for exact answers (like “What year…?”), while others want open-ended research reports (like “Compare the policies of…”). Agents trained for one type often stumbled on the other because the goals and scoring rules are different.
🥬 Failed Attempts: People tried longer contexts, more careful chains of thought, or stricter step-by-step protocols. These helped a bit but kept the same bottleneck: one-lane traffic. Others made synthetic training data, but mostly for exact Q&A or tightly shaped tasks. That left open-ended research under-served, and it didn’t explicitly teach the agent to gather evidence efficiently (lots of repeated or low-signal steps remained).
🥬 The Gap: What was missing was a way to run subtasks together (not just one after another), then keep the plan tidy as results poured in, and a data pipeline that creates both exact-answer and open-ended tasks so one agent can learn to handle both. In short: a parallel workflow plus plan-centered memory management, trained on unified, diverse tasks.
🥬 Why It Matters: In daily life, speed plus quality matters. If your homework helper, shopping assistant, or travel planner can scan more options at once and quickly confirm details, you get better answers faster. For professionals (journalists, analysts, researchers), gathering trustworthy evidence from many places at once and neatly summarizing it is gold. Without this, agents get bogged down, miss key sources, or run out of memory.
🍞 Anchor: Think of planning a class trip. The old way: one student calls the bus company, waits, then checks museum tickets, waits, then asks for lunch options—one by one. The new way: three students do those calls at the same time and text updates into a shared doc. You finish faster, you don’t forget details, and you choose better options. That’s the spirit of this paper.
02Core Idea
🍞 Hook: You know how bees visit many flowers at once and then bring nectar back to the hive? The hive doesn’t send just one bee in a straight line—it sends a team.
🥬 The “Aha!” Moment (one sentence): Replace deep, linear chains of thought with a wide, parallel hunt for evidence—then stitch it together with a plan that keeps getting refreshed.
Multiple Analogies:
- Bees and Flowers: Many bees (subtasks) collect nectar (evidence) in parallel, and the hive (the planner) keeps updating where to go next based on which fields are blooming (useful sources).
- Cooking Show: Prep multiple ingredients at the same time (search and crawl several pages), then assemble the dish (final answer) quickly because everything is already chopped and ready.
- Library Scavenger Hunt: Teams search different shelves at once; a coordinator updates the map as hints arrive, avoiding crowded aisles and dead-ends.
Before vs After:
- Before: Agents pushed longer reasoning chains, often issuing one tool call per step, refocusing repeatedly, and risking context overflow.
- After: SMTL creates an initial plan, runs multiple subtasks in parallel, regularly refines the plan using the latest evidence, and manages context by refreshing the plan so the memory stays useful.
Why It Works (intuition, not equations):
- Bottleneck Buster: Parallel subtasks lift the one-lane traffic jam; more signal per unit time.
- Information Density: Doing several targeted tool calls per step packs richer evidence into each round, reducing guessy reformulations.
- Plan-Centered Memory: By periodically refreshing the plan and dropping stale logs, the agent keeps the essentials while staying within context limits.
- Unified Training: Creating both exact and open-ended tasks teaches the agent to switch goals gracefully, so it generalizes better.
Building Blocks (explained with Sandwich pattern where new):
🍞 Hook: Imagine a coach making a game plan before kickoff. 🥬 Concept: Initial Plan. What: a structured list of subtasks and how they connect. How: break the big question into smaller, partly independent goals (retrieve fact A, verify relation B). Why: without a plan, the agent zigzags and repeats work. 🍞 Anchor: For “Which scientist wrote X and won prize Y?”, subtasks might be: (1) find candidates who wrote X, (2) check prize Y winners, (3) match names and years.
🍞 Hook: Picture opening several browser tabs to compare hotels at once. 🥬 Concept: Parallel Evidence Acquisition. What: collect information from multiple sources at the same time. How: issue several web searches or page crawls in one step, each with a clear goal. Why: without it, everything queues up; you wait longer and miss opportunities. 🍞 Anchor: While one tab checks museum hours, another checks ticket prices, and another verifies holiday closures—faster trip planning.
🍞 Hook: Think of cleaning your desk by keeping only the current worksheet on top. 🥬 Concept: Plan-Centered Context Management. What: periodically refresh the plan, keep key progress, and drop old clutter. How: after a few steps, summarize what’s done, what’s blocked, and what’s next; then continue from this clean plan. Why: without it, the memory gets jammed, and the agent forgets important bits. 🍞 Anchor: You keep a neat to-do list instead of stacks of scratch paper; you move quicker.
🍞 Hook: Like building a giant practice workbook that includes both math drills (exact answers) and essays (open-ended writing). 🥬 Concept: Unified Data Synthesis Pipeline. What: a system that constructs tasks for both deterministic Q&A and open-ended research. How: build knowledge graphs, extract subgraphs, generate multi-hop questions, and create report-style prompts—then verify quality and variety. Why: without both kinds, the agent gets lopsided training and fails to generalize. 🍞 Anchor: A soccer team practices dribbling (specific skill) and scrimmages (open play); they do better in real matches.
🍞 Hook: Picture learning with a teacher first, then practicing for points. 🥬 Concept: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). What: SFT teaches good habits from examples; RL rewards successful outcomes. How: SFT on curated parallel-search trajectories; then RL with a judge that gives a reward for correct final answers and proper tool use. Why: without SFT, behavior is unstable; without RL, it won’t push for top results. 🍞 Anchor: Music lessons (SFT) plus recitals with scores (RL) make a confident performer.
Bottom line: By searching wider first, then thinking just enough with a tidy and refreshed plan, SMTL is both faster and more reliable across many task types.
03Methodology
High-Level Recipe: Input question → Initial plan (decompose) → Parallel tool calls (search/crawl) → Aggregate observations into a reasoning state → Periodic plan refinement and context refresh → Final answer.
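This loop can be sketched in a few lines. This is a minimal illustration with toy stand-ins, not the paper's API: `run_smtl`, `mock_search`, and the callback names are all invented here.

```python
from concurrent.futures import ThreadPoolExecutor

def mock_search(query):
    # Stand-in for the real web-search tool.
    return [f"result for: {query}"]

def run_smtl(question, make_plan, refine_plan, answer, max_rounds=5):
    """Sketch of the SMTL loop: plan -> parallel tool calls -> aggregate -> refine."""
    plan = make_plan(question)           # decompose into subtasks
    state = []                           # shared reasoning state
    for _ in range(max_rounds):
        ready = [s for s in plan if not s["done"]]
        if not ready:
            break
        # Run every ready subtask in parallel instead of one per step.
        with ThreadPoolExecutor() as pool:
            observations = list(pool.map(lambda s: mock_search(s["goal"]), ready))
        for sub, obs in zip(ready, observations):
            state.extend(obs)
            sub["done"] = True
        plan = refine_plan(plan, state)  # periodic plan refresh
    return answer(question, state)
```

The key structural difference from a sequential agent is the `pool.map` call: one step fans out over all ready subtasks instead of consuming them one at a time.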
We will follow the Sandwich pattern for each key step.
- Initial Plan Construction 🍞 Hook: You know how you outline a school report before writing it? 🥬 What: The agent creates a plan listing subtasks and their relationships. How: It decomposes the big question into smaller goals (retrieve, verify, compare), ordering them so many can run at once. Why: Without a plan, steps collide, evidence repeats, and time is wasted. 🍞 Anchor: For “Find the artist who painted a mural in 1992 in City Z and later won Award Q,” subtasks might be: (1) find murals in City Z in 1992, (2) list artists, (3) check Award Q winners, (4) intersect names.
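A plan with partly independent subtasks can be modeled as a small dependency graph: anything whose prerequisites are met is ready to run in parallel. An illustrative sketch (`Subtask` and `ready_subtasks` are names invented here):

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    name: str
    deps: list = field(default_factory=list)  # names of prerequisite subtasks
    done: bool = False

def ready_subtasks(plan):
    """Subtasks whose prerequisites are all complete can run in parallel."""
    finished = {s.name for s in plan if s.done}
    return [s for s in plan if not s.done and all(d in finished for d in s.deps)]

# Plan for "Which scientist wrote X and won prize Y?"
plan = [
    Subtask("find_authors_of_X"),
    Subtask("find_prize_Y_winners"),
    Subtask("intersect_names", deps=["find_authors_of_X", "find_prize_Y_winners"]),
]
# The two retrieval subtasks are independent, so both are ready at once;
# the intersection must wait for them.
```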
- Parallel Execution and State Aggregation 🍞 Hook: Imagine three classmates each checking a different reliable site at the same time. 🥬 What: The agent executes several subtasks in parallel using tools (web search, page crawl) and merges what they find. How: It picks ready subtasks, launches multiple tool calls per step, then combines observations into a shared “reasoning state.” Why: Without parallelism, the agent queues tasks and loses time reformulating queries. 🍞 Anchor: While one call searches “1992 mural City Z,” another crawls the city’s arts council page, and a third checks a museum archive.
Mathematical snapshot of state updates:
- Formula: S_{t+1} = Merge(S_t, O_t), where O_t = {o_t^1, …, o_t^k} are the observations returned by the k tool calls issued at step t. Example: suppose at step t two actions are issued: a_t^1 is a web search for “1992 City Z mural list,” a_t^2 is a page crawl of a city archive URL; observation o_t^1 returns 5 candidate links, and o_t^2 returns 2 artist names. If S_t had 10 facts recorded, the merge produces S_{t+1} with 10+5+2=17 tracked items.
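That merge can be written out directly. A toy sketch (`merge_state` is an invented name; a real system would deduplicate and attribute sources more carefully):

```python
def merge_state(state, observations):
    """Fold one step's parallel observations into the reasoning state,
    keeping order and dropping exact duplicates."""
    merged = list(state)
    for obs in observations:
        for item in obs:
            if item not in merged:
                merged.append(item)
    return merged

s_t = [f"fact_{i}" for i in range(10)]  # 10 facts already tracked
o1 = [f"link_{i}" for i in range(5)]    # search observation: 5 candidate links
o2 = ["artist_A", "artist_B"]           # crawl observation: 2 artist names
s_next = merge_state(s_t, [o1, o2])     # 10 + 5 + 2 = 17 tracked items
```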
- Periodic Plan Refinement 🍞 Hook: Picture a coach calling a quick timeout to update the game plan. 🥬 What: Every few steps, the agent updates the plan (remove completed subtasks, add new ones, re-order blocked ones). How: It checks progress, dependencies, and which evidence paths look promising. Why: Without refinement, the agent keeps hammering on blocked paths or misses newly revealed shortcuts. 🍞 Anchor: After finding the likely artist shortlist, the plan drops “find mural list” (done) and adds “verify award year” (new).
Mathematical snapshot of refinement:
- Formula: P_{t+1} = Refine(P_t, S_t), which removes completed subtasks, adds or splits pending ones, and re-orders blocked ones in light of the current state S_t. Example: at step t, the plan P_t has 4 subtasks; the completed set has size 2, the pending set has size 2. The refinement splits one pending subtask into 2 clearer checks (award year vs. award category), yielding a new plan P_{t+1} with 3 active subtasks and 2 completed (5 tracked units in total).
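A minimal version of that refinement step, assuming the simple split rules described above (the function and field names are illustrative, not the paper's interface):

```python
def refine_plan(plan, split_rules):
    """Archive completed subtasks and split pending ones into clearer
    checks whenever a split rule applies. Returns (active, completed)."""
    completed = [s for s in plan if s["done"]]
    active = []
    for sub in plan:
        if sub["done"]:
            continue
        if sub["name"] in split_rules:  # replace one vague subtask with finer checks
            active.extend({"name": n, "done": False} for n in split_rules[sub["name"]])
        else:
            active.append(sub)
    return active, completed

plan = [{"name": "find_murals", "done": True},
        {"name": "list_artists", "done": True},
        {"name": "verify_award", "done": False},
        {"name": "check_sources", "done": False}]
active, done = refine_plan(plan, {"verify_award": ["award_year", "award_category"]})
# 2 completed + 3 active (verify_award split in two) = 5 tracked units
```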
- Plan-Centered Context Management 🍞 Hook: Think of rewriting your to-do list neatly after it gets messy, keeping only what matters. 🥬 What: The agent periodically “refreshes” the conversation by summarizing the latest plan and dropping pre-plan clutter when nearing the context limit. How: On overflow, it creates a concise, up-to-date plan snapshot and restarts from that snapshot so the context stays within the 128K window. Why: Without this, long runs spill beyond the model’s memory, and critical info gets diluted. 🍞 Anchor: When the notes get too long, you keep the clean outline (what’s done, what’s next) and store away old scratch work.
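A sketch of the refresh trigger, assuming a crude 4-characters-per-token estimate (the real system summarizes with the model itself; `maybe_refresh_context` is an invented helper):

```python
def maybe_refresh_context(messages, plan_summary, budget_tokens,
                          count_tokens=lambda m: len(m) // 4):
    """When the transcript nears the context budget, restart from a clean
    snapshot: original instructions + current plan summary, dropping clutter."""
    used = sum(count_tokens(m) for m in messages)
    if used < budget_tokens:
        return messages                       # still within budget: keep everything
    system = messages[0]                      # always keep the original instructions
    return [system, f"PLAN SNAPSHOT: {plan_summary}"]
```

The design point is that the snapshot carries what's done, what's blocked, and what's next, so dropping old tool logs does not lose the thread of the task.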
- Tooling 🍞 Hook: Like having just two versatile kitchen tools you use skillfully. 🥬 What: Two core tools—web search (breadth) and page crawl (depth with a goal). How: First search to find candidate URLs; then crawl specific pages with a goal prompt so the summary targets exactly what the subtask needs. Why: Without precise goals, crawls drift and summaries become generic. 🍞 Anchor: Search for “City Z 1992 mural festival,” then crawl the festival’s page with the goal “extract the official artist list and dates.”
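The two-tool pattern looks roughly like this (toy stand-ins for the search and crawl tools; the names and signatures are assumptions, not the paper's interface):

```python
def search_then_crawl(query, goal, search, crawl, top_k=8):
    """Breadth first (search for candidate URLs), then depth (goal-directed
    crawl of each candidate so the summary targets the subtask's need)."""
    urls = search(query, top_k=top_k)
    return [crawl(url, goal=goal) for url in urls]

# Toy stand-ins for the two tools:
def fake_search(query, top_k):
    return [f"https://example.org/{i}" for i in range(top_k)]

def fake_crawl(url, goal):
    return {"url": url, "summary": f"extracted for goal: {goal}"}

notes = search_then_crawl("City Z 1992 mural festival",
                          "extract the official artist list and dates",
                          fake_search, fake_crawl, top_k=3)
```

Passing an explicit `goal` string to the crawl is what keeps each page summary on-target instead of generic.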
- Unified Data Synthesis Pipeline 🍞 Hook: Building a rich practice workbook with both quizzes and essays. 🥬 What: A pipeline that constructs both Deep Search (exact) and Deep Research (open-ended) tasks from graph-structured corpora. How: Build knowledge graphs from curated web content, extract subgraphs (control hops and branching), generate multi-hop questions, obfuscate leaks, and verify; for open-ended tasks, craft report prompts and filter trajectories by rules and a judge. Why: Without both types, the agent won’t learn the full range of skills. 🍞 Anchor: From a subgraph around a scientist, generate: (a) “Which country hosted the award they won in year X?” and (b) “Compare their early influences with peers and discuss impact.”
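The exact-answer side of the pipeline can be illustrated with a tiny knowledge graph: walk a chain of relations, then phrase the chain as one multi-hop question whose answer is the final node. An illustrative sketch with invented helper names:

```python
def multihop_question(graph, start, hops):
    """Follow one edge per hop through a (head, relation) -> tail graph,
    then phrase the chain as a single multi-hop question."""
    entity, clauses = start, []
    for relation in hops:
        entity = graph[(entity, relation)]  # follow one edge per hop
        clauses.append(relation.replace("_", " "))
    question = f"Starting from {start}, what do you reach via: {', then '.join(clauses)}?"
    return question, entity

graph = {("Marie Curie", "won_award"): "Nobel Prize in Physics 1903",
         ("Nobel Prize in Physics 1903", "hosted_in"): "Sweden"}
q, answer = multihop_question(graph, "Marie Curie", ["won_award", "hosted_in"])
# answer == "Sweden"; a later obfuscation step would hide the starting name
# so the question cannot be answered without actually searching.
```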
- Training: SFT then RL 🍞 Hook: First you learn from a teacher, then you practice for points. 🥬 What: Supervised Fine-Tuning (SFT) seeds stable, efficient habits; Reinforcement Learning (RL) rewards good outcomes and tool hygiene. How: SFT on curated, short, successful parallel trajectories; RL with an LLM-as-judge that gives a reward of 1 for correct final answers (and zero for format/tool errors). Why: Without SFT, behavior is shaky; without RL, it won’t optimize toward wins. 🍞 Anchor: Piano lessons (SFT) make your basics solid; recitals with judges (RL) push you to polish.
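The outcome reward described here can be sketched as follows. The real judge is an LLM prompt; this stand-in just string-matches, and all names are illustrative:

```python
def outcome_reward(final_answer, ground_truth, judge, format_ok, tools_ok):
    """Binary outcome reward in the spirit of the RL stage: 1 only if the
    run is well-formed AND the judge accepts the final answer."""
    if not (format_ok and tools_ok):
        return 0.0  # format or tool-call errors zero out the reward
    return 1.0 if judge(final_answer, ground_truth) else 0.0

# Trivial stand-in judge (the paper uses an LLM-as-judge prompt instead):
judge = lambda a, b: a.strip().lower() == b.strip().lower()
```

Gating the reward on format and tool hygiene is what teaches "tool discipline" alongside correctness, rather than rewarding lucky but malformed runs.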
Implementation Highlights (with concrete mini-examples):
- More tool calls per step: averaging about 3–4 calls to pack evidence early. Example: In step 5, do 2 searches (artists list; award list) and 1 crawl (official festival page) at once.
- Top-k breadth: returning more URLs per search widens candidate coverage. Example: top-k=8 yields 8 URL candidates per query, increasing chances of finding the official page quickly.
- Pass@1 and judging: a specialized judge prompt checks if the final answer matches ground truth or if a research report meets breadth, depth, and clarity.
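Why breadth helps can be seen with a toy independence model: if each returned URL contains the needed page with probability p, the chance that at least one of k results hits it is 1 − (1 − p)^k. This is an intuition-building assumption, not a claim from the paper:

```python
def hit_probability(p_relevant, top_k):
    # P(at least one of top_k results is relevant), assuming independence.
    return 1 - (1 - p_relevant) ** top_k

for k in (4, 8, 20):
    print(k, round(hit_probability(0.15, k), 3))
# With p = 0.15: k=4 -> 0.478, k=8 -> 0.728, k=20 -> 0.961
```

The curve rises steeply at first and then flattens, matching the observed pattern of real but diminishing returns as top-k grows.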
Secret Sauce: Instead of thinking longer, SMTL searches wider, then tidies its mind (the plan) often. That keeps it fast, focused, and adaptable.
04Experiments & Results
🍞 Hook: Imagine a science fair where teams are judged on speed, accuracy, and how well they explain things—not just on building the tallest tower.
🥬 The Test: The team evaluated SMTL on two families of tasks: (1) Deep Search—exact-answer problems like BrowseComp, GAIA, Xbench-DeepSearch, WebWalker-QA; and (2) Deep Research—open-ended report writing on DeepResearch Bench (RACE). Why these? Because they stress both precision and synthesis, and they’re long-horizon: you must find, verify, and combine evidence.
What They Measured and Why:
- Accuracy (Deep Search): pass@1 judged by an LLM prompt—did the final answer semantically match the ground truth?
- Report Quality (Deep Research): four dimensions—Comprehensiveness, Insight/Depth, Instruction Following, Readability—again via an LLM judge prompt, because there isn’t a single “right” essay.
- Efficiency: steps per task and tool calls per step—are we faster and denser? Fewer steps and more info per step signal efficiency.
- Latency: how quickly answers arrive—parallel execution should reduce wait time.
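The two efficiency metrics are simple aggregates over logged trajectories. A minimal sketch (the data shapes and names here are invented for illustration):

```python
def efficiency_stats(trajectories):
    """Mean steps per task and mean tool calls per step.
    Each trajectory is a list of steps; each step is a list of tool calls."""
    total_steps = sum(len(t) for t in trajectories)
    total_calls = sum(len(step) for t in trajectories for step in t)
    return {"steps_per_task": total_steps / len(trajectories),
            "calls_per_step": total_calls / total_steps}

runs = [[["search", "search", "crawl"], ["search", "crawl", "crawl", "search"]],
        [["search", "search", "search"]]]  # 3 steps total, 10 calls total
stats = efficiency_stats(runs)
# steps_per_task = 1.5, calls_per_step = 10/3
```

A parallel agent wants the first number low and the second high: fewer rounds, each packed with evidence.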
The Competition: Strong closed models with tools (Claude-4.5, GPT-5, Gemini-2.5), popular research systems (OpenAI DeepResearch, Gemini DeepResearch, Perplexity DR), and open-source agentic baselines (WebSailor-32B, WebShaper-32B, DeepMiner-32B-RL, AFM-32B-RL, Tongyi-DR-30B, MiroThinker-v1.0-30B).
The Scoreboard, in plain meaning:
- BrowseComp: SMTL-300 hits 48.6%. Think of this as getting the highest score in the class while also finishing the test faster. SMTL-100 already does well at 43.6%, and the jump to 48.6% with 300 steps shows extra exploration helps most on deep, tricky cases.
- Other Deep Search: results on GAIA, Xbench-DeepSearch, and WebWalker-QA at higher budgets are likewise strong across the board.
- Deep Research Bench (RACE): a strong overall score with balanced sub-scores across Comprehensiveness, Insight/Depth, Instruction Following, and Readability. This is competitive with or better than many 30B-scale systems, showing real generalization.
Efficiency Findings (why it’s like doing more with fewer moves):
- Steps: On BrowseComp, SMTL-100 solves tasks in roughly a third of the steps on average, while some baselines take 75–206 steps for similar or worse accuracy. That’s like finishing a maze in one-third the turns.
- Tool Calls per Step: ≈ 3.5 vs. ≈ 1 for sequential baselines. More parallel tabs open means richer info per round and fewer reformulations.
- Pareto Frontier: Across budgets (50→300 steps), SMTL sits at the best tradeoff curve—higher accuracy for a given number of steps.
Surprising/Useful Observations:
- More Steps Help Failures, Not Successes: Successful cases often finish well before the limit; failed cases tend to run to the max budget. So raising the step cap mainly rescues hard cases that needed just a bit more exploration to find the right path.
- Breadth Beats Depth for Scaling: Increasing search top-k (e.g., from 4→8→20) consistently improves scores, with diminishing but real returns. Packing more candidate URLs into each search action is more efficient than stretching longer chains of thought.
🍞 Anchor: It’s like a debate team that prepares by skimming many credible sources at once. They don’t talk more—they gather better notes faster, then present a crisp, solid argument. SMTL wins by widening the funnel, not by rambling longer.
05Discussion & Limitations
🍞 Hook: Think of a super-fast team that still needs a good internet connection and a neat playbook to shine.
🥬 Limitations (be specific):
- Tool Dependence: SMTL leans on web search and page crawling. If APIs throttle, pages block bots, or content is behind paywalls, performance drops.
- Context Budget Friction: Even with plan refreshes, extremely long tasks can still overflow the 128K context window. Rare edge cases may require more aggressive summarization.
- Judge Sensitivity: Using an LLM-as-a-judge can introduce bias or variance; misjudged answers affect both training (RL rewards) and evaluation.
- Domain Mismatch: The data pipeline is web-centric; tasks requiring private databases, specialized software, or non-text modalities may need new tools and training data.
- Latency Variability: Parallel calls speed things up overall, but slow external sites can bottleneck a whole round.
Required Resources:
- A capable 30B-scale backbone with long context support, plus stable access to search and crawling services.
- Computing for SFT and RL, including on-policy rollouts and judge evaluations.
When NOT to Use:
- Fully offline or air-gapped settings with no external information access.
- Tasks where a single authoritative database lookup trivially solves the problem—parallel search overhead isn’t worth it.
- Highly sensitive domains where LLM judges and open-web sources are not acceptable for compliance or safety.
Open Questions:
- Smarter Breadth Control: Can the agent learn to set top-k dynamically per query to balance recall and noise?
- Better Plan Compression: Can we compress plans to preserve verification-critical details while shrinking the token footprint further?
- Robustness to Noisy Pages: How can the agent auto-detect low-quality or misleading sources and prioritize verification?
- Reward Design: Beyond outcome-only rewards, can we add verifiable intermediate rewards (e.g., source credibility, cross-source agreement) without gaming the system?
🍞 Anchor: Like a great basketball team that thrives on fast breaks, SMTL shines with space and passes—but it still needs a reliable court, fair refs, and a smart playbook to win big.
06Conclusion & Future Work
Three-Sentence Summary:
- SMTL replaces slow, linear reasoning with a parallel plan that gathers more evidence per step and refreshes itself to stay within memory limits.
- A unified data pipeline produces both exact-answer and open-ended tasks, so one agent learns to handle many goals and evaluation rules.
- Trained with SFT and RL, SMTL reaches competitive or state-of-the-art results across diverse benchmarks while cutting steps and latency.
Main Achievement: Showing that widening search (parallel evidence acquisition) plus plan-centered context management yields a better efficiency–accuracy tradeoff than simply deepening chains of thought, and that the same approach generalizes from precise Q&A to open-ended research.
Future Directions:
- Learn dynamic breadth (adaptive top-k), richer plan compression, and stronger verification signals to tame noisy web content.
- Expand tools (e.g., structured databases, code execution, multimodal sources) and broaden the data pipeline beyond web text.
- Explore more principled rewards that capture fact-checking and cross-source consistency.
Why Remember This: When problems are long and messy, don’t just think harder in a straight line—search wider in parallel, keep a tidy, refreshed plan, and stitch the best evidence together. That mindset—Search More, Think Less—can guide the next generation of fast, reliable research agents.
Practical Applications
- Rapid fact-checking: Search multiple authoritative sources in parallel and verify claims before presenting a concise answer.
- Market research: Gather product specs, prices, and reviews simultaneously, then synthesize a buyer’s guide.
- Academic assistance: Collect papers, dates, and definitions at once, then create a short literature summary with citations.
- Travel planning: Check flight options, local events, and museum hours in parallel, then propose an optimized itinerary.
- Competitor analysis: Crawl company pages, news, and filings together, then compile a structured comparison.
- Policy brief creation: Aggregate sources on regulations across regions and synthesize a readable, cited brief.
- Incident response research: Parallel scan advisories, repos, and forums to summarize vulnerabilities and mitigations.
- Customer support knowledge building: Collect FAQs from product pages and forums to create a unified help article.
- Hiring research: Summarize candidate public portfolios and publications with side-by-side verification.
- Grant proposal prep: Gather prior work, datasets, and benchmarks to build a well-cited related work section.