DREAM: Deep Research Evaluation with Agentic Metrics
Key Summary
- Deep research agents write long reports, but older evaluations often judge only how smooth they sound and whether they add links, not whether the facts are true today or the logic really holds.
- The paper names this the Mirage of Synthesis: reports can look great on the surface while hiding factual mistakes, outdated info, or shaky reasoning.
- It introduces a four-part map of quality: Presentation Quality, Task Compliance, Analytical Depth, and Source Quality.
- The big gap is capability mismatch: static judges can't browse, check dates, or verify facts the way research agents can.
- DREAM fixes this by making the judge an agent too, using tools to search, verify, and think step by step (capability parity).
- DREAM builds a custom checklist for each question (adaptive metrics) and also uses universal checks (static metrics).
- Key-Information Coverage (KIC) and Reasoning Quality (RQ) are agent-built tests that catch missing updates and flawed logic.
- DREAM's factuality test checks truth against the live web, not just against the report's citations.
- Experiments show DREAM is much more sensitive to time decay and to hidden reasoning or factual errors than popular benchmarks.
- This gives a scalable, reference-free way to judge deep research that better matches real-world needs.
Why This Research Matters
Real decisions about health, finance, policy, and education depend on research that is current, correct, and logically sound. If evaluations reward only smooth writing and matching citations, people can be misled by confident but outdated or false reports. DREAM's agentic judging brings web tools and verification into the evaluation loop, catching time-sensitive changes and subtle reasoning errors. This reduces the risk of acting on misinformation, especially in fast-moving domains like law, markets, and technology. It also gives builders a clearer signal for improving their agents, accelerating practical progress. In short, better judging leads to safer, smarter uses of AI in daily life.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine your class writes science reports. Two students might write very different reports about the same topic, and both could still be great. So how do you grade fairly when there isn't just one correct answer?
🥬 The Concept (AI Agents): An AI agent is a computer helper that reads, searches, and writes to complete tasks on its own. How it works:
- It receives a goal (like "research TikTok's legal status in the U.S.").
- It plans steps: search, read, compare, summarize.
- It uses tools (like web search) and then produces a report.
Why it matters: If we don't evaluate well, we might trust reports that look good but aren't actually correct.
🍞 Anchor: Like a student doing a library project: the agent looks up books (web pages), takes notes, and writes a paper.
🍞 Hook: You know how grades need clear rules? In research, we also need clear ways to judge quality.
🥬 The Concept (Research Evaluation): Research evaluation is judging how good a report is across several dimensions, not just whether a single answer is right. How it works:
- Decide what "good" means (clarity, completeness, logic, truth, sources).
- Measure each part with specific checks.
- Combine the checks into a final score.
Why it matters: Without the right checks, we reward style over substance.
🍞 Anchor: Teachers use rubrics (clarity, evidence, organization) to grade essays; research needs rubrics too.
🍞 Hook: Picture a shiny poster that looks amazing but has the wrong facts. It fools you at first glance.
🥬 The Concept (Mirage of Synthesis): The Mirage of Synthesis is when a report seems excellent, with smooth writing and matching citations, but hides outdated info, wrong facts, or shaky logic. How it works:
- The report sounds fluent and authoritative.
- It includes citations that match the text.
- But the facts may be old, misinterpreted, or logically weak.
Why it matters: We can be tricked into trusting something that isn't true or current.
🍞 Anchor: A history project with perfect design but wrong dates still deserves a low score.
🍞 Hook: Think of four pillars holding up a building. If even one cracks, the building is unsafe.
🥬 The Concept (Four-Vertical Taxonomy): A taxonomy is a tidy way to group the pieces of report quality, here into four verticals. How it works:
- Presentation Quality: Is the writing clear and well-organized?
- Task Compliance: Did it follow the instructions and cover the needed topics?
- Analytical Depth: Is the reasoning strong and insightful?
- Source Quality: Are sources used properly, faithfully, and credibly, and are claims actually true?
Why it matters: Checking all four catches both surface polish and deeper truth.
🍞 Anchor: Like grading an essay for writing, following the prompt, strong arguments, and trustworthy references.
🍞 Hook: Imagine a ref who can't see in the dark trying to judge a night game.
🥬 The Concept (Capability Mismatch): Capability mismatch is when the evaluator can't do what the research agent can do, like browse, update facts, and verify claims. How it works:
- Static judges only read the report.
- They don't search the web or check dates.
- They miss time-sensitive mistakes and ungrounded logic.
Why it matters: Reports that "sound right" slip by even when they're wrong.
🍞 Anchor: A spelling bee judge who can't hear letters clearly will score unfairly.
🍞 Hook: When both teams use the same rules and gear, the game is fair.
🥬 The Concept (Capability Parity): Capability parity means the evaluator should have abilities similar to the agent it judges, including tool use and reasoning. How it works:
- Equip the evaluator with web search and other tools.
- Let it check facts and timelines directly.
- Have it probe the logic, not just the surface.
Why it matters: Fair judging needs equal power to verify.
🍞 Anchor: A math contest grader who can also use a calculator checks answers more accurately.
Before this paper, most evaluation was static: LLM judges or citation alignment workflows. These caught fluency and whether a citation text matched the claim. But they missed two big issues: time decay (old facts) and external truth (a claim can match its citation but still be false today). That creates the Mirage of Synthesis. The missing piece was an evaluator that can actively research like the agent.
This paper fills that gap with DREAM, which makes the evaluation itself agentic. It uses universal checks for writing and source behavior and builds a custom, up-to-date checklist and reasoning probes for each query. The stakes are real: students, journalists, doctors, and business analysts rely on reports that must be not just pretty, but correct now and logically sound. If evaluation can't tell the difference, people make decisions on shaky ground.
02 Core Idea
🍞 Hook: Imagine grading a current events report without the internet. How would you know if yesterday's facts changed today?
🥬 The Concept (DREAM): DREAM turns the evaluator into an active research agent so it can verify facts, dates, and reasoning, just like the report-writing agent. How it works:
- Build a custom evaluation protocol for the question: some checks are universal (static), others are freshly built for this query (adaptive).
- Use the right "judge" for each check: a plain LLM when reading is enough, an agent with tools when you must verify, and a workflow for citations and domain credibility.
- Score writing, factuality, coverage of key info, reasoning quality, citation integrity, and source authority.
Why it matters: Without an agentic evaluator, we keep rewarding well-written but wrong or outdated reports.
🍞 Anchor: Like a science fair judge who brings a thermometer and scale to re-check claims instead of only reading the poster.
Three analogies for the same idea:
- Detective vs. Desk Clerk: A desk clerk reads and stamps forms; a detective leaves the desk, gathers evidence, and cross-checks alibis. DREAM is the detective.
- Fresh Groceries: Old rubrics are like a shopping list from last month. DREAM re-checks what's fresh today before deciding if your meal plan (report) is good.
- Ref with Replay: A ref using instant replay (web tools) sees what really happened, not just what it looked like at full speed. DREAM uses replay for facts and logic.
Before vs. After:
- Before: Judges rewarded fluency and citation text match. Time decay and extrinsic truth slipped by.
- After: Judges test time-sensitive facts (KIC), probe reasoning steps (RQ), and check truth beyond the given citations (Factuality), while still grading writing and citation behavior.
Why it works (intuition):
- If the task itself needs web search, then the judge must also search. That's capability parity.
- Turning "coverage" into yes/no questions from up-to-date sources (KIC) forces reports to mention what matters now.
- Turning "reasoning" into step-checked plans (RQ) exposes circular logic and missing support.
- Checking truth against the live web (Factuality) defeats the trap where a wrong claim cites a matching but outdated page.
🍞 Hook: You know how a good game plan breaks a big goal into clear plays?
🥬 The Concept (Building Blocks): DREAM is built from smaller pieces that each guard a failure point. How it works:
- Static Metrics: Writing Quality (readability/organization), Factuality (truth vs. the world), Citation Integrity (attribution + faithfulness), Domain Authoritativeness (source reputation).
- Adaptive Metrics: Key-Information Coverage (KIC) builds a fresh checklist of must-mention facts; Reasoning Quality (RQ) builds challenge questions plus a verification plan.
- Evaluators: LLM Evaluator (no tools) for reading judgments; Agent Evaluator (with tools) for RQ; Workflow Evaluator for citation checks and domain credibility.
Why it matters: Each block fixes a blind spot; together they see what old benchmarks missed.
🍞 Anchor: Like checking a bike: pump the tires (writing), test the brakes (reasoning), tighten the chain (citations), and confirm the helmet is certified (domain authority).
03 Methodology
At a high level: Research Query → Phase 1: Protocol Creation (Static + Adaptive metrics) → Phase 2: Protocol Execution (route each metric to the right evaluator) → Final Scores.
Prerequisite concepts with sandwich explanations:
🍞 Hook: Think of a standard set of classroom rules used for every assignment.
🥬 The Concept (Static Metrics): Static metrics are universal checks that apply to any report. How it works: They include Writing Quality, Factuality (truth vs. world), Citation Integrity (attribution + faithfulness), and Domain Authoritativeness (source reputation).
Why it matters: Without them, we'd miss basic presentation and source hygiene.
🍞 Anchor: Every essay needs clear writing and reliable references, no matter the topic.
🍞 Hook: For a field trip, you bring a custom checklist (raincoat if it might rain) instead of the same list every time.
🥬 The Concept (Adaptive Metrics): Adaptive metrics are built fresh for the exact question, capturing time-sensitive and topic-specific expectations. How it works: The agent researches the query, converts key facts into yes/no checks (KIC), and designs hard reasoning probes plus a verification plan (RQ).
Why it matters: Without adaptation, rubrics stay blind to what changed this week.
🍞 Anchor: Packing sunscreen for a sunny trip and an umbrella for a rainy one: customized prep wins.
Phase 1: Protocol Creation (how the recipe starts)
- Static Metrics are set up:
  - Writing Quality (WQ): A fixed rubric scores clarity of ideas and content, organization, and sentence fluency.
  - Factuality: Extract top claims, create neutralized web queries (to avoid confirmation bias), collect supporting and opposing evidence, then judge each claim as supported, partial, contradicted, or unverifiable.
  - Citation Integrity (CI):
    - Claim Attribution (CA): What fraction of verifiable claims have explicit citations?
    - Citation Faithfulness (CF): For cited claims, does the source text really support the claim?
    - Combine CA and CF so a report must both cite and cite correctly.
  - Domain Authoritativeness (DA): Check whether cited domains are reputable (e.g., government, academic, established news) and average their credibility.
What breaks without each step:
- Without WQ: Good content gets lost in messy writing.
- Without Factuality: Plausible but false claims sneak through.
- Without CI: Either no citations (opaque) or lots of wrong citations (misleading) go unpunished.
- Without DA: Shaky blogs count the same as gold-standard institutions.
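The CA/CF combination from the Citation Integrity description above can be sketched in a few lines. The paper only says the two are combined so that an agent must both cite and cite correctly; the multiplicative form below is an assumed choice for illustration, not the paper's formula:

```python
def citation_integrity(claims: list[dict]) -> float:
    """Sketch of a CI score over claims shaped like
    {'cited': bool, 'faithful': bool}.

    CA = fraction of verifiable claims with citations.
    CF = fraction of cited claims whose source supports them.
    Combining as CA * CF (an assumption) means a report scores
    high only if it both cites and cites correctly."""
    if not claims:
        return 0.0
    cited = [c for c in claims if c["cited"]]
    ca = len(cited) / len(claims)
    if not cited:
        return 0.0  # nothing attributed at all
    cf = sum(c["faithful"] for c in cited) / len(cited)
    return ca * cf
```

With this combination, a report that cites half its claims and is faithful in only half of those citations lands at 0.25 rather than 0.5, which matches the "both cite and cite correctly" requirement in the text.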
- Adaptive Metrics are created by a tool-using agent:
  - Key-Information Coverage (KIC):
    - The agent searches up-to-date sources.
    - It identifies essential, time-sensitive facts and turns each into a yes/no checklist item.
    - This converts "did you cover what matters now?" into a grounded test.
  - Reasoning Quality (RQ):
    - The agent writes challenging, query-specific questions.
    - It prepares a validation plan: extract the report's reasoning chain, check external sources, and compare carefully.
    - The plan ensures we evaluate the logic itself, not just its surface.
Example with real data (TikTok legal status): KIC asks, "Does the report mention the current divestiture deadline (Jan 23, 2026)?" If the report uses old dates, it fails that checklist item.
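To make the checklist idea concrete, here is a minimal sketch of a KIC-style yes/no check over report text. DREAM has an LLM judge coverage; the regex matching, item structure, and sample report snippets below are simplifying assumptions for illustration only:

```python
import re

# Hypothetical KIC checklist: each item pairs a yes/no question with
# simple text patterns as stand-ins for an LLM coverage judgment.
checklist = [
    {
        "question": "Does the report mention the current "
                    "divestiture deadline (Jan 23, 2026)?",
        "patterns": [r"jan(?:uary)?\.?\s*23,?\s*2026"],
    },
]

def kic_score(report_text: str, items: list[dict]) -> float:
    """Fraction of checklist items the report covers (0 to 1)."""
    text = report_text.lower()
    covered = sum(
        any(re.search(p, text) for p in item["patterns"])
        for item in items
    )
    return covered / len(items) if items else 0.0

# Illustrative snippets: a stale report fails the item, a fresh one passes.
stale = "TikTok faced a divestiture deadline of January 19, 2025."
fresh = "The divestiture deadline is now January 23, 2026."
```

The point of the sketch is the shape of the metric: each "must-mention" fact becomes a binary check, so an outdated report loses points in a way a fluency-only judge never detects.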
Phase 2: Protocol Execution (running the tests)
🍞 Hook: Different tools for different jobs: use a ruler to measure, a scale to weigh.
🥬 The Concept (Evaluator Routing): DREAM sends each metric to the simplest evaluator that has the needed capability. How it works:
- LLM Evaluator (no tools): Grades Writing Quality and checks KIC presence in the text.
- Agent Evaluator (with tools): Executes RQ validation plans, retrieving fresh evidence to test logic.
- Workflow Evaluator: Runs pipelines for Factuality, Citation Integrity, and Domain Authoritativeness.
Why it matters: Using the right tool keeps judging accurate and efficient.
🍞 Anchor: You don't use scissors to hammer a nail; you pick the proper tool.
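The routing idea can be sketched as a dispatch table. The three evaluator functions below are placeholders (assumptions); in DREAM they would be an LLM call, a tool-using agent run, and a deterministic pipeline, respectively:

```python
# Placeholder evaluators standing in for an LLM call, a tool-using
# agent run, and a deterministic workflow pipeline.
def llm_evaluator(metric, report):      return f"LLM graded {metric}"
def agent_evaluator(metric, report):    return f"agent verified {metric}"
def workflow_evaluator(metric, report): return f"workflow ran {metric}"

# Route each metric to the simplest evaluator with the needed capability.
ROUTES = {
    "writing_quality": llm_evaluator,        # reading is enough
    "kic": llm_evaluator,                    # checklist presence in text
    "reasoning_quality": agent_evaluator,    # needs live retrieval
    "factuality": workflow_evaluator,        # claim-verification pipeline
    "citation_integrity": workflow_evaluator,
    "domain_authoritativeness": workflow_evaluator,
}

def evaluate(report: str, metrics: list[str]) -> dict:
    """Run each requested metric through its routed evaluator."""
    return {m: ROUTES[m](m, report) for m in metrics}
```

The design choice the table encodes is the one stated in the text: never spend agent-with-tools effort where a plain read suffices, and never trust a plain read where verification is needed.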
Detailed, step-by-step "recipes"
- Writing Quality: Read the report; score ideas/content, organization, sentence fluency using a fixed rubric; average the parts.
- Factuality:
- Extract 30 salient claims.
- Write neutral search queries (e.g., "current divestiture deadline for TikTok" instead of copying the claim).
- Retrieve multiple sources; pull both supporting and opposing snippets.
- Judge each claim (supported/partial/contradicted/unverifiable) and average across claims.
- Citation Integrity:
- Extract verifiable claims and their citations.
- Compute Claim Attribution: fraction with citations.
- For cited ones, test source text vs. claim for faithfulness.
- Combine attribution and faithfulness so agents must both cite and cite correctly.
- Domain Authoritativeness:
- Collect unique domains from citations.
- Judge each domainās reputation with a rubric (e.g., academic/government high; personal blogs low).
- Average to get a task score.
- KIC (Adaptive):
- Agent searches the live web.
- Extracts essential current facts.
- Writes yes/no checklist items grounded in those facts.
- LLM checks if the report covers each item.
- RQ (Adaptive):
- Agent drafts challenging questions.
- Builds a validation plan: extract reasoning chain, gather external evidence, and compare.
- Executes the plan with tools and deducts points for logical gaps.
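The averaging steps in the Factuality and Domain Authoritativeness recipes above might look like this. The numeric verdict weights and credibility tiers are illustrative assumptions; the paper defines the four verdict labels and the rubric categories but the exact values here are made up:

```python
# Assumed weights for the four verdict labels from the Factuality recipe.
VERDICT_WEIGHT = {
    "supported": 1.0,
    "partial": 0.5,
    "contradicted": 0.0,
    "unverifiable": 0.0,
}

def factuality_score(verdicts: list[str]) -> float:
    """Average verdict weight over the extracted claims (~30 per report)."""
    if not verdicts:
        return 0.0
    return sum(VERDICT_WEIGHT[v] for v in verdicts) / len(verdicts)

# Assumed credibility rubric per domain category (academic/government
# high, personal blogs low, unknown categories mid).
DOMAIN_CREDIBILITY = {
    "government": 1.0,
    "academic": 1.0,
    "established_news": 0.8,
    "personal_blog": 0.2,
}

def domain_authoritativeness(domains: list[str]) -> float:
    """Average credibility across the unique cited domains."""
    unique = set(domains)
    if not unique:
        return 0.0
    return sum(DOMAIN_CREDIBILITY.get(d, 0.5) for d in unique) / len(unique)
```

Note that deduplicating domains before averaging (as the recipe's "collect unique domains" step suggests) means citing one strong source many times does not inflate the DA score.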
The secret sauce
- Convert vague goals ("be comprehensive") into concrete, verifiable items (KIC checklist) built from today's facts.
- Evaluate logic by plan-and-verify steps (RQ), not just by how convincing the prose sounds.
- Separate "does the source text match?" from "is the claim actually true in the world?" so faithful-but-false claims are caught.
Concrete anchors
- If a report on TikTok omits the Jan 23, 2026 deadline, KIC flags it.
- If a report argues "A causes B" but sources don't support the link, RQ downgrades it.
- If a report cites a neat-looking blog for a medical claim, DA lowers the score even if the wording matches.
04 Experiments & Results
🍞 Hook: If a smoke alarm can't detect smoke, it's not much of an alarm. We tested whether DREAM actually "smells" the hidden problems old tests miss.
- The Test: What and why we measured
- We checked if the agent-built protocols (KIC and RQ) are clear, relevant, and verifiable to humans.
- We tested time awareness: does KIC penalize outdated reports?
- We tested reasoning detection: does RQ catch subtle logic errors hidden in fluent writing?
- We tested truth beyond citations: can Factuality catch well-cited but false claims?
- We checked whether writing can be graded reference-free yet align with human preference.
- The Competition: We compared DREAM to popular benchmarks like DeepResearch Bench (RACE) and citation alignment (FACT).
- The Scoreboard (with context)
- Human study on protocols: Annotators gave the full agent-with-retrieval version top scores (about 0.92-0.93 on a 0-1 scale) for relevance, clarity, verifiability, and validation quality. That's like getting an A when the simpler LLM baseline lands in the C-to-B range.
- Temporal awareness: On 20 time-volatile queries, DREAM-KIC dropped sharply as info got older (e.g., from about 79 for current info to about 22 for Jan 2024), clearly signaling staleness. DRB-RACE barely moved, like a thermometer stuck near room temperature.
- Reasoning flaws: We created pairs of reports where one had injected logical errors but kept great writing. RACE's scores barely changed (~9% drop), sometimes even praising the flawed one. DREAM-RQ cut scores by ~40% on average, like telling an A essay from a D when the reasoning breaks.
- Factuality vs. citation: We fed pairs with correct claims and plausible-but-false claims that still matched their citations. The citation-alignment metric stayed high (fooled by matching text). DREAM-Factuality fell steadily as we added more false claims, tracking the true error rate. That's exactly what a real fact-checker should do.
- Writing Quality alignment: DREAM's reference-free writing score correlated well (Kendall's τ ≈ 0.6) with DRB's human-validated rankings, solidly within typical human agreement.
- Surprising findings
- Fluent, well-structured reports can trick static judges into high scores even when important facts are missing or old.
- Citation behavior showed two opposite failure modes in open-source agents: (a) cite often but unfaithfully, or (b) cite rarely, leaving claims unattributed. Both are risky.
- Benchmarking three open agents (LangChain Open DR, Smolagents Open DR, Tongyi Deep Research)
- All struggled with Citation Integrity for different reasons (low attribution or low faithfulness).
- Smolagents often led on Writing Quality, Factuality, KIC, and RQ, yet still had near-zero citation discipline on some datasets.
- Relative rankings stayed stable across different judge backbones (Claude, DeepSeek, Kimi), showing robustness.
Bottom line: DREAM is much more sensitive to what truly matters (time, truth, and reasoning) while staying practical and scalable, without needing a gold "answer" report.
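As a side note on the Writing Quality alignment result, Kendall's τ can be computed with a small helper like the one below. The two score lists are made-up illustrations, not the paper's data:

```python
from itertools import combinations

def kendall_tau(a: list[float], b: list[float]) -> float:
    """Kendall's tau-a over paired scores (no tie correction):
    (concordant pairs - discordant pairs) / total pairs."""
    pairs = list(combinations(range(len(a)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)

# Hypothetical scores for five reports from two judges: a reference-free
# judge and a human-validated ranking.
judge_scores = [0.9, 0.7, 0.8, 0.4, 0.6]
human_scores = [0.95, 0.65, 0.85, 0.30, 0.70]
```

A τ of 1.0 means the two judges order every pair of reports identically; values around 0.6, as reported above, indicate the orderings mostly agree with a minority of swapped pairs.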
05 Discussion & Limitations
🍞 Hook: Even the best toolbox has limits; you still need power and materials to build the house.
- Limitations (be specific)
- Tool dependency: If search engines or APIs are slow, down, or biased, agentic checks can suffer.
- Cost and latency: Multi-step verification takes more compute and time than a single static judge call.
- Scope: DREAM grades final outputs, not the agent's research journey (like search efficiency or source discovery skill).
- Required resources
- Reliable web search and content retrieval (news, papers, docs).
- An LLM backbone for judging and an agent framework for tool use.
- Budget and time for multi-step evaluations, plus caching to save repeated checks.
- When NOT to use DREAM
- Fully closed-book tasks where truth is defined by a fixed, static dataset (no live updates needed).
- Extremely time- or cost-constrained settings where only a quick, shallow signal is acceptable.
- Highly subjective evaluations (e.g., pure creative style) where external truth checking is irrelevant.
- Open questions
- Process evaluation: How to score the research path (good source discovery, efficient browsing) in addition to the final product?
- Optimization: Can we smartly skip or cache steps to cut cost while keeping sensitivity high?
- Adversarial robustness: How to resist coordinated misinformation or SEO spam targeting evaluators?
- Domain adaptation: How to tune DA (domain credibility) across specialized fields without penalizing niche but authoritative sources?
🍞 Anchor: Think of DREAM as a high-precision lab test: more accurate than a quick strip test, but it needs equipment, time, and careful handling.
06 Conclusion & Future Work
- Three-sentence summary: Deep research reports can look excellent while hiding outdated facts or weak logic (the Mirage of Synthesis). DREAM fixes this by giving the evaluator agent-like powers (capability parity) to search, verify, and probe reasoning using both static and adaptive metrics. Experiments show DREAM is far more sensitive to time decay, factual errors, and reasoning flaws than popular static benchmarks.
- Main achievement: Turning evaluation itself into an agentic, tool-using process that builds query-specific checks (KIC, RQ) and verifies truth beyond citation alignment.
- Future directions: Add process-level scoring for research trajectories, reduce cost with smart caching and selective checks, and harden against adversarial misinformation. Also refine domain authority scoring for specialized niches.
- Why remember this: As AI agents research on the live web, judging them with frozen rubrics is like grading current events with last year's newspaper. DREAM updates the judge so it stays fair, factual, and future-ready.
Practical Applications
- Newsrooms: Vet AI-generated briefings against the live web to avoid publishing outdated or false claims.
- Healthcare content: Check medical claims beyond citations to ensure they align with current guidelines.
- Corporate research: Validate market analyses for up-to-date figures and sound causal reasoning before decisions.
- Education: Grade student reports with adaptive checklists that reflect the latest developments.
- Policy analysis: Ensure legal timelines, rulings, and stakeholder positions are current and accurately represented.
- Product documentation: Confirm technical guides cite authoritative sources and present correct, current info.
- Search/RAG systems: Benchmark retrieval and synthesis quality with KIC and RQ to guide model improvements.
- Compliance teams: Monitor regulation changes with time-aware coverage checks to prevent risk from stale info.
- Agent development: Use DREAM scores to pinpoint weaknesses (e.g., citation faithfulness vs. attribution).
- Fact-checking platforms: Automate reference-free truth verification that catches well-cited falsehoods.