
CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Beginner
Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang et al. Ā· 2/26/2026
arXiv

Key Summary

  • The paper tackles a new integrity problem in science: large language models sometimes invent realistic-looking citations that do not exist.
  • It introduces CiteAudit, the first open, large benchmark and a multi-agent system to check whether each citation truly matches a real paper.
  • The system splits the job into clear roles—extracting citation text, searching the web, checking Google Scholar, matching passages, reasoning in context, and making a careful final judgment.
  • CiteAudit’s dataset mixes human-validated real citations with controlled, fake ones that mimic real-world mistakes in titles, authors, venues, years, and identifiers.
  • Across both a generated test set (3,586 real, 2,500 fake) and a real-world set (2,889 real, 467 fake), the framework beats strong baselines in accuracy, precision, recall, and F1.
  • Ablation studies show each agent matters: removing the Scholar Agent sharply hurts recall, replacing the LLM Judge with simple string-matching code causes many false alarms, and skipping web search makes the system much slower.
  • The method is efficient and auditable: it uses an SOP (a standard operating procedure: a strict recipe) to route tasks, caches verified results, deep-crawls sources, and requires strict field-by-field matches.
  • This work gives researchers, reviewers, and publishers practical tools to spot fake or mismatched references before publication.
  • By standardizing evaluation and providing transparent evidence, CiteAudit helps restore trust in scientific references in the LLM era.

Why This Research Matters

Citations are how science shows its homework, so verifying them protects the foundation of research. As LLMs become common writing partners, they can accidentally invent realistic references that mislead reviewers and readers. CiteAudit gives journals, conferences, and authors a way to catch these problems early, with transparent evidence and explanations. It can save reviewers hours, reduce retractions, and improve the quality of literature students and practitioners rely on. In medicine and policy, catching fake or mismatched references can prevent real-world harm. Over time, shared benchmarks and SOPs help the community improve tools in a fair and open way.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: You know how a teacher asks you to show your work in math, so they can see where your answer came from? In science, citations are how researchers show their work.

🄬 The Concept: A scientific citation is a pointer that tells readers exactly which earlier paper supports a new claim. How it works:

  1. An author makes a claim.
  2. They add a citation that should point to a real, matching paper.
  3. Readers can look up the citation to check the evidence. Why it matters: Without correct citations, no one can tell if a claim is backed by real evidence, and trust in science breaks. šŸž Anchor: If a paper says ā€œA vaccine is 95% effectiveā€ and cites a study, that study must actually exist and report that result.

šŸž Hook: Imagine a very smart robot helper that can write essays fast, but sometimes it confidently makes up things that sound real.

🄬 The Concept: A Large Language Model (LLM) is a computer program trained to predict and generate text. How it works:

  1. It reads huge amounts of text.
  2. It learns patterns of which words often come next.
  3. It uses those patterns to answer questions or write passages. Why it matters: Because it predicts what ā€œsounds right,ā€ it can sometimes invent realistic but false details—like fake citations. šŸž Anchor: If you ask an LLM for sources and it invents a book that no library has, that’s a problem.

šŸž Hook: Imagine following a treasure map that looks perfect—but the island doesn’t exist.

🄬 The Concept: Citation hallucination is when a reference looks real (title, authors, venue), but no such paper exists or its details don’t match. How it works:

  1. The text contains a normal-looking citation.
  2. You try to find it in trusted databases.
  3. Either it doesn’t exist, or key details (title, authors, venue, year, DOI) disagree with the record. Why it matters: It breaks the evidence chain and can fool authors, reviewers, and readers. šŸž Anchor: A paper cites ā€œSmith & Lee, 2022, Journal of AI Safetyā€ with a cool title, but Google Scholar can’t find it and the DOI is fake.

The world before: For a long time, checking citations meant humans skimming reference lists and spot-checking a few entries. That worked when papers had shorter bibliographies and tools were simpler. Automated checkers mostly matched strings (like titles) to databases. If the title was clean and exact, they did fine.

The problem: As LLMs entered writing and reviewing, a new risk appeared. LLMs can produce well-formed but fabricated references—complete with polished titles, author names, and realistic venues. Meanwhile, reference lists keep getting longer across fields, so manual checking doesn’t scale. Reviewers and editors face time pressure; a single fake citation can slip through, weakening arguments and trust.

Failed attempts: Early tools relied on exact or fuzzy matching of text fields (e.g., title similarity). But real-world references are noisy: name abbreviations, paraphrased titles, venue name variants, preprint vs. conference versions, or small typos. Fuzzy matching sometimes misses real citations (too strict) or accepts fakes (too loose). Many tools are also closed-source, so their true accuracy and methods are hard to judge.

The gap: The community lacked (1) a robust, open benchmark with both carefully crafted fake citations and real-world messy ones; and (2) a transparent, multi-step verification process that doesn’t just compare strings but also retrieves evidence, reasons about context, and makes calibrated, explainable decisions.

Real stakes: This matters to everyone who relies on science—doctors reading studies, policymakers shaping rules, engineers building systems, and students learning from papers. If LLM-generated hallucinated citations enter the literature, reviewers may be misled, co-authors could be blamed for errors they didn’t catch, and readers may repeat false claims. Fixing citations strengthens the whole chain of knowledge.

This paper’s response: The authors introduce CiteAudit, both a benchmark and a multi-agent system that treats citation-checking like a team sport: one agent extracts the citation cleanly from PDFs, another searches the web, another consults scholarly databases, another matches fields strictly, and a judge agent makes the final, explainable call. The benchmark includes human-validated real and fake entries across many error types (title, authors, venue, year, identifiers), and the evaluation protocol scores both decision quality and evidence alignment. Together, they provide the missing foundation for testing and improving citation verification at scale.

02Core Idea

šŸž Hook: Imagine a relay team where each runner does one job really well, passes the baton cleanly, and the team wins together.

🄬 The Concept: The key idea is to split citation checking into a team of specialized agents that extract, retrieve, match, reason, and judge under a strict playbook. How it works:

  1. Extract the exact citation from the PDF (no distortions).
  2. Search memory and the web to gather real evidence.
  3. Strictly match fields (title, authors, venue, year/ID) to the evidence.
  4. Use context-aware reasoning to handle real-world noise.
  5. Make a calibrated final verdict with an explanation. Why it matters: One big model acting alone tends to be brittle or opaque. A coordinated team with rules is more accurate, faster, and explainable. šŸž Anchor: It’s like airport security: ID check, baggage scan, manual inspection if needed, then a final clear-or-stop decision with reasons recorded.
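The five-step loop above can be sketched as a tiny pipeline. Everything below is a hypothetical stand-in (toy field parsing, a hard-coded "canonical record" table), not the paper's implementation; it only illustrates how extract → retrieve → match/judge compose into one auditable verdict with a reason attached.

```python
# Illustrative sketch of the five-stage pipeline; all stages are toy stand-ins.

def extract(raw: str) -> dict:
    # Stage 1: pretend the reference was already parsed into clean fields.
    title, authors, year = raw.split(" | ")
    return {"title": title, "authors": authors.split(", "), "year": int(year)}

def retrieve(fields: dict) -> dict:
    # Stage 2: stand-in for memory/web lookup; returns a canonical record.
    KNOWN = {"deep residual learning": {"authors": ["He", "Zhang", "Ren", "Sun"],
                                        "year": 2016}}
    return KNOWN.get(fields["title"].lower(), {})

def judge(fields: dict, record: dict) -> tuple:
    # Stages 3-5: strict matching plus a short, auditable reason.
    if not record:
        return False, "no canonical record found"
    if set(fields["authors"]) != set(record["authors"]):
        return False, "author list differs"
    if abs(fields["year"] - record["year"]) > 1:
        return False, "year mismatch beyond tolerance"
    return True, "all fields match"

def verify(raw: str) -> tuple:
    fields = extract(raw)
    return judge(fields, retrieve(fields))

print(verify("Deep Residual Learning | He, Zhang, Ren, Sun | 2016"))
# (True, 'all fields match')
print(verify("Deep Residual Learning | He, Zhang | 2016"))
# (False, 'author list differs')
```

Note how every failure path returns a reason string: that is the "calibrated final verdict with an explanation" in miniature.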

Aha! moment in one sentence: Treat citation verification as a multi-agent, evidence-grounded pipeline with strict rules and a shared benchmark, instead of as a single fuzzy text-matching task.

Explain it three ways:

  1. Detective squad: One detective finds the suspect’s info (extractor), another searches archives and CCTV (web/scholar), a third compares fingerprints (matcher), the analyst checks the story (reasoner), and the judge decides (judge agent).
  2. Assembly line: Each station adds one precise check—scan, fetch, align, explain—so defects (fake citations) can’t slip through.
  3. Science fair rubric: Everyone uses the same checklist to grade projects, so scores are fair, comparable, and reproducible.

Before vs. after:

  • Before: Tools mostly matched strings and struggled with typos, paraphrases, and mixed versions. Evaluation was scattered or closed.
  • After: A strict SOP coordinates multiple agents that gather real evidence, align fields exactly, and explain decisions. A shared, human-validated benchmark fairly compares methods across realistic error types.

Why it works (intuition, not equations):

  • Decomposition reduces errors: simpler steps are easier to get right and to audit.
  • Evidence grounding beats guessing: pulling real pages and canonical records prevents the model from trusting plausibility alone.
  • Strict-but-smart matching: titles and authors must match fully, with sensible venue/year rules, so fake lookalikes don’t pass.
  • Cascading search saves time: quick memory hits are instant; most cases resolve via web evidence; only tough ones reach slow, authoritative checks.

Building blocks (as mini-sandwiches):

šŸž Hook: You know how copying from a blurry photo can change a number? That’s risky. 🄬 The Concept: Extractor Agent is the tool that converts messy PDF references into clean, structured fields (title, authors, venue, year/URL) with no changes. How it works: 1) OCR/vision reads the page. 2) It finds the References section. 3) It fills a JSON schema exactly. Why it matters: If extraction is wrong, every later step compares against the wrong thing. šŸž Anchor: The agent turns ā€œZhang, Y. et al., 2022, arXiv:2205.12345ā€ into precise fields the rest can trust.

šŸž Hook: Think of a library card catalog that remembers what was already checked. 🄬 The Concept: Memory Agent is a fast cache that returns a previous verified result if the same citation appears again. How it works: 1) Embed the citation. 2) Compare to stored embeddings. 3) If it’s a strong match, reuse the verdict. Why it matters: Saves time and money on repeated checks. šŸž Anchor: If five papers cite the same classic study, it verifies once, then fast-passes the rest.

šŸž Hook: When you’re not sure, you google it and open the actual pages. 🄬 The Concept: Web Search Agent fetches real web pages about the citation and deep-crawls content, not just snippets. How it works: 1) Query. 2) Download top pages. 3) Extract relevant text. Why it matters: Snippets can mislead; full pages carry the facts. šŸž Anchor: It opens the author’s homepage, arXiv page, and publisher record to compare details.

šŸž Hook: If a school rule says your name on the test must match the roster exactly, you can’t fudge it. 🄬 The Concept: Strict Consistency Criterion is the rule that title and authors must match completely, with sensible venue/year rules. How it works: 1) Check exact title (ignoring case and tiny words). 2) Check all authors appear. 3) Apply venue/year rules (e.g., arXiv-to-conference allowed). Why it matters: Prevents near-miss fakes from passing. šŸž Anchor: If the title swaps two key terms or an author is missing, it fails.

šŸž Hook: A judge listens to evidence and explains the verdict. 🄬 The Concept: Judge Agent makes the final decision by applying the strict rules to the collected evidence and writing a brief reason. How it works: 1) Read citation fields. 2) Compare to web/scholar content. 3) Output true/false plus a note. Why it matters: Clear, auditable outcomes. šŸž Anchor: ā€œMatch: false; reason: author list differs; ground-truth shows three authors, citation lists two.ā€

šŸž Hook: If you’re unsure after googling, you check the official school records. 🄬 The Concept: Scholar Agent queries authoritative databases (like Google Scholar) to fetch canonical metadata when needed. How it works: 1) Low-frequency, precise crawl. 2) Retrieve the official entry. 3) Provide ground truth. Why it matters: Stops ā€œplausible on the webā€ fakes. šŸž Anchor: A citation that seems real on blogs is rejected after Scholar shows no such paper exists.

šŸž Hook: Recipes help everyone cook the same dish. 🄬 The Concept: Unified Evaluation Protocols are shared rules and metrics to score systems fairly. How it works: 1) Use a common dataset. 2) Measure accuracy, precision, recall, F1, and evidence alignment. 3) Report consistently. Why it matters: Apples-to-apples comparisons speed progress. šŸž Anchor: Two tools judged on the same fake-title set can be fairly compared.

Put together, these pieces turn citation checking from guesswork into an evidence-backed, explainable pipeline grounded in a public benchmark.

03Methodology

At a high level: PDF with references → Extract structured citation fields → Fast memory check → Web search and deep-crawl evidence → Strict matching and judgment → If needed, authoritative Scholar check → Final verdict with explanation and cache update.

Stage-by-stage (with purpose, what happens, and what breaks without it):

  1. Citation Metadata Extraction (Extractor Agent)
  • What happens: A vision-enabled model reads the PDF, locates the References section, and turns each citation into a JSON record: title, authors, venue, year, URL/ID.
  • Why this exists: Downstream checks must compare clean fields. If the extractor guesses or paraphrases, later matching will be wrong.
  • Example: From ā€œY. Zhang, M. Li, X. Chen. Automated contract clause generation using pre-trained language models. arXiv:2205.12345 (2022).ā€ we get {title: ā€œā€¦generation using pre-trained language modelsā€, authors: [Zhang,Y; Li,M; Chen,X], venue: arXiv, year: 2022, url: arxiv.org/abs/2205.12345}.
  • What breaks without it: Raw text strings vary in punctuation and line breaks; exact comparison becomes unreliable and brittle.
  2. Verified Memory Querying (Memory Agent)
  • What happens: The system embeds the citation and compares it with a cache of previously verified entries. If similarity exceeds a high threshold, it reuses the verdict.
  • Why this exists: Many classic papers are cited repeatedly; caching avoids redoing web/scholar checks, saving time and cost.
  • Example: If ā€œResNet 2015 He et al.ā€ was verified yesterday, today’s nearly identical entry is instantly passed.
  • What breaks without it: Repeated citations waste retrieval and crawling budget, slowing the pipeline.
  3. Web-based Retrieval (Web Search Agent)
  • What happens: For uncached citations, the agent calls a web search API, downloads the top-5 pages, and deep-crawls content instead of trusting snippets.
  • Why this exists: Real evidence often lives on author pages, arXiv, and publishers; snippets can omit crucial details (like full author lists).
  • Example: The search returns an arXiv page, an author’s CV, and a conference page; the crawler grabs the full text blocks around title, authors, venue, and IDs.
  • What breaks without it: The system either guesses from partial information or over-rejects real citations due to missing context.
  4. Strict Matching and Contextual Judgment (Judge Agent with Strict Consistency Criterion)
  • What happens: The Judge compares citation fields to the retrieved evidence under strict rules: exact title match (ignoring case/tiny words), complete author inclusion, and sensible venue/year logic (e.g., arXiv preprint to later conference is OK; wrong conference is not). It outputs match true/false plus a short note.
  • Why this exists: Pure string code is too brittle; the LLM Judge handles small real-world variations while staying strict about identity.
  • Example: If the citation’s title says ā€œgraph neural net optimizationā€ but the real paper says ā€œgraph neural network optimization,ā€ that might pass; if it says ā€œgraph attention optimization,ā€ that fails the title identity.
  • What breaks without it: Overly strict code flags many real citations; overly loose matching lets fakes slide through.
  5. Scholar Retrieval & Final Verification (Scholar Agent)
  • What happens: If web evidence is insufficient or conflicting, the system queries authoritative databases (e.g., Google Scholar) to fetch the canonical record and re-check.
  • Why this exists: Some fakes look convincing on general web pages; authoritative catalogs reduce false accepts and finalize tough cases.
  • Example: A blog mirrors a made-up title; Scholar shows no official record—verdict: hallucination.
  • What breaks without it: Plausible-but-fake entries get accepted; recall drops on resilient hallucinations.
  6. SOP Orchestration and Planning Model
  • What happens: A controller enforces a fixed order: Memory → Web → Scholar. It distributes work in parallel threads and updates the cache on success.
  • Why this exists: Guarantees consistency, speed, and auditability; prevents agents from skipping steps or making ad-hoc choices.
  • Example: A verified match updates the memory so later duplicates are instant.
  • What breaks without it: Unpredictable behavior, wasted compute, and inconsistent outcomes.
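The fixed Memory → Web → Scholar routing can be sketched as a small controller. The three tier functions below are hypothetical stand-ins; each returns a verdict (True/False) or None when it cannot decide, and any non-memory success updates the cache so duplicates resolve instantly next time.

```python
# Minimal sketch of SOP-style cascade routing (assumed interface: each
# check returns True/False when confident, None to escalate).

def run_sop(citation, memory_check, web_check, scholar_check, cache):
    tiers = [("memory", memory_check), ("web", web_check), ("scholar", scholar_check)]
    for tier, check in tiers:
        verdict = check(citation)
        if verdict is not None:
            if tier != "memory":
                cache[citation] = verdict  # successful verifications update memory
            return tier, verdict
    return "unresolved", None

cache = {}
memory_check = cache.get                                    # tier 1: instant cache hit
web_check = lambda c: True if "arxiv.org" in c else None    # tier 2: toy web evidence
scholar_check = lambda c: False                             # tier 3: authoritative fallback

print(run_sop("Fake Ref 2023", memory_check, web_check, scholar_check, cache))
# ('scholar', False) -- and the verdict is now cached, so a repeat is instant:
print(run_sop("Fake Ref 2023", memory_check, web_check, scholar_check, cache))
# ('memory', False)
```

The second call never touches the slow tiers, which is the efficiency argument behind the cascade.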

Mini-concepts as quick sandwiches:

šŸž Hook: Imagine the librarian’s rulebook for checking a book’s identity. 🄬 The Concept: Strict Consistency Criterion is the field-by-field identity test for title and authors, with sensible venue/year rules. How it works: 1) Title identity. 2) All authors present. 3) Venue/year logic. Why it matters: Stops near-miss fakes. šŸž Anchor: Wrong author list? Fail.

šŸž Hook: If the phone memory already knows a number, you don’t dial directory assistance. 🄬 The Concept: Cascade retrieval uses fast then slow steps: memory, then web, then scholar. How it works: 1) Try cache. 2) Try web evidence. 3) Escalate to authoritative source. Why it matters: Faster and cheaper. šŸž Anchor: Most citations resolve at step 2; only the tricky ones reach step 3.

šŸž Hook: A referee explains every foul call. 🄬 The Concept: Calibrated Judgment means the final verdict is strict, consistent, and comes with a short reason. How it works: 1) Compare fields. 2) Apply rules. 3) Output decision + note. Why it matters: Transparency builds trust. šŸž Anchor: ā€œTitle mismatch on keyword ā€˜Bayesian’; ground-truth uses ā€˜Frequentist’.ā€

Secret sauce—what makes it clever:

  • Evidence over plausibility: It reads actual pages and canonical records.
  • Strict identity with humane tolerance: It allows small format noise but not semantic drift.
  • Deterministic SOP: Clear steps make it reproducible and auditable.
  • Cost-speed balance: Caching and staged retrieval keep it fast without losing accuracy.

Input → Output walkthrough example:

  • Input citation: ā€œDoe, J.; Patel, R. ā€˜Efficient Vision Transformers for Small Devices,’ NeurIPS 2023, DOI:10.xxxx/abcd.ā€
  • Step A (Extract): Structured fields captured.
  • Step B (Memory): No prior hit.
  • Step C (Web): Crawl returns an arXiv page and a workshop page; titles differ slightly; authors match.
  • Step D (Judge): Title mismatch with NeurIPS 2023; venue seems workshop, not main conference.
  • Step E (Scholar): No record of that exact title at NeurIPS 2023; canonical version is a 2024 journal article by the same authors.
  • Output: Hallucination (venue and year mismatch beyond allowed rules), note explains why, cache updated to prevent repeats.

04Experiments & Results

The test: The authors measured how well different systems can spot fake citations (hallucinations) while keeping real ones, using accuracy, precision, recall, and F1. They also timed how long it takes to check 10 citations and estimated costs for API-based models. This matters because in real peer review, you need both correctness and speed, and budgets are limited.
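For readers unfamiliar with the four headline metrics, here is how they fall out of a confusion matrix when a "positive" means a citation flagged as fake. The counts below are made up for illustration and are not from the paper.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard classification metrics from confusion-matrix counts.

    tp: fakes correctly flagged   fp: real citations wrongly flagged
    fn: fakes missed              tn: real citations correctly kept
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# e.g. 450 fakes caught, 20 real citations wrongly flagged,
# 50 fakes missed, 980 real citations kept
print(metrics(tp=450, fp=20, fn=50, tn=980))
```

High precision means few false alarms for reviewers; high recall means few fakes slip through; F1 balances both, which is why the paper reports all four.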

The competition: They compared strong open and proprietary LLM systems and commercial detectors to their multi-agent framework. Baselines included models like GPT-5.2, Claude Sonnet 4.5, Gemini 3 Pro, Mixtral, Llama, Qwen variants, and GPTZero-style tools. The idea was to test whether general-purpose models or simple detectors could match a purpose-built, evidence-grounded pipeline.

The scoreboard (generated benchmark: 3,586 real, 2,500 fake):

  • Many baselines tilted either too strict (flagging too many real citations as fake) or too loose (letting fake citations pass). For example, some models had good precision but missed many fakes (low recall), while others caught most fakes but triggered many false alarms on real references.
  • Their multi-agent method achieved very high balanced performance (near the top across accuracy, precision, recall, and F1), correctly identifying all fabricated references in the generated set while preserving most real citations. In plain terms: like getting an A+ while others hovered around B’s and C’s, often for either over-flagging or under-catching.
  • Efficiency: Their pipeline processed batches quickly and at effectively zero per-token cost (due to local deployment and lightweight agents), beating commercial per-token pricing by a wide margin.

The scoreboard (real-world benchmark: 2,889 real, 467 fake):

  • Real-world noise made baselines wobblier: aggressive models over-rejected genuine citations; permissive ones let fakes slip through. This mirrors messy, practical conditions.
  • The proposed framework again led across accuracy, precision, recall, and F1, with the F1 score beating the second-best by over a third of a point—substantial in this setting. Think of it as not just winning the race, but winning by several strides.
  • Consistency check: The behavior of systems on the generated set matched their behavior on the real-world set, supporting that the benchmark’s fake citations realistically mimic true-world errors. A chi-square test comparing error distributions showed no significant difference, reinforcing the dataset’s fidelity.

Surprising findings:

  • Even powerful proprietary LLMs struggled to reliably execute transparent, verifiable external searches when used as black boxes. Sometimes they appeared to rely on internal memory or unverifiable retrieval, which is risky for citation auditing.
  • A strict but context-aware judge mattered a lot. Replacing the LLM Judge with simple code for exact string matching caused precision to crash: it caught almost all fakes but wrongly punished many real citations—too many false alarms.

Ablation insights (why each agent counts):

  • Without the Scholar Agent: Recall dropped sharply. Some fakes look legit on the general web; only authoritative catalogs exposed them.
  • Without the Judge Agent (code-only): Precision collapsed, F1 plummeted. Real-world noise (name initials, minor title variants) fooled the simplistic rules.
  • Without Web Search: Latency ballooned (about 8Ɨ slower) because every case hit the slow, rate-limited scholarly crawl. The web step is a fast filter that keeps the system practical.

Human meaning of the numbers:

  • High recall: It catches most of the fake citations, so fewer bad references slip through to publication.
  • High precision: When it flags a citation as fake, it is usually right, so reviewers won’t waste time chasing false alarms.
  • High F1: Balanced strength; it’s good at both catching fakes and protecting real citations.
  • Lower time and cost: Journals and conferences can scale checks without blowing deadlines or budgets.

Bottom line: The carefully engineered, evidence-first, multi-agent approach outperformed single-model and string-matching baselines, especially in noisy, real-world conditions, while being explainable and efficient.

05Discussion & Limitations

Limitations:

  • Coverage limits: Even Google Scholar and publisher pages can have gaps, delays, or regional restrictions. Very new works, non-English venues, or paywalled-only items may be harder to verify.
  • Strictness trade-offs: The system insists on identity for title and authors. That’s good for precision, but unusual legitimate variants (e.g., special characters, transliteration differences) can still cause rejections unless the judge’s reasoning accommodates them.
  • Scope focus: The method verifies citation identity (does this reference exist and match fields?). It doesn’t yet check whether the cited source truly supports a specific claim inside the paper’s text at a deep semantic level.
  • Resource needs: Running OCR/vision, web crawling, and scholar queries requires compute, careful rate limiting, and stable internet access. At very large conference scales, orchestration and caching strategies become critical.
  • Human validation bottleneck: The benchmark labels were human-validated for quality. Growing the dataset further requires more human time, though tooling can reduce the load.

Required resources:

  • A GPU-enabled server (for vision extraction and the judge), a scalable crawler, and access to search and scholarly services. Storage for the memory cache and logs to enable audits.

When not to use:

  • Offline-only environments with no web/scholar access.
  • Domains where references are primarily non-textual (e.g., images-only catalogs) or where official identifiers are sparse.
  • Situations demanding deep claim-evidence alignment within the cited paper (this work prioritizes metadata identity and existence).

Open questions:

  • Claim-level support: How to connect each in-text claim to the exact passage in the cited paper and decide ā€œsupports/contradicts/irrelevantā€ robustly?
  • Multilingual and cross-script robustness: How to handle titles and author names across scripts and transliterations without losing identity strictness?
  • Beyond CS/AI: How does performance shift in medicine, law, or humanities where venues and conventions differ widely?
  • Living documents: How to handle preprint-to-final transitions, title changes, and versioning across years without allowing semantic drift?
  • Governance and ethics: What standards should journals adopt to log, share, and act on citation-audit results fairly for authors and reviewers?

06Conclusion & Future Work

Three-sentence summary: This paper introduces CiteAudit, the first open benchmark and a multi-agent, SOP-driven system to verify whether scientific citations truly exist and match their claimed metadata. By splitting work across specialized agents (extract, search, match, judge, scholar) and grounding decisions in real evidence, the approach outperforms strong baselines on both controlled and real-world datasets. The result is faster, more accurate, and more explainable citation auditing suitable for researchers, reviewers, and publishers.

Main achievement: Turning citation checking from fuzzy, string-level guessing into a transparent, evidence-grounded pipeline evaluated on a shared, human-validated benchmark—demonstrably improving accuracy, recall, and F1 while keeping latency and cost low.

Future directions:

  • Move from identity checks to full claim-evidence alignment inside the cited papers (support/contradict/irrelevant).
  • Strengthen multilingual handling and cross-script normalization for global coverage.
  • Expand benchmarks to more fields (biomedicine, social sciences) and edge cases (preprint-to-journal evolutions).
  • Add provenance-rich reporting and APIs for seamless journal and conference integration.

Why remember this: In the LLM era, good-looking citations can be untrue. CiteAudit provides both the map (a public benchmark) and the compass (a multi-agent verifier) to keep scientific references honest, scalable, and trustworthy.

Practical Applications

  • Journal submission gate: automatically audit reference lists on upload and flag risky entries for editors.
  • Conference reviewer assistant: provide verified evidence and mismatch notes next to each citation in a paper.
  • Author-side plugin: pre-check your manuscript’s references before submission to avoid desk rejections.
  • Grant review support: verify supporting studies in proposals to ensure claims rest on real, matching sources.
  • Systematic review tooling: filter out fabricated or mismatched citations before screening studies.
  • University library service: batch-audit theses and dissertations for citation integrity with reports.
  • Corporate and government reports: verify references in policy and technical documents to prevent misinformation.
  • RAG pipeline guardrail: check that citations attached to generated answers correspond to real sources.
  • Publisher production QA: final reference validation before typesetting to reduce errata and retractions.
#citation verification #hallucinated citations #scholarly integrity #multi-agent LLM #web retrieval #Google Scholar crawling #reference auditing #evidence alignment #strict consistency criterion #SOP orchestration #metadata matching #peer review tools #benchmark for citation checking #LLM hallucination detection