QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
Key Summary
- This paper shows that when AI models grade university-level math proofs, they often disagree with human experts in systematic ways.
- The authors built QEDBENCH, a big, carefully checked set of 272 proof problems and 1,300+ model-written solutions, graded by PhD-level experts.
- They use two grading styles (a course-style rubric and an expert-correctness rubric) to see where AI judges go wrong.
- Some AI judges inflate grades a lot, passing proofs that humans would fail, especially in discrete math areas like combinatorics and graph theory.
- Gemini 3.0 Pro solved and explained many problems very well, but several other top models struggled badly on constructive, discrete problems.
- Even strict instructions in the prompt (the course-specific rubric) usually don't fix AI judges; their built-in habits dominate (rubric insensitivity).
- A key measure, the alignment gap, shows how far an AI judge's score is from human scores and reveals domain-specific biases.
- The study finds little evidence that data contamination drove results; success mostly reflects real reasoning, not memorization.
- The authors release QEDBENCH publicly so everyone can calibrate and improve AI judges for advanced math.
- Bottom line: generation is strong, but verification is the new bottleneck; better process-aware judging is needed.
Why This Research Matters
If AI judges over-reward pretty but wrong math, students and researchers will learn the wrong lessons. QEDBENCH shows exactly where today's judges go off track, so we can fix them. It helps teachers, contest organizers, and tool builders know when to trust automated grading and when to ask a human or a formal prover. It also protects future AI training from bad reward signals that favor style over truth. As AI increasingly participates in STEM education and research, reliable evaluation is as important as solving. QEDBENCH gives the community a public, expert-anchored way to measure and improve that reliability.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how teachers don't just check if your final answer is right, they also check if your steps make sense? For hard math, the steps matter even more than the final box.
The Concept (Proof-Based Mathematics Evaluation): What it is: It's grading math work by checking if every step in a proof follows logically, not just whether the last line matches the answer. How it works (recipe): 1) Read the claim to be proved. 2) Check each step for correct definitions and logic. 3) Make sure no hidden assumptions sneak in. 4) Confirm the conclusion truly follows from the steps. Why it matters: Without careful step-checking, someone could write a fancy-looking proof that sounds smart but actually skips a key step, and it would still get full credit. Anchor: For example, saying "All even numbers are divisible by 2; 10 is even; so 10 is divisible by 2" is a valid tiny proof. But "10 is divisible by 3 because 10 is near 9" looks neat but fails logically.
The World Before: Large language models (LLMs) got really good at basic math problems and even many competition-style questions. People started asking LLMs to judge other LLMs' math, like robot teachers grading robot students, because it's faster and cheaper than asking humans each time. This method is called LLM-as-a-Judge.
Hook: Imagine a robot referee calling a basketball game. If it can't tell what a foul really is, it might blow the whistle at the wrong times.
The Concept (LLM-as-a-Judge): What it is: An AI reads a student's proof and assigns a score, acting like a grader. How it works: 1) Feed the proof and a grading guide (rubric) to the AI. 2) The AI checks for correctness and completeness. 3) It outputs a score and (optionally) feedback. Why it matters: If the AI judge is accurate, we can scale grading to thousands of proofs quickly. If it's not, we'll reward wrong reasoning and punish correct work. Anchor: Think of a spelling bee. An AI judge listens and decides "correct" or "wrong." If it misunderstands simple rules like silent letters, it can't be trusted to pick the winner.
The Problem: When judging tough, university-level proofs, LLM judges often give scores that don't line up with human experts. They're especially fooled by proofs that look good but have hidden mistakes. This makes it hard for researchers and teachers to know which solver models truly understand math.
Failed Attempts: Two main paths existed.
- Autoformalization into proof assistants (like Lean) gives rock-solid checking but is slow and expensive to scale: turning each problem and solution into formal code takes tons of expert time.
- Plain LLM-as-a-Judge scales easily but can be unreliable: it may reward stylish writing or familiar templates over real logic.
The Gap: What was missing was a way to measure, precisely and fairly, how far AI judges are from human experts, and to understand where and why they drift.
Hook: Imagine using two thermometers to check a fever: one for school nurses and one for hospital doctors. If they disagree a lot, you need to know by how much and in what situations.
The Concept (Alignment Gap): What it is: The mismatch between an AI judge's score and what human experts would give. How it works: 1) Collect human expert scores. 2) Collect AI judge scores on the same proofs. 3) Compare them to see consistent differences. Why it matters: If AI judges inflate grades by a lot, students and researchers will think weak proofs are strong. Anchor: If a human gives a proof 70/100 but the AI gives 90/100 again and again, you've got an alignment gap that can mislead everyone.
Real Stakes:
- In classrooms, unfair grading hurts learning and trust.
- In research, false confidence in AI proofs wastes time and can spread errors.
- In training new AI systems, bad judge scores teach models to sound persuasive instead of being right.
This paper's answer: QEDBENCH. It's a large, carefully curated benchmark of 272 advanced proof problems (and 1,300+ model-generated solutions) that includes 1,000+ hours of human expert grading. It measures exactly how AI judges differ from experts and shows where models shine or struggle, especially the big difference between smooth, "continuous" math (like calculus and ODEs) and step-by-step, "discrete" math (like combinatorics and graph theory).
Hook: Picture two kinds of mazes. One is a straight tunnel (continuous); you just keep moving forward. The other is a grid with lots of turns (discrete); you must choose your path carefully.
The Concept (Continuous vs. Discrete Reasoning): What it is: Two flavors of math thinking: smooth, formula-based vs. combinatorial, step-building. How it works: 1) Continuous often follows known templates and standard theorems. 2) Discrete often needs constructing exact examples or clever counterexamples. Why it matters: AI models often do great with templates but stumble when they must invent a specific structure step by step. Anchor: Using the quadratic formula (template) is continuous-like. Building a special graph that breaks a rule (construction) is discrete-like.
By telling this whole story, from what used to work, to what breaks, to what QEDBENCH measures, the paper shows that verification (grading correctly) is now the main bottleneck for AI math, not generation (writing a proof-like text).
02 Core Idea
Aha! Moment in one sentence: If we want trustworthy, automated grading of university-level math proofs, we must measure and fix the systematic difference between AI judges and human experts, using two rubrics and a cross-checking matrix that reveals where the misalignment lives.
Hook: You know how getting two opinions from doctors helps you understand a health problem better?
The Concept (Dual-Rubric Strategy): What it is: Grading each proof two ways, (1) like a course instructor (course-specific) and (2) like a research expert (expert-correctness), to spot different kinds of mistakes. How it works: 1) Build a clear course-style rubric (explicit definitions and steps). 2) Build an expert-style rubric (logical soundness; implicit easy steps allowed). 3) Have humans refine and use these. 4) Have AI judges grade using both. Why it matters: A single rubric can hide problems. Two rubrics reveal if judges are missing student-style rigor or expert-level soundness. Anchor: It's like grading a lab report once for neatness and following instructions, and once for scientific correctness.
Hook: Imagine six different coaches all judging the same ice-skating routine; if they consistently score higher or lower than the head referee, you learn each coach's bias.
The Concept (Evaluator–Solver Matrix): What it is: A grid that compares many AI judges to many AI solvers on the same problems. How it works: 1) Generate solutions from several solver models. 2) Grade them with several judge models. 3) Compare all judge scores to human expert scores across the grid. Why it matters: This isolates judge bias (how a judge tends to score) from solver skill (how good a solution is), giving a fairer picture. Anchor: If one judge always gives +10 points regardless of performance, you'll see that in the matrix and not blame the skater.
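The pooling idea can be sketched in a few lines of Python. This is a toy illustration, not the paper's pipeline: the judge names, solver names, and scores below are all hypothetical, and `judge_bias` is an assumed helper. The point is that averaging (judge score minus human score) across many solvers isolates the judge's own tendency from any single solver's skill.

```python
# Sketch of an evaluator-solver matrix (toy data, hypothetical names/scores).
# Each cell holds the AI judge's scores paired with the human expert scores
# for the same solutions.
from statistics import mean

matrix = {
    "judge_A": {"solver_1": [(0.9, 0.75), (1.0, 0.9)],
                "solver_2": [(0.8, 0.5)]},
    "judge_B": {"solver_1": [(0.7, 0.75), (0.9, 0.9)],
                "solver_2": [(0.5, 0.5)]},
}

def judge_bias(judge_cells):
    """Mean (judge - human) difference pooled across all solvers.

    Pooling over many solvers separates the judge's own habit
    (lenient or harsh) from any one solver's skill."""
    diffs = [j - h for cell in judge_cells.values() for j, h in cell]
    return mean(diffs)

for judge, cells in matrix.items():
    print(judge, round(judge_bias(cells), 3))
```

With this toy data, judge_A pools to a clearly positive bias (lenient across both solvers) while judge_B sits near zero, which is exactly the "fingerprint" the matrix is meant to expose.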
Hook: Think of trying on different pairs of glasses. Some make the world look brighter than it really is; some make it darker.
The Concept (Alignment Gap): What it is: The difference between AI and human scores for the same proof. How it works: 1) Take the human score as ground truth. 2) Compute the AI's score difference. 3) Aggregate differences by topic (e.g., Analysis, Graph Theory) to find patterns. Why it matters: It tells you exactly where AI judges are too lenient (pass bad proofs) or too harsh (fail good ones). Anchor: If humans give 75 on average and the AI gives 90 on average in combinatorics, the AI is likely inflating grades there.
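The three steps above can be sketched as a tiny aggregation, with made-up scores on a 0-1 scale; `gap_by_topic` is a hypothetical helper, not the paper's code:

```python
# Minimal sketch: aggregating the alignment gap (AI score - human score)
# by topic to surface domain-specific bias. All numbers are illustrative.
from collections import defaultdict
from statistics import mean

# (topic, human_score, ai_judge_score) triples -- hypothetical gradings
gradings = [
    ("Combinatorics", 0.5, 0.9),
    ("Combinatorics", 0.75, 0.9),
    ("Analysis", 0.9, 0.9),
    ("Analysis", 1.0, 0.9),
]

def gap_by_topic(records):
    """Mean signed gap per topic: positive means the judge inflates grades."""
    buckets = defaultdict(list)
    for topic, human, ai in records:
        buckets[topic].append(ai - human)
    return {topic: mean(gaps) for topic, gaps in buckets.items()}

print(gap_by_topic(gradings))
```

A clearly positive mean gap in one topic (here, Combinatorics) and a near-zero or negative gap in another is the domain-specific pattern the paper looks for.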
Multiple Analogies for the Core Idea:
- Two Thermometers: One measures skin temperature (course rubric), one measures core temperature (expert rubric). Using both tells you if the patient (the proof) is truly healthy.
- Airport Security: One scanner checks liquids and laptops (course rules), the other scans for risky patterns (expert correctness). Together they catch more problems.
- Double-Lens Camera: A wide lens (course rubric) checks visible steps, and a zoom lens (expert rubric) checks subtle logical connections.
Before vs. After:
- Before: People trusted single-shot AI grading and broad benchmarks; they didn't see where judges over-rewarded stylish but broken logic.
- After: With QEDBENCH, we see domain-specific biases (e.g., grade inflation in combinatorics) and the big discrete–continuous split. Now we know automated grading needs calibration and process-aware checks.
Why It Works (intuition, not equations): Using two rubrics prevents a single viewpoint from hiding weaknesses. Crossing multiple solvers with multiple judges prevents blaming the wrong side. Human expert scoring provides a strong anchor so we can detect consistent judge drift. This triangulation (two rubrics + many judges + many solvers) reveals not just noise, but patterns of bias.
Building Blocks of the Core Idea:
- Carefully curated proof problems that demand real reasoning, not just final answers.
- Human expert scoring on a fine-grained scale (so "almost correct" differs from "looks nice but wrong").
- Dual rubrics to separate classroom rigor from expert-level soundness.
- A 7×5 evaluator–solver matrix (seven judges crossed with five solvers) so alignment patterns can't hide.
- Analysis across ten math areas to find where judges reliably misgrade.
Hook: Think of a radio that's too loud on one station and too quiet on another; you need to adjust the knob differently depending on the channel.
The Concept (Statistical Calibration): What it is: Adjusting how we interpret AI judge scores by accounting for their typical bias against expert ground truth. How it works: 1) Measure each judge's average over/under-scoring in each domain. 2) Use these gaps to re-interpret future scores. 3) Prefer judges with smaller, more stable gaps. Why it matters: Without calibration, a flashy but lenient judge misleads us about solver quality. Anchor: If Judge A is usually +10 points in Graph Theory, a 90 from A might mean "really an 80" compared to humans.
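A minimal sketch of this calibration idea, assuming the per-domain bias has already been measured (the bias numbers and domain names below are hypothetical, and `calibrate` is an illustrative helper, not the paper's method):

```python
# Sketch of per-domain calibration: subtract a judge's measured bias from
# its raw score before interpreting it. Bias values are hypothetical.

measured_bias = {               # mean (judge - human) gap per domain
    "Graph Theory": +0.10,      # this judge runs lenient here
    "Complex Analysis": -0.05,  # and slightly harsh here
}

def calibrate(raw_score, domain, bias=measured_bias):
    """Re-interpret a raw judge score against human ground truth."""
    adjusted = raw_score - bias.get(domain, 0.0)
    return max(0.0, min(1.0, adjusted))  # clamp to the 0-1 scoring scale

print(round(calibrate(0.90, "Graph Theory"), 2))  # prints 0.8
```

This is the "Judge A's 90 really means 80" adjustment from the anchor, expressed on the benchmark's 0-1 scale.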
Finally, QEDBENCH also introduces two important diagnoses:
- Sycophancy Trap: Some judges reward authoritative-sounding math even when it's irrelevant or wrong.
- Rubric Insensitivity: Even when you tell a judge to be stricter (course rubric), its internal habits barely change, showing prompts alone won't fix judging. These findings point toward process supervision and better training for future AI judges.
03 Methodology
At a high level: Problems → [Solutions from multiple solver models] → [Human expert scores (ground truth) using refined rubrics] → [AI judges grade the same solutions under both rubrics] → [Compare AI vs. human to measure alignment gaps and biases].
Step 1: Problem Curation and Anti-Leakage
- What happens: The team gathered 272 upper-undergrad/early-grad proof problems across ten fields (like Analysis, Abstract Algebra, Combinatorics, Graph Theory, ODEs). They rewrote problems to avoid simple memorization, and they ran an adversarial web search to flag problems with existing solutions online.
- Why this step exists: If models just memorize old solutions, youâre not testing reasoning. You need fresh wording and careful checks.
- Example: Suppose a classic Graph Theory problem asks for a path construction. They rephrase it but keep the same mathematical heart, then scan the web to ensure no identical solution exists. If a match appears, it's tagged; later analysis shows that even with some online matches, performance differences were statistically tiny, suggesting real reasoning drove success.
Hook: Imagine giving your friend a puzzle and first making sure they've never seen that exact puzzle before.
The Concept (Adversarial Data Curation): What it is: Designing and checking problems so that solving them needs genuine reasoning, not recall. How it works: 1) Source high-quality problems. 2) Reword them carefully. 3) Use automated deep search to detect close matches online. 4) Tag or remove suspicious items. Why it matters: This keeps the benchmark honest; wins come from thinking, not remembering. Anchor: If a puzzle is reworded and no matching solution is online, a solver's success likely reflects real skill.
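As a toy illustration of step 3, overlap with known web text could be flagged with a crude word-level Jaccard similarity. The paper's actual pipeline uses adversarial deep web search, so everything below (the helper names, the threshold, the example sentences) is an assumed stand-in proxy, not the real decontamination tool:

```python
# Toy stand-in for the decontamination check: flag a reworded problem if
# its wording is still too close to a known online statement/solution.
# Word-set Jaccard overlap is only an illustrative proxy.

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def is_suspicious(problem: str, web_hits: list[str], threshold: float = 0.8) -> bool:
    """Tag the problem if any web hit overlaps above the threshold."""
    return any(jaccard(problem, hit) >= threshold for hit in web_hits)

original = "prove every connected graph on n vertices has a spanning tree"
reworded = "show that any connected graph with n nodes contains a spanning tree"
print(is_suspicious(reworded, [original]))  # reworded enough -> not flagged
```

A real system needs far more than lexical overlap (paraphrase and translation both defeat it), which is why the paper leans on deep search plus human tagging.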
Step 2: Solution Generation (5 Solvers)
- What happens: Five cutting-edge solver models each attempt all problems, with generous token limits and retries if needed. If a model repeatedly fails to produce a proof (e.g., timeout), that attempt gets a 0, reflecting real-world deployment.
- Why this step exists: You need a diverse pool of solutionsâgood, bad, and in-betweenâto fairly test judges.
- Example: One solver writes a long Analysis proof that looks stylish but mixes up uniform and pointwise convergence. Another solver for Combinatorics tries to construct a bijection but leaves a gap. These varied attempts are perfect for testing how well judges catch subtle mistakes.
Step 3: Human Expert Annotation & Rubric Refinement
- What happens: 48 experts (PhD holders/candidates) score solutions using a fine-grained scale (like 0, 0.25, 0.5, 0.75, 0.9, 1.0). They use two types of rubrics: course-specific (pedagogical constraints) and expert-correctness. Drafts of rubrics were generated by a model, then iteratively refined by humans until they matched expert standards.
- Why this step exists: Human scores become the ground truth. The two rubrics let us separate âfollowed course rulesâ from âis logically sound at expert level.â
- Example: A proof that's basically correct but skips a small edge case might earn 0.9. A proof that looks nice but hides a fatal logical gap might get 0.5.
Hook: Think of coaches marking exactly where a gymnast wobbled on the beam, not just saying "good" or "bad."
The Concept (Expert Annotation): What it is: Detailed human grading that highlights the precise strengths and mistakes. How it works: 1) Experts score using a clear scale. 2) They give justifications and mark error locations. 3) They refine rubrics so scores match intended standards. Why it matters: Without a trustworthy human anchor, you can't measure how far AI judges drift. Anchor: If three experts say "this step is circular," we can compare whether AI judges notice or ignore that same flaw.
Step 4: Judge Calibration with the Evaluator–Solver Matrix
- What happens: Seven judge models grade all solver outputs under both rubrics. Now we can compare each judge's scores to human scores, across many solvers and topics.
- Why this step exists: To separate judge behavior (lenient/harsh, domain biases) from solver skill and to measure reliability.
- Example: One judge might pass nearly everything in Combinatorics (too lenient), while another unfairly fails valid Complex Analysis proofs that skip trivial expert steps (too harsh).
Secret Sauce: Measuring the Alignment Gap and Bias Patterns
- Core quantity: the alignment gap between AI and human scores. The paper defines bias per domain and per judge, making it easy to see systematic inflation or punitiveness.
- Simple formula with example: The gap can be summarized by the difference Δ = S_AI - S_human. For example, with S_human = 0.5 and S_AI = 0.9, Δ = +0.4.
- Why it's clever: Instead of arguing over single examples, we see stable, repeatable judge patterns across many problems, like a fingerprint of each judge's habits.
Rubric Insensitivity Check
- What happens: They test whether stricter prompts (course-specific rubric that penalizes forbidden shortcuts) actually change a judgeâs alignment. Surprisingly, alignment metrics barely move; judge habits dominate the prompt.
- Why it matters: This shows you can't fix judging with prompt tweaks alone. You likely need process supervision or fine-tuning on better signals.
- Example: Even when told "penalize advanced theorems used without derivation," a judge still gives similar scores, missing the point that the student skipped key course steps.
Error Mode Diagnostics
- Leniency (False Positives): Passing broken proofs because they look plausible.
- Harshness (False Negatives): Failing valid proofs because they donât match a familiar template.
- Sycophancy Trap: Rewarding authoritative-sounding LaTeX or even irrelevant-but-correct math blocks that don't solve the asked problem.
Quality Control: Data Contamination Analysis
- What happens: They compare performance on problems with and without known online solutions. Differences are tiny and statistically non-significant.
- Why it matters: It suggests results reflect real reasoning, not just memorization.
- Example: Across thousands of model–problem pairs, mean scores and pass rates are nearly the same whether or not a solution exists online.
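The spirit of this check can be sketched with a simple permutation test: shuffle the "has an online solution" labels many times and ask how often a mean gap at least as large as the observed one appears by chance. All scores below are made up, and the paper's exact statistical procedure is not specified here:

```python
# Hedged sketch of the contamination check: compare mean scores on problems
# with vs. without known online solutions via a permutation test.
import random
from statistics import mean

with_online = [0.9, 0.75, 1.0, 0.5, 0.9, 0.75]      # illustrative scores
without_online = [0.75, 0.9, 0.5, 1.0, 0.75, 0.9]   # illustrative scores

def perm_test(a, b, n_iter=10_000, seed=0):
    """Fraction of label shuffles whose mean gap is at least as large as
    the observed one (a two-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled, n = a + b, len(a)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n]) - mean(pooled[n:])) >= observed:
            hits += 1
    return hits / n_iter

p = perm_test(with_online, without_online)
print(p)  # a large p-value = no evidence contamination drives scores
```

A large p-value, as in the paper's comparison, means the with/without difference is indistinguishable from noise.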
In short: QEDBENCH is a carefully staged recipe that starts with honest, varied problems; gathers diverse model solutions; anchors everything with expert human scoring under two rubrics; and then uses a big judge–solver grid to reveal exactly how, where, and why AI judges drift from real mathematical rigor.
04 Experiments & Results
The Test: What did they measure and why?
- They measured how well solver models did on 272 tough proof problems, as scored by human experts. Then they measured how closely seven AI judges' scores matched those human scores, under both the course-specific and the expert-correctness rubrics. This tells us both solver skill and judge reliability.
The Competition: Who/what was compared?
- Solvers: Five frontier models generated proofs. Performance varied by field (Analysis, ODEs, Probability, Abstract Algebra, Discrete Math, Graph Theory, Combinatorics, etc.).
- Judges: Seven frontier evaluators graded all those proofs. Each judge's results were compared with human experts to reveal leniency or harshness, and domain-specific biases.
The Scoreboard (with context):
- Overall solver highlight: Gemini 3.0 Pro stood out with strong average human-evaluated scores and high pass rates across many fields, including very robust results in Graph Theory. This is like getting an A across most subjects. Meanwhile, several other strong solvers that excelled in continuous domains (like ODEs) struggled in discrete, constructive tasks (like Combinatorics), where pass rates could collapse dramatically.
- Discrete–Continuous Divide: In smooth, template-friendly areas (e.g., ODEs, standard Complex Analysis techniques), many models achieved near-perfect pass rates, like consistently following a recipe. But in Combinatorics and some Graph Theory tasks that require inventing concrete objects or counterexamples, several models' pass rates dropped steeply (like going from an A to a D), exposing a core reasoning gap.
- Judge strictness and bias: Human experts set a baseline for who "passes." Some AI judges passed far more proofs than humans (grade inflation), especially in discrete fields where fancy-looking but flawed logic is common. The worst inflators were like referees who rarely call fouls; they let a lot of bad play slide. Other judges were too harsh in continuous areas, penalizing acceptable expert shortcuts that humans would allow.
Surprising Findings:
- Big Leniency in Some Judges: One judge showed extremely high false-positive behavior, passing a large share of flawed proofs. In plain terms, it said "good job!" to many wrong arguments just because they looked tidy and confident.
- Punitive Behavior in Continuous Math: Other judges were stricter than humans in topics like Complex Analysis, where humans often allow implicit expert steps (e.g., standard contour integral steps). These judges demanded overly explicit details, lowering valid scores.
- Rubric Insensitivity: Switching from the expert rubric to the stricter course-specific rubric barely changed alignment for several judges. This suggests that judge behavior comes more from internal habits than from the prompt's instructions.
- The Sycophancy Trap: Judges sometimes rewarded irrelevant or even empty submissions (like formatted error messages) with partial credit simply because the text "looked" mathematical. They also sometimes penalized students who correctly pointed out that the problem's statement was false, preferring prompt-following over mathematical truth.
- Contamination Not Driving Results: Comparing problems with known online solutions to those without showed tiny, statistically insignificant differences in human-scored performance. That means success is mostly real reasoning, not memorization.
Making Numbers Meaningful (examples):
- Think of human experts setting the pass bar. A lenient judge might pass around 9 out of 10 proofs where humans pass about 7 out of 10, like giving out too many A's. This inflates solver reputation.
- For discrete problems, imagine a solver drops from a B to a D compared to its performance in ODEs. That's a big red flag: the model can follow a recipe but struggles with step-by-step construction.
Key Diagnostic Metric (with example):
- The Alignment Gap uses a simple difference: Δ = S_AI - S_human. For example, if a human gives S_human = 0.5 and the AI gives S_AI = 0.9, then Δ = +0.4. That's like nearly two letter grades higher than the teacher's score, a strong sign of inflation.
Takeaways:
- Solver ability is not one-dimensional. Being great at ODEs doesn't imply being great at Combinatorics.
- Judge reliability is now the bottleneck. Without calibrated judges, we can't trust automated pass/fail decisions for advanced proofs.
- Prompt engineering alone won't fix judges. Better training and process-aware supervision are needed so judges focus on logic, not looks.
- QEDBENCH gives the community a public, carefully grounded way to see and reduce these gaps.
05 Discussion & Limitations
Limitations:
- Single Ground Truth: Even expert rubrics can miss unusual but valid proofs. So while human consensus is a strong anchor, it's not omniscient.
- English Focus: Problems and solutions are in English; behavior could differ across languages.
- Model Involvement: A powerful model helped draft initial rubrics (humans refined them), which could slightly advantage a similar-family judge.
- Static Snapshots: The benchmark reflects a moment in time; new models may shift the landscape and require new audits.
Required Resources:
- Expert Time: Building and maintaining a human-anchored benchmark took 1,000+ expert hours.
- Compute and APIs: Generating long proofs and running many judge–solver pairs uses significant tokens and timeouts, especially with streaming APIs.
- Curation Workflow: Adversarial decontamination and rubric refinement need careful pipelines.
When NOT to Use This Setup:
- If you need instant, high-stakes certification (e.g., publishable theorems), don't rely solely on LLM judges; use formal proof assistants or human review.
- If your domain is outside the ten covered areas (e.g., deep algebraic geometry), don't assume the same patterns hold without testing.
- If you can't afford human anchoring, be cautious: raw judge scores can be misleading.
Open Questions:
- Process Supervision: How can we train judges to track global logical dependencies, not just local plausibility? Can step-checkers with explicit error attribution reduce leniency?
- Discrete Constructive Skills: What training signals help solvers build or refute specific combinatorial objects reliably?
- Judge Robustness: How do we immunize judges against sycophancy (rewarding style over truth) and against prompt priming or one-token adversarial triggers?
- Calibration Across Domains: Can we learn per-domain correction maps that make a single judge reliably human-aligned across Analysis, Graph Theory, etc.?
Honest Assessment: QEDBENCH convincingly shows that verification is the new frontier: models can write convincing math, but judges too often mistake confidence for correctness. The discrete–continuous divide reveals that "reasoning" isn't one skill; template following differs from constructive search. The dual-rubric, multi-judge design is a strong, practical step toward trustworthy evaluation, but widespread adoption will still require better-trained judges, process-aware checking, and continued human oversight.
06 Conclusion & Future Work
Three-Sentence Summary:
- QEDBENCH shows that automated graders for university-level math proofs systematically diverge from human experts and often reward persuasive but flawed logic, especially in discrete, constructive domains.
- By combining two rubrics and a comprehensive judge–solver matrix anchored in 1,000+ hours of expert scoring, the benchmark precisely measures where and how judges go off track.
- Results point to a clear need for process-aware, better-trained judges; prompt tweaks alone won't fix the problem.
Main Achievement:
- The paper delivers a rigorous, public benchmarkâQEDBENCHâthat quantifies the alignment gap for advanced proofs and exposes domain-specific patterns like grade inflation in combinatorics and punitiveness in complex analysis.
Future Directions:
- Train process-supervised judges that track global logic, not just local style, and that reduce both leniency (false positives) and harshness (false negatives).
- Develop per-domain calibration so one judge can be human-aligned across Analysis, ODEs, Combinatorics, and beyond.
- Integrate formal verification tools where feasible and build hybrid pipelines that escalate uncertain cases to humans.
Why Remember This:
- Generation isnât the bottleneck anymoreâverification is. QEDBENCH gives the community a trustworthy way to see the problem clearly and to measure progress. With it, we can move from rewarding âhallucinated rigorâ to rewarding real mathematical correctnessâbringing AI grading closer to the way great teachers think.
Practical Applications
- Calibrate an AI judge: Measure its per-topic bias against QEDBENCH and adjust scores accordingly.
- Select safer judges: Choose the evaluator with the smallest alignment gap for your math course or platform.
- Curriculum-aligned grading: Use the course-specific rubric mode to catch forbidden shortcuts in homework.
- Research vetting: Run proofs through expert-style grading to flag hidden logical gaps before sharing results.
- Train better reward models: Use human-aligned labels from QEDBENCH to reduce sycophancy and leniency.
- Mixed human–AI pipelines: Auto-grade easy cases and escalate uncertain ones based on measured judge reliability.
- Contest evaluation: Benchmark judges on combinatorics/graph theory to avoid passing flawed solutions en masse.
- Model diagnostics: Compare solver performance across discrete vs. continuous topics to guide fine-tuning.
- Judge stress tests: Probe judges with irrelevant or adversarially formatted responses and measure false positives.
- Benchmark updates: Re-run new judges and solvers regularly to track progress and prevent regressions.
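As one illustration of the mixed human–AI pipeline idea above, a router could auto-accept an AI judge's verdict only in domains where its measured alignment gap is small, and escalate the rest to a person. The reliability numbers, domain names, and `route` helper below are all hypothetical, a sketch of the escalation pattern rather than anything from the paper:

```python
# Illustrative mixed human-AI grading router: auto-grade where the judge is
# known to align well with humans, escalate elsewhere. Numbers are made up.

JUDGE_RELIABILITY = {        # measured |alignment gap| per domain
    "ODEs": 0.03,            # well-aligned: safe to auto-grade
    "Combinatorics": 0.20,   # known inflation: send to a human
}

def route(domain, ai_score, max_gap=0.05):
    """Return ("auto", score) when the judge is trustworthy in this domain,
    otherwise ("human_review", None). Unknown domains default to humans."""
    if JUDGE_RELIABILITY.get(domain, 1.0) <= max_gap:
        return ("auto", ai_score)
    return ("human_review", None)

print(route("ODEs", 0.9))           # trusted domain -> auto-accept
print(route("Combinatorics", 0.9))  # inflated domain -> escalate
```

Defaulting unknown domains to human review reflects the paper's caution: without a measured gap, a raw judge score should not be trusted.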