QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
Key Summary
- This paper shows that when AI models grade university-level math proofs, they often disagree with human experts in systematic ways.
- The authors built QEDBENCH, a big, carefully checked set of 272 proof problems and 1,300+ model-written solutions, graded by PhD-level experts.
- They use two grading styles (a course-style rubric and an expert-correctness rubric) to see where AI judges go wrong.
- Some AI judges inflate grades a lot, passing proofs that humans would fail, especially in discrete math areas like combinatorics and graph theory.
- Gemini 3.0 Pro solved and explained many problems very well, but several other top models struggled badly on constructive, discrete problems.
- Even strict instructions in the prompt (the course-specific rubric) usually don't fix AI judges; their built-in habits dominate (rubric insensitivity).
- A key measure, the alignment gap, shows how far an AI judge's score is from human scores and reveals domain-specific biases.
- The study finds little evidence that data contamination drove results; success mostly reflects real reasoning, not memorization.
- The authors release QEDBENCH publicly so everyone can calibrate and improve AI judges for advanced math.
- Bottom line: generation is strong, but verification is the new bottleneck; better process-aware judging is needed.
Why This Research Matters
If AI judges over-reward pretty but wrong math, students and researchers will learn the wrong lessons. QEDBENCH shows exactly where today's judges go off track, so we can fix them. It helps teachers, contest organizers, and tool builders know when to trust automated grading and when to ask a human or a formal prover. It also protects future AI training from bad reward signals that favor style over truth. As AI increasingly participates in STEM education and research, reliable evaluation is as important as solving. QEDBENCH gives the community a public, expert-anchored way to measure and improve that reliability.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how teachers don't just check if your final answer is right, they also check if your steps make sense? For hard math, the steps matter even more than the final box.
The Concept (Proof-Based Mathematics Evaluation): What it is: It's grading math work by checking if every step in a proof follows logically, not just whether the last line matches the answer. How it works (recipe): 1) Read the claim to be proved. 2) Check each step for correct definitions and logic. 3) Make sure no hidden assumptions sneak in. 4) Confirm the conclusion truly follows from the steps. Why it matters: Without careful step-checking, someone could write a fancy-looking proof that sounds smart but actually skips a key step, and it would still get full credit. Anchor: For example, saying "All even numbers are divisible by 2; 10 is even; so 10 is divisible by 2" is a valid tiny proof. But "10 is divisible by 3 because 10 is near 9" looks neat but fails logically.
The World Before: Large language models (LLMs) got really good at basic math problems and even many competition-style questions. People started asking LLMs to judge other LLMs' math, like robot teachers grading robot students, because it's faster and cheaper than asking humans each time. This method is called LLM-as-a-Judge.
Hook: Imagine a robot referee calling a basketball game. If it can't tell what a foul really is, it might blow the whistle at the wrong times.
The Concept (LLM-as-a-Judge): What it is: An AI reads a student's proof and assigns a score, acting like a grader. How it works: 1) Feed the proof and a grading guide (rubric) to the AI. 2) The AI checks for correctness and completeness. 3) It outputs a score and (optionally) feedback. Why it matters: If the AI judge is accurate, we can scale grading to thousands of proofs quickly. If it's not, we'll reward wrong reasoning and punish correct work. Anchor: Think of a spelling bee. An AI judge listens and decides "correct" or "wrong." If it misunderstands simple rules like silent letters, it can't be trusted to pick the winner.
The Problem: When judging tough, university-level proofs, LLM judges often give scores that don't line up with human experts. They're especially fooled by proofs that look good but have hidden mistakes. This makes it hard for researchers and teachers to know which solver models truly understand math.
Failed Attempts: Two main paths existed.
- Autoformalization into proof assistants (like Lean) gives rock-solid checking but is slow and expensive to scale: turning each problem and solution into formal code takes tons of expert time.
- Plain LLM-as-a-Judge scales easily but can be unreliable: it may reward stylish writing or familiar templates over real logic.
The Gap: What was missing was a way to measure, precisely and fairly, how far AI judges are from human experts, and to understand where and why they drift.
Hook: Imagine using two thermometers to check a fever: one for school nurses and one for hospital doctors. If they disagree a lot, you need to know by how much and in what situations.
The Concept (Alignment Gap): What it is: The mismatch between an AI judge's score and what human experts would give. How it works: 1) Collect human expert scores. 2) Collect AI judge scores on the same proofs. 3) Compare them to see consistent differences. Why it matters: If AI judges inflate grades by a lot, students and researchers will think weak proofs are strong. Anchor: If a human gives a proof 70/100 but the AI gives 90/100 again and again, you've got an alignment gap that can mislead everyone.
Real Stakes:
- In classrooms, unfair grading hurts learning and trust.
- In research, false confidence in AI proofs wastes time and can spread errors.
- In training new AI systems, bad judge scores teach models to sound persuasive instead of being right.
This paper's answer: QEDBENCH. It's a large, carefully curated benchmark of 272 advanced proof problems (and 1,300+ model-generated solutions) that includes 1,000+ hours of human expert grading. It measures exactly how AI judges differ from experts and shows where models shine or struggle, especially the big difference between smooth, "continuous" math (like calculus and ODEs) and step-by-step, "discrete" math (like combinatorics and graph theory).
Hook: Picture two kinds of mazes. One is a straight tunnel (continuous); you just keep moving forward. The other is a grid with lots of turns (discrete); you must choose your path carefully.
The Concept (Continuous vs. Discrete Reasoning): What it is: Two flavors of math thinking: smooth, formula-based vs. combinatorial, step-building. How it works: 1) Continuous often follows known templates and standard theorems. 2) Discrete often needs constructing exact examples or clever counterexamples. Why it matters: AI models often do great with templates but stumble when they must invent a specific structure step by step. Anchor: Using the quadratic formula (template) is continuous-like. Building a special graph that breaks a rule (construction) is discrete-like.
By telling this whole story, from what used to work, to what breaks, to what QEDBENCH measures, the paper shows that verification (grading correctly) is now the main bottleneck for AI math, not generation (writing a proof-like text).
02 Core Idea
Aha! Moment in one sentence: If we want trustworthy, automated grading of university-level math proofs, we must measure and fix the systematic difference between AI judges and human experts, using two rubrics and a cross-checking matrix that reveals where the misalignment lives.
Hook: You know how getting two opinions from doctors helps you understand a health problem better?
The Concept (Dual-Rubric Strategy): What it is: Grading each proof two ways, (1) like a course instructor (course-specific) and (2) like a research expert (expert-correctness), to spot different kinds of mistakes. How it works: 1) Build a clear course-style rubric (explicit definitions and steps). 2) Build an expert-style rubric (logical soundness; implicit easy steps allowed). 3) Have humans refine and use these. 4) Have AI judges grade using both. Why it matters: A single rubric can hide problems. Two rubrics reveal if judges are missing student-style rigor or expert-level soundness. Anchor: It's like grading a lab report once for neatness and following instructions, and once for scientific correctness.
Hook: Imagine six different coaches all judging the same ice-skating routine; if they consistently score higher or lower than the head referee, you learn each coach's bias.
The Concept (Evaluator–Solver Matrix): What it is: A grid that compares many AI judges to many AI solvers on the same problems. How it works: 1) Generate solutions from several solver models. 2) Grade them with several judge models. 3) Compare all judge scores to human expert scores across the grid. Why it matters: This isolates judge bias (how a judge tends to score) from solver skill (how good a solution is), giving a fairer picture. Anchor: If one judge always gives +10 points regardless of performance, you'll see that in the matrix and not blame the skater.
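The pooling idea can be sketched in a few lines of Python. This is a toy illustration, not the paper's pipeline: the judge names, solver names, and scores below are all hypothetical, and `judge_bias` is an assumed helper. The point is that averaging (judge score minus human score) across many solvers isolates the judge's own tendency from any single solver's skill.

```python
# Sketch of an evaluator-solver matrix (toy data, hypothetical names/scores).
# Each cell holds the AI judge's scores paired with the human expert scores
# for the same solutions.
from statistics import mean

matrix = {
    "judge_A": {"solver_1": [(0.9, 0.75), (1.0, 0.9)],
                "solver_2": [(0.8, 0.5)]},
    "judge_B": {"solver_1": [(0.7, 0.75), (0.9, 0.9)],
                "solver_2": [(0.5, 0.5)]},
}

def judge_bias(judge_cells):
    """Mean (judge - human) difference pooled across all solvers.

    Pooling over many solvers separates the judge's own habit
    (lenient or harsh) from any one solver's skill."""
    diffs = [j - h for cell in judge_cells.values() for j, h in cell]
    return mean(diffs)

for judge, cells in matrix.items():
    print(judge, round(judge_bias(cells), 3))
```

With this toy data, judge_A pools to a clearly positive bias (lenient across both solvers) while judge_B sits near zero, which is exactly the "fingerprint" the matrix is meant to expose.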
Hook: Think of trying on different pairs of glasses. Some make the world look brighter than it really is; some make it darker.
The Concept (Alignment Gap): What it is: The difference between AI and human scores for the same proof. How it works: 1) Take the human score as ground truth. 2) Compute the AI's score difference. 3) Aggregate differences by topic (e.g., Analysis, Graph Theory) to find patterns. Why it matters: It tells you exactly where AI judges are too lenient (pass bad proofs) or too harsh (fail good ones). Anchor: If humans give 75 on average and the AI gives 90 on average in combinatorics, the AI is likely inflating grades there.
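The three steps above can be sketched as a tiny aggregation, with made-up scores on a 0-1 scale; `gap_by_topic` is a hypothetical helper, not the paper's code:

```python
# Minimal sketch: aggregating the alignment gap (AI score - human score)
# by topic to surface domain-specific bias. All numbers are illustrative.
from collections import defaultdict
from statistics import mean

# (topic, human_score, ai_judge_score) triples -- hypothetical gradings
gradings = [
    ("Combinatorics", 0.5, 0.9),
    ("Combinatorics", 0.75, 0.9),
    ("Analysis", 0.9, 0.9),
    ("Analysis", 1.0, 0.9),
]

def gap_by_topic(records):
    """Mean signed gap per topic: positive means the judge inflates grades."""
    buckets = defaultdict(list)
    for topic, human, ai in records:
        buckets[topic].append(ai - human)
    return {topic: mean(gaps) for topic, gaps in buckets.items()}

print(gap_by_topic(gradings))
```

A clearly positive mean gap in one topic (here, Combinatorics) and a near-zero or negative gap in another is the domain-specific pattern the paper looks for.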
Multiple Analogies for the Core Idea:
- Two Thermometers: One measures skin temperature (course rubric), one measures core temperature (expert rubric). Using both tells you if the patient (the proof) is truly healthy.
- Airport Security: One scanner checks liquids and laptops (course rules), the other scans for risky patterns (expert correctness). Together they catch more problems.
- Double-Lens Camera: A wide lens (course rubric) checks visible steps, and a zoom lens (expert rubric) checks subtle logical connections.
Before vs. After:
- Before: People trusted single-shot AI grading and broad benchmarks; they didn't see where judges over-rewarded stylish but broken logic.
- After: With QEDBENCH, we see domain-specific biases (e.g., grade inflation in combinatorics) and the big discrete–continuous split. Now we know automated grading needs calibration and process-aware checks.
Why It Works (intuition, not equations): Using two rubrics prevents a single viewpoint from hiding weaknesses. Crossing multiple solvers with multiple judges prevents blaming the wrong side. Human expert scoring provides a strong anchor so we can detect consistent judge drift. This triangulation (two rubrics + many judges + many solvers) reveals not just noise, but patterns of bias.
Building Blocks of the Core Idea:
- Carefully curated proof problems that demand real reasoning, not just final answers.
- Human expert scoring on a fine-grained scale (so "almost correct" differs from "looks nice but wrong").
- Dual rubrics to separate classroom rigor from expert-level soundness.
- A 7×5 evaluator–solver matrix (seven judges crossed with five solvers) so alignment patterns can't hide.
- Analysis across ten math areas to find where judges reliably misgrade.
Hook: Think of a radio that's too loud on one station and too quiet on another; you need to adjust the knob differently depending on the channel.
The Concept (Statistical Calibration): What it is: Adjusting how we interpret AI judge scores by accounting for their typical bias against expert ground truth. How it works: 1) Measure each judge's average over/under-scoring in each domain. 2) Use these gaps to re-interpret future scores. 3) Prefer judges with smaller, more stable gaps. Why it matters: Without calibration, a flashy but lenient judge misleads us about solver quality. Anchor: If Judge A is usually +10 points in Graph Theory, a 90 from A might mean "really an 80" compared to humans.
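A minimal sketch of this calibration idea, assuming the per-domain bias has already been measured (the bias numbers and domain names below are hypothetical, and `calibrate` is an illustrative helper, not the paper's method):

```python
# Sketch of per-domain calibration: subtract a judge's measured bias from
# its raw score before interpreting it. Bias values are hypothetical.

measured_bias = {               # mean (judge - human) gap per domain
    "Graph Theory": +0.10,      # this judge runs lenient here
    "Complex Analysis": -0.05,  # and slightly harsh here
}

def calibrate(raw_score, domain, bias=measured_bias):
    """Re-interpret a raw judge score against human ground truth."""
    adjusted = raw_score - bias.get(domain, 0.0)
    return max(0.0, min(1.0, adjusted))  # clamp to the 0-1 scoring scale

print(round(calibrate(0.90, "Graph Theory"), 2))  # prints 0.8
```

This is the "Judge A's 90 really means 80" adjustment from the anchor, expressed on the benchmark's 0-1 scale.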
Finally, QEDBENCH also introduces two important diagnoses:
- Sycophancy Trap: Some judges reward authoritative-sounding math even when it's irrelevant or wrong.
- Rubric Insensitivity: Even when you tell a judge to be stricter (course rubric), its internal habits barely change, showing prompts alone won't fix judging. These findings point toward process supervision and better training for future AI judges.
03 Methodology
At a high level: Problems → [Solutions from multiple solver models] → [Human expert scores (ground truth) using refined rubrics] → [AI judges grade the same solutions under both rubrics] → [Compare AI vs. human to measure alignment gaps and biases].
Step 1: Problem Curation and Anti-Leakage
- What happens: The team gathered 272 upper-undergrad/early-grad proof problems across ten fields (like Analysis, Abstract Algebra, Combinatorics, Graph Theory, ODEs). They rewrote problems to avoid simple memorization, and they ran an adversarial web search to flag problems with existing solutions online.
- Why this step exists: If models just memorize old solutions, youâre not testing reasoning. You need fresh wording and careful checks.
- Example: Suppose a classic Graph Theory problem asks for a path construction. They rephrase it but keep the same mathematical heart, then scan the web to ensure no identical solution exists. If a match appears, it's tagged; later analysis shows that even with some online matches, performance differences were statistically tiny, suggesting real reasoning drove success.
Hook: Imagine giving your friend a puzzle and first making sure they've never seen that exact puzzle before.
The Concept (Adversarial Data Curation): What it is: Designing and checking problems so that solving them needs genuine reasoning, not recall. How it works: 1) Source high-quality problems. 2) Reword them carefully. 3) Use automated deep search to detect close matches online. 4) Tag or remove suspicious items. Why it matters: This keeps the benchmark honest; wins come from thinking, not remembering. Anchor: If a puzzle is reworded and no matching solution is online, a solver's success likely reflects real skill.
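As a toy illustration of step 3, overlap with known web text could be flagged with a crude word-level Jaccard similarity. The paper's actual pipeline uses adversarial deep web search, so everything below (the helper names, the threshold, the example sentences) is an assumed stand-in proxy, not the real decontamination tool:

```python
# Toy stand-in for the decontamination check: flag a reworded problem if
# its wording is still too close to a known online statement/solution.
# Word-set Jaccard overlap is only an illustrative proxy.

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def is_suspicious(problem: str, web_hits: list[str], threshold: float = 0.8) -> bool:
    """Tag the problem if any web hit overlaps above the threshold."""
    return any(jaccard(problem, hit) >= threshold for hit in web_hits)

original = "prove every connected graph on n vertices has a spanning tree"
reworded = "show that any connected graph with n nodes contains a spanning tree"
print(is_suspicious(reworded, [original]))  # reworded enough -> not flagged
```

A real system needs far more than lexical overlap (paraphrase and translation both defeat it), which is why the paper leans on deep search plus human tagging.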
Step 2: Solution Generation (5 Solvers)
- What happens: Five cutting-edge solver models each attempt all problems, with generous token limits and retries if needed. If a model repeatedly fails to produce a proof (e.g., timeout), that attempt gets a 0, reflecting real-world deployment.
- Why this step exists: You need a diverse pool of solutionsâgood, bad, and in-betweenâto fairly test judges.
- Example: One solver writes a long Analysis proof that looks stylish but mixes up uniform and pointwise convergence. Another solver for Combinatorics tries to construct a bijection but leaves a gap. These varied attempts are perfect for testing how well judges catch subtle mistakes.
Step 3: Human Expert Annotation & Rubric Refinement
- What happens: 48 experts (PhD holders/candidates) score solutions using a fine-grained scale (like 0, 0.25, 0.5, 0.75, 0.9, 1.0). They use two types of rubrics: course-specific (pedagogical constraints) and expert-correctness. Drafts of rubrics were generated by a model, then iteratively refined by humans until they matched expert standards.
- Why this step exists: Human scores become the ground truth. The two rubrics let us separate âfollowed course rulesâ from âis logically sound at expert level.â
- Example: A proof that's basically correct but skips a small edge case might earn 0.9. A proof that looks nice but hides a fatal logical gap might get 0.5.
Hook: Think of coaches marking exactly where a gymnast wobbled on the beam, not just saying "good" or "bad."
The Concept (Expert Annotation): What it is: Detailed human grading that highlights the precise strengths and mistakes. How it works: 1) Experts score using a clear scale. 2) They give justifications and mark error locations. 3) They refine rubrics so scores match intended standards. Why it matters: Without a trustworthy human anchor, you can't measure how far AI judges drift. Anchor: If three experts say "this step is circular," we can compare whether AI judges notice or ignore that same flaw.
Step 4: Judge Calibration with the Evaluator–Solver Matrix
- What happens: Seven judge models grade all solver outputs under both rubrics. Now we can compare each judge's scores to human scores, across many solvers and topics.
- Why this step exists: To separate judge behavior (lenient/harsh, domain biases) from solver skill and to measure reliability.
- Example: One judge might pass nearly everything in Combinatorics (too lenient), while another unfairly fails valid Complex Analysis proofs that skip trivial expert steps (too harsh).
Secret Sauce: Measuring the Alignment Gap and Bias Patterns
- Core quantity: the alignment gap between AI and human scores. The paper defines bias per domain and per judge, making it easy to see systematic inflation or punitiveness.
- Simple formula with example: The gap can be summarized by the difference Δ = S_AI - S_human. For example, with S_human = 0.5 and S_AI = 0.9, Δ = +0.4.
- Why it's clever: Instead of arguing over single examples, we see stable, repeatable judge patterns across many problems, like a fingerprint of each judge's habits.
Rubric Insensitivity Check
- What happens: They test whether stricter prompts (course-specific rubric that penalizes forbidden shortcuts) actually change a judgeâs alignment. Surprisingly, alignment metrics barely move; judge habits dominate the prompt.
- Why it matters: This shows you can't fix judging with prompt tweaks alone. You likely need process supervision or fine-tuning on better signals.
- Example: Even when told "penalize advanced theorems used without derivation," a judge still gives similar scores, missing the point that the student skipped key course steps.
Error Mode Diagnostics
- Leniency (False Positives): Passing broken proofs because they look plausible.
- Harshness (False Negatives): Failing valid proofs because they donât match a familiar template.
- Sycophancy Trap: Rewarding authoritative-sounding LaTeX or even irrelevant-but-correct math blocks that don't solve the asked problem.
Quality Control: Data Contamination Analysis
- What happens: They compare performance on problems with and without known online solutions. Differences are tiny and statistically non-significant.
- Why it matters: It suggests results reflect real reasoning, not just memorization.
- Example: Across thousands of model–problem pairs, mean scores and pass rates are nearly the same whether or not a solution exists online.
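The spirit of this check can be sketched with a simple permutation test: shuffle the "has an online solution" labels many times and ask how often a mean gap at least as large as the observed one appears by chance. All scores below are made up, and the paper's exact statistical procedure is not specified here:

```python
# Hedged sketch of the contamination check: compare mean scores on problems
# with vs. without known online solutions via a permutation test.
import random
from statistics import mean

with_online = [0.9, 0.75, 1.0, 0.5, 0.9, 0.75]      # illustrative scores
without_online = [0.75, 0.9, 0.5, 1.0, 0.75, 0.9]   # illustrative scores

def perm_test(a, b, n_iter=10_000, seed=0):
    """Fraction of label shuffles whose mean gap is at least as large as
    the observed one (a two-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled, n = a + b, len(a)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n]) - mean(pooled[n:])) >= observed:
            hits += 1
    return hits / n_iter

p = perm_test(with_online, without_online)
print(p)  # a large p-value = no evidence contamination drives scores
```

A large p-value, as in the paper's comparison, means the with/without difference is indistinguishable from noise.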
In short: QEDBENCH is a carefully staged recipe that starts with honest, varied problems; gathers diverse model solutions; anchors everything with expert human scoring under two rubrics; and then uses a big judge–solver grid to reveal exactly how, where, and why AI judges drift from real mathematical rigor.
04 Experiments & Results
The Test: What did they measure and why?
- They measured how well solver models did on 272 tough proof problems, as scored by human experts. Then they measured how closely seven AI judges' scores matched those human scores, under both the course-specific and the expert-correctness rubrics. This tells us both solver skill and judge reliability.
The Competition: Who/what was compared?
- Solvers: Five frontier models generated proofs. Performance varied by field (Analysis, ODEs, Probability, Abstract Algebra, Discrete Math, Graph Theory, Combinatorics, etc.).
- Judges: Seven frontier evaluators graded all those proofs. Each judge's results were compared with human experts to reveal leniency or harshness, and domain-specific biases.
The Scoreboard (with context):
- Overall solver highlight: Gemini 3.0 Pro stood out with strong average human-evaluated scores and high pass rates across many fields, including very robust results in Graph Theory. This is like getting an A across most subjects. Meanwhile, several other strong solvers that excelled in continuous domains (like ODEs) struggled in discrete, constructive tasks (like Combinatorics), where pass rates could collapse dramatically.
- Discrete–Continuous Divide: In smooth, template-friendly areas (e.g., ODEs, standard Complex Analysis techniques), many models achieved near-perfect pass rates, like consistently following a recipe. But in Combinatorics and some Graph Theory tasks that require inventing concrete objects or counterexamples, several models' pass rates dropped steeply (like going from an A to a D), exposing a core reasoning gap.
- Judge strictness and bias: Human experts set a baseline for who "passes." Some AI judges passed far more proofs than humans (grade inflation), especially in discrete fields where fancy-looking but flawed logic is common. The worst inflators were like referees who rarely call fouls; they let a lot of bad play slide. Other judges were too harsh in continuous areas, penalizing acceptable expert shortcuts that humans would allow.
Surprising Findings:
- Big Leniency in Some Judges: One judge showed extremely high false-positive behavior, passing a large share of flawed proofs. In plain terms, it said "good job!" to many wrong arguments just because they looked tidy and confident.
- Punitive Behavior in Continuous Math: Other judges were stricter than humans in topics like Complex Analysis, where humans often allow implicit expert steps (e.g., standard contour integral steps). These judges demanded overly explicit details, lowering valid scores.
- Rubric Insensitivity: Switching from the expert rubric to the stricter course-specific rubric barely changed alignment for several judges. This suggests that judge behavior comes more from internal habits than from the prompt's instructions.
- The Sycophancy Trap: Judges sometimes rewarded irrelevant or even empty submissions (like formatted error messages) with partial credit simply because the text "looked" mathematical. They also sometimes penalized students who correctly pointed out that the problem's statement was false, preferring prompt-following over mathematical truth.
- Contamination Not Driving Results: Comparing problems with known online solutions to those without showed tiny, statistically insignificant differences in human-scored performance. That means success is mostly real reasoning, not memorization.
Making Numbers Meaningful (examples):
- Think of human experts setting the pass bar. A lenient judge might pass around 9 out of 10 proofs where humans pass about 7 out of 10, like giving out too many A's. This inflates solver reputation.
- For discrete problems, imagine a solver drops from a B to a D compared to its performance in ODEs. That's a big red flag: the model can follow a recipe but struggles with step-by-step construction.
Key Diagnostic Metric (with example):
- The Alignment Gap uses a simple difference: Δ = S_AI - S_human. For example, if a human gives S_human = 0.5 and the AI gives S_AI = 0.9, then Δ = +0.4. That's like nearly two letter grades higher than the teacher's score, a strong sign of inflation.
Takeaways:
- Solver ability is not one-dimensional. Being great at ODEs doesn't imply being great at Combinatorics.
- Judge reliability is now the bottleneck. Without calibrated judges, we can't trust automated pass/fail decisions for advanced proofs.
- Prompt engineering alone won't fix judges. Better training and process-aware supervision are needed so judges focus on logic, not looks.
- QEDBENCH gives the community a public, carefully grounded way to see and reduce these gaps.
05 Discussion & Limitations
Limitations:
- Single Ground Truth: Even expert rubrics can miss unusual but valid proofs. So while human consensus is a strong anchor, it's not omniscient.
- English Focus: Problems and solutions are in English; behavior could differ across languages.
- Model Involvement: A powerful model helped draft initial rubrics (humans refined them), which could slightly advantage a similar-family judge.
- Static Snapshots: The benchmark reflects a moment in time; new models may shift the landscape and require new audits.
Required Resources:
- Expert Time: Building and maintaining a human-anchored benchmark took 1,000+ expert hours.
- Compute and APIs: Generating long proofs and running many judge–solver pairs uses significant tokens and timeouts, especially with streaming APIs.
- Curation Workflow: Adversarial decontamination and rubric refinement need careful pipelines.
When NOT to Use This Setup:
- If you need instant, high-stakes certification (e.g., publishable theorems), don't rely solely on LLM judges; use formal proof assistants or human review.
- If your domain is outside the ten covered areas (e.g., deep algebraic geometry), don't assume the same patterns hold without testing.
- If you can't afford human anchoring, be cautious: raw judge scores can be misleading.
Open Questions:
- Process Supervision: How can we train judges to track global logical dependencies, not just local plausibility? Can step-checkers with explicit error attribution reduce leniency?
- Discrete Constructive Skills: What training signals help solvers build or refute specific combinatorial objects reliably?
- Judge Robustness: How do we immunize judges against sycophancy (rewarding style over truth) and against prompt priming or one-token adversarial triggers?
- Calibration Across Domains: Can we learn per-domain correction maps that make a single judge reliably human-aligned across Analysis, Graph Theory, etc.?
Honest Assessment: QEDBENCH convincingly shows that verification is the new frontier: models can write convincing math, but judges too often mistake confidence for correctness. The discrete–continuous divide reveals that "reasoning" isn't one skill; template following differs from constructive search. The dual-rubric, multi-judge design is a strong, practical step toward trustworthy evaluation, but widespread adoption will still require better-trained judges, process-aware checking, and continued human oversight.
06 Conclusion & Future Work
Three-Sentence Summary:
- QEDBENCH shows that automated graders for university-level math proofs systematically diverge from human experts and often reward persuasive but flawed logic, especially in discrete, constructive domains.
- By combining two rubrics and a comprehensive judge–solver matrix anchored in 1,000+ hours of expert scoring, the benchmark precisely measures where and how judges go off track.
- Results point to a clear need for process-aware, better-trained judges; prompt tweaks alone won't fix the problem.
Main Achievement:
- The paper delivers a rigorous, public benchmarkâQEDBENCHâthat quantifies the alignment gap for advanced proofs and exposes domain-specific patterns like grade inflation in combinatorics and punitiveness in complex analysis.
Future Directions:
- Train process-supervised judges that track global logic, not just local style, and that reduce both leniency (false positives) and harshness (false negatives).
- Develop per-domain calibration so one judge can be human-aligned across Analysis, ODEs, Combinatorics, and beyond.
- Integrate formal verification tools where feasible and build hybrid pipelines that escalate uncertain cases to humans.
Why Remember This:
- Generation isnât the bottleneck anymoreâverification is. QEDBENCH gives the community a trustworthy way to see the problem clearly and to measure progress. With it, we can move from rewarding âhallucinated rigorâ to rewarding real mathematical correctnessâbringing AI grading closer to the way great teachers think.
Practical Applications
- Calibrate an AI judge: Measure its per-topic bias against QEDBENCH and adjust scores accordingly.
- Select safer judges: Choose the evaluator with the smallest alignment gap for your math course or platform.
- Curriculum-aligned grading: Use the course-specific rubric mode to catch forbidden shortcuts in homework.
- Research vetting: Run proofs through expert-style grading to flag hidden logical gaps before sharing results.
- Train better reward models: Use human-aligned labels from QEDBENCH to reduce sycophancy and leniency.
- Mixed human–AI pipelines: Auto-grade easy cases and escalate uncertain ones based on measured judge reliability.
- Contest evaluation: Benchmark judges on combinatorics/graph theory to avoid passing flawed solutions en masse.
- Model diagnostics: Compare solver performance across discrete vs. continuous topics to guide fine-tuning.
- Judge stress tests: Probe judges with irrelevant or adversarially formatted responses and measure false positives.
- Benchmark updates: Re-run new judges and solvers regularly to track progress and prevent regressions.
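As one illustration of the mixed human–AI pipeline idea above, a router could auto-accept an AI judge's verdict only in domains where its measured alignment gap is small, and escalate the rest to a person. The reliability numbers, domain names, and `route` helper below are all hypothetical, a sketch of the escalation pattern rather than anything from the paper:

```python
# Illustrative mixed human-AI grading router: auto-grade where the judge is
# known to align well with humans, escalate elsewhere. Numbers are made up.

JUDGE_RELIABILITY = {        # measured |alignment gap| per domain
    "ODEs": 0.03,            # well-aligned: safe to auto-grade
    "Combinatorics": 0.20,   # known inflation: send to a human
}

def route(domain, ai_score, max_gap=0.05):
    """Return ("auto", score) when the judge is trustworthy in this domain,
    otherwise ("human_review", None). Unknown domains default to humans."""
    if JUDGE_RELIABILITY.get(domain, 1.0) <= max_gap:
        return ("auto", ai_score)
    return ("human_review", None)

print(route("ODEs", 0.9))           # trusted domain -> auto-accept
print(route("Combinatorics", 0.9))  # inflated domain -> escalate
```

Defaulting unknown domains to human review reflects the paper's caution: without a measured gap, a raw judge score should not be trusted.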