
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Intermediate
Chongyang Gao, Diji Yang, Shuyan Zhou et al. Ā· 2/23/2026
arXiv

Key Summary

  • CFE-BENCH is a new, teacher-verified "Classroom Final Exam" for AI that uses real college STEM problems to test deep, step-by-step reasoning.
  • It checks answers using a strict variable-based verification system, so models can’t pass just by writing long, convincing explanations.
  • The benchmark covers text and images (multimodal) across 20+ subjects, including physics, math, and several engineering fields.
  • Even top models struggle: the best model reached only 59.69% question accuracy, showing lots of room to improve.
  • A step-by-step diagnostic shows that models can usually do single steps when told exactly what to compute, but they fail to keep intermediate results correct in long chains.
  • Injecting just one correct intermediate answer often helps almost as much as giving many hint-questions, showing that correct middle steps are the real key.
  • Models take more steps than instructors to reach answers, so their reasoning is less efficient and creates more chances for mistakes.
  • The evaluation protocol (S2S) reduces false positives compared to long-answer matching (L2L), aligning better with human expert grading.
  • CFE-BENCH provides a realistic, unsaturated testbed to guide training methods that verify and reward correct intermediate steps, not just final answers.

Why This Research Matters

In real life, science and engineering hinge on getting the middle steps right, not just telling a nice story at the end. CFE-BENCH pushes AI to be dependable in exactly those middle steps, using teacher-verified problems and strict variable checks. This means better AI for tutoring, homework help, lab analysis, and design tasks where a single wrong intermediate value can cause failure. It also promotes training methods that reward step-by-step correctness and shorter, cleaner solutions, which are safer and easier to trust. As AI enters classrooms and workplaces, this benchmark guides improvements that make models genuinely useful, not just persuasive. It highlights where to focus: compute and preserve correct intermediate results. The result is AI that’s a steadier teammate for STEM learning and problem-solving.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook) You know how a good school test doesn’t just ask for the final answer, but also checks how you got there? That’s because getting the middle steps right shows you really understand the problem, not just the ending.

🄬 Filling (The Actual Concept)

  • What it is: Before this paper, many AI tests (benchmarks) were either getting too easy for top models or didn’t carefully check the middle steps of reasoning.
  • How it works (the world before):
    1. We had many benchmarks where models picked choices (A/B/C/D) or wrote long explanations.
    2. Judges often compared the whole explanation to a reference solution.
    3. If the explanation sounded good, it could be marked correct even if the final values were wrong.
  • Why it matters: Without precise checks, AI can look better at STEM reasoning than it truly is, which can be risky for science, engineering, and education.

šŸž Bottom Bread (Anchor) Imagine grading a math test by skimming a student’s paragraph and saying, ā€œSeems right.ā€ That would miss if they computed 7Ɨ87\times 87Ɨ8 as 545454 instead of 565656. We need exact checks, not vibes.

šŸž Top Bread (Hook) Think of building a LEGO model: if one brick in the middle is wrong, the top might wobble or fall, even if the first and last bricks look fine.

🄬 Filling (New Concept: Multi-step reasoning)

  • What it is: Multi-step reasoning is solving a problem through a chain of smaller steps, where each step feeds into the next.
  • How it works:
    1. Break the big problem into smaller steps.
    2. Solve each step carefully and record the result.
    3. Use previous results to advance until the final answer is reached.
  • Why it matters: If one middle result is wrong, every later step can go off-track.

šŸž Bottom Bread (Anchor) In physics, if you first compute the force wrong, your acceleration and time answers will be wrong too.

šŸž Top Bread (Hook) You know how some homework needs both reading the text and looking at a diagram? You can’t solve it with text only.

🄬 Filling (New Concept: Multimodal benchmark)

  • What it is: A multimodal benchmark tests AI on questions that use different types of inputs, like text and images.
  • How it works:
    1. Include problems that require diagrams, charts, or circuit schematics plus text.
    2. Ensure the image actually matters and can change the answer.
    3. Test whether the AI combines what it reads with what it sees.
  • Why it matters: Real STEM problems often need both words and visuals. Ignoring images gives an unrealistic picture of ability.

šŸž Bottom Bread (Anchor) A free-body diagram in mechanics can be essential; without seeing it, you might miss a force and get the wrong answer.

šŸž Top Bread (Hook) If two students say the same correct number in different ways (like 0.5 and 1/2), a fair teacher still gives full credit.

🄬 Filling (New Concept: Benchmark saturation and judging problems)

  • What it is: Many existing tests are close to maxed out by frontier AI, and judging by long explanations can be fooled.
  • How it works:
    1. Models memorize patterns or write fluent text that ā€œfeelsā€ correct.
    2. Long-to-long comparison can miss wrong final values hidden inside nice writing.
    3. This creates false positives: solutions marked correct when they aren’t.
  • Why it matters: If we think models are better at reasoning than they are, we may trust them too much in critical situations.

šŸž Bottom Bread (Anchor) A calculator with a cracked screen might seem okay—until you notice it keeps giving slightly wrong results. Smooth talk can hide small math mistakes too.

Given this, the paper creates CFE-BENCH, a new, instructor-tested benchmark sourced from real college STEM coursework, and pairs it with a stricter way to check answers so we reward true reasoning, not just nice words.

02Core Idea

šŸž Top Bread (Hook) Imagine a science fair where judges don’t just admire the poster’s neatness—they check each measurement, calculation, and the final conclusion.

🄬 Filling (New Concept: CFE-BENCH)

  • What it is: CFE-BENCH is a new, realistic ā€œClassroom Final Examā€ for AI that uses real university-level STEM problems and checks the exact target values or formulas the answer must contain.
  • How it works:
    1. Collect real, instructor-verified problems (text and images) from many STEM fields.
    2. For each problem, annotate the exact variables the final answer must include (like x, y, or a formula) and their ground-truth values.
    3. Grade model responses by first extracting those variables from the response, then verifying them against the truth.
  • Why it matters: This avoids being fooled by long, nice-sounding explanations and focuses on whether the model produced the right, checkable results.

šŸž Bottom Bread (Anchor) A control systems question might require the exact transfer function. CFE-BENCH checks whether the function the model gives is mathematically the same as the instructor’s—not just whether the essay sounds smart.

šŸž Top Bread (Hook) You know how a treasure map has checkpoints? If you reach each checkpoint correctly, you’ll find the treasure; miss one and you get lost.

🄬 Filling (New Concept: Reasoning flow)

  • What it is: A reasoning flow is the ordered list of small, verifiable steps used to solve a problem.
  • How it works:
    1. Break the instructor’s solution into bite-sized sub-questions and their correct answers.
    2. Test whether a model can do each step when asked directly (atomic skill).
    3. Test if it can chain steps without drifting off (composition skill).
  • Why it matters: It reveals whether the model fails because it can’t do a single step, or because it loses track over many steps.

šŸž Bottom Bread (Anchor) In a circuits problem: Step 1—identify the components; Step 2—write KCL; Step 3—solve for current; Step 4—compute power. Each step is checked separately.

šŸž Top Bread (Hook) Think of baking: getting the cake right depends on intermediate steps—mixing, measuring, preheating—not just the frosting at the end.

🄬 Filling (New Concept: Intermediate states)

  • What it is: Intermediate states are the middle results (numbers or expressions) you carry from one step to the next.
  • How it works:
    1. Compute a middle value (like an acceleration).
    2. Use it to find the next value (like velocity).
    3. Keep using it, without accidentally changing it, until the end.
  • Why it matters: If an intermediate value is wrong or drifts, the final answer collapses.

šŸž Bottom Bread (Anchor) If you miscompute the slope in a calculus problem, all later integrals and evaluations go wrong.

šŸž Top Bread (Hook) If a coach wants to know who won a race, they look at the clocked times, not just how confident the runner sounded.

🄬 Filling (New Concept: Variable-based verification protocol)

  • What it is: A strict grading method that extracts the specific target variables from a model’s answer and directly compares them to ground truth.
  • How it works:
    1. Annotate variables with names, types (numeric, formula, other), and descriptions.
    2. Use a judge to extract the model’s values for those variables.
    3. Verify equivalence to the ground truth (numeric and algebraic).
  • Why it matters: It cuts false positives from long, fluent text that still hides the wrong numbers.

šŸž Bottom Bread (Anchor) If the ground truth says the minimum distance is LmML\sqrt{\tfrac{m}{M}}LMm​​ and the model says LmM\tfrac{L\sqrt{m}}{\sqrt{M}}M​Lm​​, the protocol accepts it as equivalent. If it says LmML\tfrac{m}{M}LMm​, it’s marked wrong.

Multiple Analogies for the Key Insight

  1. School Locker Combo: You must enter each number exactly in order (intermediate states). If one number’s off, the lock won’t open.
  2. GPS Directions: Missing one turn (a critical step) sends you far off route—even if you follow later steps perfectly.
  3. Domino Run: If a key domino (single unit) doesn’t fall correctly, the whole chain fails; tipping that one fixes most of the run.

Before vs After

  • Before: AI could look great by writing long explanations or by acing multiple-choice tests; real multi-step reliability stayed hidden.
  • After: With CFE-BENCH and variable checks, we see that models often know the steps in isolation but can’t maintain correct intermediate values across long chains.

Why It Works (Intuition)

  • Checking the exact target variables focuses evaluation on what’s undeniably right or wrong.
  • Decomposing solutions into reasoning flows pinpoints whether failures are atomic (one step) or compositional (linking steps).
  • Testing single-unit injections shows that correct intermediate answers unlock downstream reasoning—so middle steps matter most.

Building Blocks

  • Real classroom problems (authenticity and subject breadth).
  • Variable-based verification (precise, robust grading).
  • Reasoning flow decomposition (stepwise diagnostics).
  • Unit execution tests (atomic skills).
  • Prefix vs single-unit conditioning (what kind of help truly helps).

03Methodology

Overview (At a high level) Input (real STEM questions) → Collection and filtering → Expert annotation (variables and reasoning flow) → Model answering (chain-of-thought) → Variable extraction (S2S) → Verification (numeric/formula equivalence) → Diagnostics (unit execution, prefixes, injections) → Output (accuracies and insights).

šŸž Top Bread (Hook) Imagine organizing a school science decathlon: you pick fair events, write clear rules, decide how to score, and then analyze who did well and why.

🄬 Filling (New Concept: How CFE-BENCH is built and scored)

  • What it is: A pipeline that gathers real problems, makes them clear and testable, and scores models on exact variables.
  • How it works, step by step:
    1. Collection
      • Gather problems from real university courses (exams, quizzes, homework) across 20+ STEM subjects.
      • Clean text, standardize symbols and units, and remove duplicates.
      • Keep only problems that require non-trivial multi-step reasoning and have objectively checkable targets.
    2. Expert Review and Annotation
      • Graduate-level annotators verify clarity, difficulty, and evaluability.
      • Each problem gets a set of ground-truth target variables: name, description, type (numeric/formula/other), and value.
      • For multimodal items, confirm that the image truly affects the answer.
    3. Reasoning Flow Construction
      • Decompose instructor solutions into ordered, verifiable units: a sub-question and its answer per step.
      • Ensure each step depends only on the question and prior steps.
    4. Model Inference
      • Ask models to solve with chain-of-thought enabled; keep decoding settings consistent and reasonable.
    5. Variable Extraction (S2S)
      • Use a judge to pull just the required variables from the model’s response.
      • Compare against ground truth with numeric tolerance and algebraic equivalence.
    6. Metrics
      • Variable Accuracy (partial credit across variables).
      • Question Accuracy (all-or-nothing by question).
    7. Diagnostics
      • Unit execution tests isolate single-step competence.
      • Reasoning prefixes test how much given context helps.
      • Single-unit injections test how much one correct intermediate answer helps.
  • Why it matters: This pipeline creates a fair, realistic, and sharp test that reveals true reasoning strengths and weaknesses.

šŸž Bottom Bread (Anchor) It’s like a lab practical: you collect real tasks, write rubrics for each measurement, then check not only the final result but whether each setup and calculation step is correct.

Key Formulas and Examples

  • Variable Accuracy: $VA = \frac{1}{N} \sum_{j=1}^{N} \frac{c_j}{n_j}$, where $c_j$ is the number of correct variables and $n_j$ the number of required variables in question $j$. Example: with $N=2$ questions, $(c_1=2, n_1=4)$ and $(c_2=3, n_2=3)$, $VA = \tfrac{1}{2}\left(\tfrac{2}{4}+\tfrac{3}{3}\right) = \tfrac{1}{2}(0.5+1) = 0.75$.
  • Question Accuracy: $QA = \frac{\#\,\text{correct questions}}{N}$. Example: if $N=10$ and 6 questions have all variables correct, then $QA = \tfrac{6}{10} = 0.6$.
  • Unit Execution Accuracy at step $i$: $UEA(i) = \frac{\#\,\text{correct at step } i}{\#\,\text{evaluations at step } i}$. Example: if step $i$ was tested 50 times and 42 were correct, $UEA(i) = \tfrac{42}{50} = 0.84$.
  • Step Inflation (inefficiency): $r = \frac{\ell_{\text{model}} - \ell_{\text{GT}}}{\ell_{\text{GT}}}$. Example: if the model’s solution length is 12.20 steps and the ground truth’s is 10.73, then $r = \tfrac{12.20 - 10.73}{10.73} \approx 0.137$ (about 13.7% longer).
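All of these metrics are straightforward to compute once each question has been graded into a (correct, total) pair; here is a minimal sketch that reproduces the worked examples above.

```python
def variable_accuracy(results: list[tuple[int, int]]) -> float:
    """VA: mean per-question fraction of correct variables."""
    return sum(c / n for c, n in results) / len(results)

def question_accuracy(results: list[tuple[int, int]]) -> float:
    """QA: fraction of questions with every required variable correct."""
    return sum(c == n for c, n in results) / len(results)

def step_inflation(model_len: float, gt_len: float) -> float:
    """r: relative extra steps versus the instructor solution."""
    return (model_len - gt_len) / gt_len

results = [(2, 4), (3, 3)]                     # the VA example above
print(variable_accuracy(results))              # 0.75
print(question_accuracy(results))              # 0.5 (only the second is fully correct)
print(round(step_inflation(12.20, 10.73), 3))  # 0.137, i.e. about 13.7% longer
```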

šŸž Top Bread (Hook) You know how hints work better when the hint gives a needed fact, not just says ā€œthink about step 3ā€?

🄬 Filling (New Concepts: Prefix vs. Single-Unit Injection)

  • What they are: Two ways to help the model during diagnostics.
  • How they work:
    1. Reasoning Prefix: Provide a chunk of earlier steps (with or without answers) before asking for the final answer.
    2. Single-Unit Injection: Provide just one step (especially with its correct answer) and see the boost.
  • Why it matters: It shows whether models mainly need the structure (questions) or the concrete, correct intermediate values.

šŸž Bottom Bread (Anchor) If you tell a student ā€œUse conservation of energy and here is the potential at this point,ā€ their chance of finishing correctly often jumps more than if you just say ā€œTry thinking about energy.ā€

Secret Sauce (What makes it clever)

  • Teacher-grade answers: Real, re-used classroom problems and verified solutions.
  • Tight grading: Extract-and-verify target variables to reduce false positives.
  • X-ray view of thinking: Reasoning flows let us test atomic skill versus chaining skill.
  • Minimal yet mighty hints: One correct intermediate value can rescue the whole chain.
  • Efficiency lens: Counting steps detects wasteful reasoning paths that cause error cascades.

04Experiments & Results

šŸž Top Bread (Hook) Imagine two teams taking the same tough exam. We don’t just look at final grades; we also check who handled diagrams correctly, who kept their calculations straight, and who needed extra steps to finish.

🄬 Filling (New Concept: The Test)

  • What it is: Models are tested on text-only and multimodal (text+image) STEM problems using strict variable-based grading.
  • How it works:
    1. Each problem has annotated target variables with types and ground-truth values.
    2. A judge extracts the model’s claimed values and verifies equivalence.
    3. We compute Variable Accuracy (partial progress) and Question Accuracy (all-or-nothing).
  • Why it matters: This gives a fair, fine-grained scoreboard, not just a vibe-based pass.

šŸž Bottom Bread (Anchor) A control-systems problem demands the exact transfer function form; a physics problem demands the precise symbol expression for a distance. Both are scored against the correct variables.

The Competition (Who/what)

  • Open-weight models like Qwen 3.5 and DeepSeek.
  • Proprietary models like Gemini, GPT-5.2, Claude, and Grok.

Scoreboard with Context

  • On the combined text+multimodal set, the best model achieved 59.69% question accuracy. That’s like getting an A- in a class where most top students are still getting Bs—and no one is near a perfect score.
  • On text-only, leaders performed stronger, with a top tier in the mid-to-high 50% range for question accuracy.
  • On multimodal, everyone dipped: even strong models hovered in the mid-to-high 40% range for question accuracy, with a few proprietary leaders slightly higher.
  • The gap between Variable Accuracy and Question Accuracy was typically about 5–7 points, meaning models often got some variables right but missed at least one required piece.

Surprising (and Important) Findings

  1. High atomic skill, low chain reliability
    • Unit execution accuracy is often around 0.8–0.9 for many steps, showing models can answer correctly when the sub-problem is explicitly asked.
    • But end-to-end scores are much lower; models struggle to maintain correct intermediate states across long derivations.
  2. One correct intermediate value can be a game-changer
    • Single-unit injections (with answers) often boost final accuracy nearly as much as long prefixes (without answers). This indicates that the main bottleneck is producing and preserving correct intermediate values, not just knowing which questions to ask.
  3. Reasoning inefficiency shows up as longer chains
    • Model-generated solutions use more steps than instructor solutions in both text and multimodal subsets (about 14% and 18% longer on average, respectively). More steps mean more chances for drift and mistakes.

Making Numbers Meaningful

  • If a student solves 6 out of 10 multi-variable questions perfectly, their QA = 60%; but if they averaged 75% of the variables correct per question, their VA = 75%. That gap shows they often missed at least one crucial variable.
  • A top score of 59.69% signals the benchmark is not saturated; there’s real headroom to improve.

What This Tells Us

  • Today’s best models often ā€œknow the movesā€ but can’t keep the ball from slipping in long plays.
  • Training that rewards correct intermediate results (not just final answers) and encourages shorter, cleaner chains should lift both accuracy and efficiency.

05Discussion & Limitations

šŸž Top Bread (Hook) Even great students have weaknesses—maybe they rush steps or forget to double-check a middle calculation. Knowing this helps teachers coach better.

🄬 Filling (New Concept: Honest assessment)

  • Limitations (what this can’t do)
    1. It covers many STEM domains but not all; future courses or new styles of problems may require updates.
    2. It focuses on objectively verifiable targets; open-ended lab or essay-style reasoning is mostly out of scope.
    3. It relies on a judge model for extraction and verification; while validated, judge errors are still possible (though reduced).
    4. Results reflect current top models; future models might shift difficulty or reveal new failure modes.
  • Required resources (to use it well)
    1. Access to models that can produce detailed reasoning.
    2. The provided variable annotations and reasoning flows.
    3. The S2S extraction and verification pipeline with a reliable judge model.
  • When not to use (situations where it may fall short)
    1. Pure creativity or wide-open exploration tasks.
    2. Problems that depend on physical experiments or subjective grading.
    3. Tasks where partial credit must be richly nuanced beyond variable correctness.
  • Open questions (what we still don’t know)
    1. How best to train models to maintain intermediate states over very long chains across diverse topics.
    2. How to blend tools (symbolic solvers, calculators, retrieval) without overfitting to benchmark specifics.
    3. How to design training signals that reward both correctness and step efficiency simultaneously.
    4. How robust S2S remains as models produce more compact or more implicit answers.

šŸž Bottom Bread (Anchor) Think of this like a new kind of practice test that spotlights where a runner trips during a relay. We now see the baton drops in the middle—so the next coaching step is clear: practice the handoffs (intermediate states), not just the sprint finish.

06Conclusion & Future Work

Three-sentence summary

  • CFE-BENCH is a teacher-verified, multimodal benchmark built from real college STEM problems, paired with a strict variable-based verification protocol that grades exact target values and formulas.
  • Diagnostics show that while models can usually do individual steps when asked directly, they fail to reliably derive and preserve correct intermediate states across long chains and tend to reason inefficiently with too many steps.
  • The benchmark is unsaturated—top scores are around 59.69%—making it a strong, realistic testbed for improving true reasoning.

Main achievement

  • Turning realistic classroom problems into a rigorous, step-diagnostic benchmark with variable-based grading that reduces false positives and reveals the central importance of correct intermediate answers.

Future directions

  • Train with verified intermediate supervision, integrate trusted tools for computing key middle results, and add objectives that reward compact, efficient chains.
  • Explore broader domains and more image-heavy problems, and keep evaluating judge robustness as models evolve.

Why remember this

  • CFE-BENCH shifts the focus from sounding smart to being right at every step. It shows that the heart of reliable STEM reasoning is getting the middle right—and keeping it right—until the very end.

Practical Applications

  • Build STEM tutors that grade and give feedback on specific wrong variables, not just final answers.
  • Train models with verified intermediate targets to reduce algebra slips in long derivations.
  • Integrate symbolic solvers or calculators to compute key intermediate values reliably.
  • Design classroom assignments that include variable-based rubrics for faster, fairer grading.
  • Create debugging tools that flag where a model’s reasoning first drifted from the correct flow.
  • Use single-unit injections in tutoring to deliver the smallest helpful hint (a key intermediate) that unlocks success.
  • Benchmark multimodal reasoning in labs where diagrams and plots matter for correct conclusions.
  • Optimize inference strategies to prefer shorter, efficient chains (fewer steps, less error drift).
  • Develop curricula that reward correct intermediate states to build robust problem-solving habits.
  • Adopt S2S verification in production systems to prevent false positives from persuasive but wrong explanations.
#CFE-BENCH #variable-based verification #reasoning flow #multi-step reasoning #multimodal benchmark #STEM evaluation #false positives #step efficiency #unit execution #reasoning prefix #single-unit injection #S2S vs L2L #diagnostic analysis #intermediate states #benchmark saturation