Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision | How I Study AI

Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision

Intermediate
Xiaohan He, Shiyang Feng, Songtao Huang et al. · 2/12/2026
arXiv

Key Summary

  • Sci-CoE is a two-stage training method that teaches one language model both to solve science problems and to check those solutions, using very little labeled data.
  • Stage 1 (Anchored Learning) uses a tiny set of answer-key questions to establish basic ideas of what 'correct' looks like for both solving and verifying.
  • Stage 2 (Unsupervised Co-evolution) removes answer keys and lets the Solver and Verifier improve each other using agreement signals and a special geometric reward.
  • The geometric reward keeps verification strategies both reliable (not drifting off-topic) and diverse (covering many different checks), so the system doesn’t collapse to one narrow way of judging.
  • Across tough science benchmarks (MMLU-Pro, GPQA-Diamond, UGPhysics), Sci-CoE improves accuracy over strong base models, including a 4.04-point jump on GPQA-Diamond for Qwen3-8B.
  • With more unlabeled data (18k to 30k questions), performance keeps rising, showing strong scalability without needing more labels.
  • Ablation studies show the Stage 1 anchors are essential, and the geometric reward clearly beats a naive consensus-only reward by preventing strategy homogeneity.
  • At inference time, the learned Verifier helps pick the best answer among many (Best-of-N), boosting practical reliability.
  • Sci-CoE still needs an external judge model during training to apply verification strategies, which adds compute and potential bias, but the approach remains label-efficient.
  • Overall, Sci-CoE offers a path to stronger, more trustworthy scientific reasoning in LLMs with minimal human supervision.

Why This Research Matters

Science questions often don’t have simple checkers, so training an AI that can both solve and fairly judge solutions unlocks safer, more helpful tools. With Sci-CoE, we can use tiny labeled sets and lots of unlabeled problems to grow a model that understands not just answers, but the reasoning checks behind them. This empowers better homework help, more reliable lab calculations, and stronger decision support in fields like engineering and medicine. The geometric reward keeps the AI’s checking habits broad and steady, so it doesn’t miss important types of mistakes. As more unlabeled science data becomes available, performance can keep rising without huge labeling costs. That means faster progress for education, research, and industry with fewer barriers.

Detailed Explanation


01Background & Problem Definition

You know how in science class, you don’t just give an answer—you also explain how you got it, and someone checks if your steps make sense? Before this paper, large language models (LLMs) were decent at giving answers in math or code because those areas have clear checkers (like unit tests for code or exact answers for math). But in open-ended science problems, there often isn’t a single easy checker. Different correct paths exist, and checking them can require subject-matter judgment.

What the world looked like before: LLMs had started to show impressive reasoning when paired with reinforcement learning (RL). Some systems could even improve by playing against themselves (self-play). This worked well in coding (where tests verify behavior) and in certain math tasks (where exact answers exist). In these domains, feedback is clean: pass a unit test or match the numeric answer, and you’re correct. Because of that clean feedback, models could evolve quickly and reliably.

The problem: Scientific reasoning is messier. Two people might solve a physics or chemistry question with different steps and still be right. Evaluating such solutions often requires expert checks like, “Do the units make sense?”, “Is this consistent with thermodynamics?”, or “Does the logic hold in every step?” Building huge labeled sets where each problem has a perfect answer key and a detailed verification rubric is very expensive. Without reliable checks, a model can accidentally learn to be more confident in wrong answers because it doesn’t get clear signals telling it what’s wrong.

What people tried: Researchers explored self-play setups where a model generates and then verifies its own work, which cut down on the need for labels. In code, this shines because the verifier (unit tests) is solid. But in general science, early attempts struggled. Without strong signals, verifiers became too samey (not diverse), or they rewarded shallow tricks (like format checks) instead of deep scientific logic. Other works tried to use internal consistency alone, but that can lead to echo chambers where the model agrees with itself without truly being correct.

The missing piece (the gap): We needed a way to train both the solver and the checker together that (1) works with very few labeled problems, (2) doesn’t fall apart when there’s no answer key, and (3) keeps verification strategies both reliable and varied so they don’t collapse into one narrow way of judging. In other words, we needed a way to replace exact answer keys with something just as steady, but more flexible—especially for open-ended science.

Real stakes: If we can trust AI to reason about science more robustly, it can help students study, speed up research brainstorming, assist doctors and engineers in checking steps, and help scientists avoid costly mistakes. On the flip side, if we can’t verify reasoning well, an AI might sound confident yet suggest a wrong equation, a mismatched unit, or a flawed logic chain—leading to confusion or even danger in real-world settings.

New concepts (with Sandwich explanations):

🍞 Hook: Imagine learning a new game when you only get to watch a few rounds. 🥬 Sparse Supervision: It means teaching a model with very few labeled examples. How it works: you show a handful of problems with correct answers, just enough to set basic rules of right and wrong; then you let the model learn patterns from mostly unlabeled data. Why it matters: Without it, you’d need tons of labeled examples, which is slow and expensive. 🍞 Anchor: Like learning chess by studying five classic games, then playing a lot to improve.

🍞 Hook: Think of training a puppy by giving treats for good behavior. 🥬 Reinforcement Learning (RL): A way for models to learn by trying things and getting rewards. How it works: the model proposes an action (like a solution), gets feedback (a reward), and updates to do better next time. Why it matters: Without rewards, the model has no compass to know what to improve. 🍞 Anchor: The puppy sits, gets a treat, and learns to sit more often.

🍞 Hook: You know how a coach tweaks game plans but not too wildly each time? 🥬 Proximal Policy Optimization (PPO): A safe, steady way to update the model in RL. How it works: try new actions, measure reward, adjust the strategy but keep the change within safe limits. Why it matters: Without PPO-like limits, updates can swing too far and break what already works. 🍞 Anchor: A coach adjusts tactics a bit after each match instead of reinventing the playbook daily.

02Core Idea

🍞 Hook: In class, it helps to have a student who solves problems and a teacher who checks the steps. If both learn together, the whole class gets better. 🥬 The Aha: Train one LLM to be both a Solver (generates scientific answers with steps) and a Verifier (creates strategies to check those answers), and let them co-evolve using a geometric consensus reward that favors reliable and diverse checking even without answer keys. 🍞 Anchor: It’s like learning to write essays while also learning to grade them fairly and from different angles.

Multiple analogies for the same idea:

  1. Two-lens camera: One lens captures the picture (Solver). The other lens checks focus, lighting, and color balance (Verifier). Geometric consensus is the algorithm making sure the second lens doesn’t get stuck checking only brightness; it also checks sharpness, contrast, and color balance so pictures look truly good.
  2. Debate club: One student proposes an argument (Solver). Several judges (Verifier strategies) evaluate it from different rules—logic, evidence, clarity, and consistency. Geometric consensus ensures the judges don’t all use the same rule and don’t drift off-topic.
  3. Science fair: A project is built (Solver) and judged on accuracy, method, creativity, and safety (Verifier strategies). The geometric part spreads judges across these categories so no single criterion dominates.

Before vs After:

  • Before: Models needed lots of labels or clear tests; in science, checks were fuzzy, so training was fragile and often collapsed to shallow tricks.
  • After: With a tiny anchor set, the model bootstraps basic correctness. Then, using unsupervised co-evolution plus geometric rewards, it grows richer verification habits and stronger reasoning without huge answer keys.

Why it works (intuition):

  • Agreement as a stand-in: If many well-designed verification strategies pass a solution, it’s probably good. If only a few pass it, it’s likely weak.
  • Geometry keeps balance: By mapping strategies into a vector space, the system can reward strategies that stay near the stable center (reliable) yet spread around the circle (diverse), covering different scientific checks. This avoids consensus collapse (everyone saying the same thing) and prevents drift (weird, off-topic checks).

Building blocks (each with Sandwich explanations):

🍞 Hook: Picture a group project where one person writes the report and another creates the grading checklist. 🥬 Solver and Verifier Roles: One brain, two jobs—generate solutions (Solver) and design checks (Verifier). How it works: for each question, produce several solutions and several checklists; pair them to see which solutions pass which checks. Why it matters: Without a good Verifier, the Solver can’t learn which reasoning paths are solid. 🍞 Anchor: A student writes a lab report; another student’s rubric ensures the report really follows scientific method.

🍞 Hook: Think of checklists a pilot uses before takeoff. 🥬 Verification Strategies: Written instructions that say how to judge a solution (e.g., check units, reverse the calculation, test boundary conditions). How it works: list the steps to follow, then a judge applies those steps to accept/reject a solution. Why it matters: Without precise strategies, judging becomes fuzzy and inconsistent. 🍞 Anchor: A checklist that confirms an airplane’s fuel, flaps, and instruments before flying.

🍞 Hook: Bees and flowers improve together—each change in one shapes the other. 🥬 Co-evolution: The Solver improves when the Verifier becomes sharper, and the Verifier improves when the Solver proposes better solutions. How it works: they alternate learning from each other’s feedback. Why it matters: Without co-evolution, you might get a great solver with a weak checker, or vice versa. 🍞 Anchor: Sparring partners in sports who make each other stronger over time.

🍞 Hook: Friends voting on a movie are more convincing if they’re using different reasons that make sense. 🥬 Geometric Consensus: A way to score strategies based on where they sit in a map of ideas (embedding space). How it works: strategies near the center are steady; spreading around the circle means diverse viewpoints; you combine these with agreement on which solutions pass. Why it matters: Without it, strategies collapse to one simple check, missing subtle scientific errors. 🍞 Anchor: A committee where judges cover logic, math, units, and real-world sense rather than all judging the same thing.

🍞 Hook: Imagine lining up different rulers to measure length so no single ruler’s flaw ruins the measurement. 🥬 Diversity and Reliability in Verification: Aim for many solid, different checks. How it works: reward checks that stick to the topic and also cover new angles. Why it matters: Without diversity, you miss errors; without reliability, you reward nonsense. 🍞 Anchor: Multiple, trustworthy yardsticks catching different kinds of mistakes.

03Methodology

At a high level: Question → generate N solutions and M verification strategies → cross-check each solution with each strategy using a judge → compute rewards for Solver and Verifier → update one shared model with PPO → repeat.
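The loop above can be sketched in toy Python. All names here are stand-ins for the paper's actual components, the "solutions" and "strategies" are just numbers and predicates, and the PPO update step is omitted:

```python
# Toy sketch of one Sci-CoE round; function names are assumptions,
# and the PPO update is omitted for brevity.
def training_round(gen_solutions, gen_strategies, judge, n=3, m=3):
    solutions = gen_solutions(n)       # Solver role: N candidate solutions
    strategies = gen_strategies(m)     # Verifier role: M verification strategies
    # Cross-check every (strategy, solution) pair into a True/False matrix.
    verdicts = [[judge(st, so) for so in solutions] for st in strategies]
    # Solver-side signal: fraction of strategies that pass each solution.
    solver_rewards = [sum(col) / m for col in zip(*verdicts)]
    return verdicts, solver_rewards

# Toy run: "solutions" are numbers, "strategies" are simple checks.
verdicts, rewards = training_round(
    gen_solutions=lambda n: [4, 7, 10][:n],
    gen_strategies=lambda m: [lambda x: x % 2 == 0,   # parity check
                              lambda x: x > 5,        # magnitude check
                              lambda x: x < 100][:m], # sanity bound
    judge=lambda strategy, solution: strategy(solution))
print(rewards)  # solution 10 passes all three checks
```

In the real system, the judge is a separate LLM applying a written strategy, and the rewards feed a PPO update of the shared model.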

Stage 1: Anchored Learning (tiny labeled set) Goal: Create basic anchors for correctness and trustworthy checking.

  • What happens: Use a small set (about 1%–10%) of questions with answer keys. For each question, the model plays both roles: generate multiple solutions and multiple verification strategies. Use a judge to apply each strategy to each solution (True/False). Reward solutions that match the answer key; reward strategies that accept all correct solutions and reject many incorrect ones. Update the single shared model in two steps (first Solver, then Verifier) for stability.
  • Why this step exists: Without anchors, the system can drift and reinforce wrong habits. Sparse anchors teach the model what correctness roughly looks like and what a good checker should do.
  • Example data: Suppose the ground truth final answer is 42. A solution that reasons soundly and ends with 42 gets reward 1; otherwise 0. A strategy that passes all correct-42 solutions but fails many wrong-answers gets positive reward.
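The example above can be written out as a minimal sketch of the two Stage 1 reward signals, assuming exact-match final answers and a simple fraction-of-correct-calls score for strategies (the paper's exact reward shaping may differ):

```python
def solver_reward(final_answer, answer_key):
    """Stage 1: reward 1 only when the final answer matches the answer key."""
    return 1.0 if final_answer == answer_key else 0.0

def strategy_reward(verdicts, labels):
    """Stage 1 sketch: reward a strategy for accepting correct solutions and
    rejecting incorrect ones; here, the fraction of calls it gets right."""
    return sum(v == l for v, l in zip(verdicts, labels)) / len(labels)

# Ground truth is 42; three solutions end with 42, 41, 42.
labels = [True, False, True]       # which solutions match the answer key
verdicts = [True, False, False]    # this strategy's accept/reject calls
print(solver_reward(42, 42))       # 1.0
print(strategy_reward(verdicts, labels))  # 2 of 3 calls correct
```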

Sandwich concept intros used in this stage:

🍞 Hook: Learning to ride a bike with training wheels first. 🥬 Sequential Optimization: Update Solver first with correctness rewards, then Verifier with discriminative rewards. How it works: two mini-steps per round keep learning steady. Why it matters: Updating both at once early can wobble and crash. 🍞 Anchor: Practice pedaling, then balance drills—don’t do both chaotically at once.

Stage 2: Unsupervised Co-evolution (big unlabeled set) Goal: Scale up without answer keys.

  • What happens: For each new unlabeled question, the model again generates multiple solutions and strategies. The judge applies each strategy to each solution. Now, the Solver’s reward is how many strategies pass its solution (consensus). The Verifier’s reward mixes three parts: (1) consistency with high-consensus solutions, (2) reliability from geometry (stay near cluster center), and (3) diversity from geometry (spread around angles). Update the shared model jointly (Solver and Verifier samples in one PPO batch).
  • Why this step exists: It removes the need for labeled answers while keeping learning stable and meaningful.
  • Example data: If 12 out of 15 strategies pass a solution, it earns a high reward. A strategy close to others in idea-space but not too close (reliable and distinct) earns a good reward.
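The consensus signal for a single solution can be sketched as follows; the 80% threshold here is illustrative:

```python
def consensus(pass_flags, threshold=0.8):
    """Stage 2 sketch: a solution's reward is the fraction of verification
    strategies that pass it; above the threshold it counts as high-consensus."""
    rate = sum(pass_flags) / len(pass_flags)
    return rate, rate >= threshold

# 12 of 15 strategies pass this solution -> high-consensus.
rate, high = consensus([True] * 12 + [False] * 3)
print(rate, high)
```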

Sandwich concept intros for Stage 2 tools:

🍞 Hook: Voting on the best science project by many judges following their rubrics. 🥬 Consensus Threshold: A cutoff (like 80%) that says, “This solution passed most good checks, so let’s treat it as likely good.” How it works: count passes; if above threshold, it’s a high-consensus solution. Why it matters: Prevents a single lenient or harsh strategy from dominating the signal. 🍞 Anchor: If 12 out of 15 judges say “yes,” the project is likely solid.

🍞 Hook: Turning sentences into points on a map so similar ideas sit close together. 🥬 Embedding: Convert each verification strategy into a vector (a point) representing its meaning. How it works: a pretrained embedding model maps text to coordinates. Why it matters: Without a map, we can’t measure reliability or diversity geometrically. 🍞 Anchor: Similar checklists (like two unit checks) land near each other on the map.

🍞 Hook: Sorting marbles into bowls where similar colors cluster together. 🥬 K-means Clustering: Group strategy vectors and find each group’s center. How it works: assign each strategy to the nearest center; recompute centers; repeat. Why it matters: The center becomes a stable reference point for reliability. 🍞 Anchor: The average color of a bowl tells you what that bowl is mainly about.
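A minimal sketch of the reliability idea, simplified to a single cluster (the paper uses K-means with multiple centers): strategies near the centroid of the embedding cloud score higher, outliers score lower.

```python
import numpy as np

def reliability_scores(strategy_vecs):
    """Sketch of the reliability term (single-cluster case): strategies
    close to the cluster center are treated as steady; distance to the
    centroid lowers the score."""
    center = strategy_vecs.mean(axis=0)                 # cluster center (K=1)
    dists = np.linalg.norm(strategy_vecs - center, axis=1)
    return 1.0 / (1.0 + dists)                          # closer -> higher

vecs = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]])
print(reliability_scores(vecs))  # the strategy sitting at the center scores 1.0
```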

🍞 Hook: Looking at a giant painting from far away by sketching a small 2D version. 🥬 PCA (Principal Component Analysis): Squash high-dimensional vectors into 2D to study spread by angles. How it works: find main directions of variation; project points; measure angular spread. Why it matters: Encourages strategies to fan out, covering different evaluation angles. 🍞 Anchor: A 2D map that shows whether judges cover all directions or bunch up.
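The angular-spread idea can be sketched with PCA via SVD; the evenness score below is an illustrative choice, not the paper's exact formula:

```python
import numpy as np

def angular_diversity(strategy_vecs):
    """Sketch of the diversity term: project embeddings to 2D with PCA
    (via SVD), then score how evenly strategies fan out by angle.
    Returns 1.0 for a perfectly even angular spread."""
    X = strategy_vecs - strategy_vecs.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # principal directions
    coords = X @ vt[:2].T                             # 2D projection
    angles = np.sort(np.arctan2(coords[:, 1], coords[:, 0]))
    gaps = np.diff(np.append(angles, angles[0] + 2 * np.pi))
    return gaps.min() / (2 * np.pi / len(angles))     # min gap vs ideal gap

# Four strategies at 90-degree intervals: maximally even spread.
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
print(angular_diversity(vecs))  # 1.0
```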

🍞 Hook: A referee who follows the exact rulebook, not personal hunches. 🥬 External Judge Model: A large LLM used only to apply a specific verification strategy to a solution and return True/False. How it works: it strictly follows the strategy text to decide. Why it matters: Keeps evaluation faithful to the strategy instead of free-form judging. 🍞 Anchor: The referee checks fouls exactly as written, not by vibe.

🍞 Hook: Practicing with many attempts and picking the best one. 🥬 Best-of-N (BoN): Generate many candidate solutions; use learned verification to pick the best. How it works: score each with strategies; choose the top-scoring one. Why it matters: Even if average quality is okay, selection can make the final answer much better. 🍞 Anchor: Submit your strongest essay draft after reviewing all drafts with a good rubric.
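Best-of-N selection reduces to a short scoring loop; the toy judge below is a made-up stand-in for an LLM applying a written strategy:

```python
def best_of_n(candidates, strategies, judge):
    """Sketch of Best-of-N: score each candidate answer by how many
    verification strategies it passes, then return the top scorer.
    judge(strategy, candidate) -> True/False is an assumed interface."""
    return max(candidates, key=lambda c: sum(judge(s, c) for s in strategies))

# Toy example: candidates are numbers, strategies are named checks.
strategies = ["is_even", "is_positive"]
def judge(strategy, candidate):
    return candidate % 2 == 0 if strategy == "is_even" else candidate > 0

print(best_of_n([-4, 7, 6], strategies, judge))  # 6 passes both checks
```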

The Secret Sauce: Geometric Reward

  • Reliability: Strategies too far from their cluster center risk being off-topic or noisy; closeness gets rewarded.
  • Diversity: Strategies that cover new angles (spread around the circle) get rewarded; duplicates get less.
  • Consistency: Strategies that bless high-consensus solutions but catch likely-bad ones get rewarded. Together, these three forces produce a stable, many-sided, and trustworthy verification system that, in turn, trains a better Solver.
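Combining the three forces into one Verifier reward can be sketched as a weighted sum; the weights here are illustrative assumptions, not the paper's values:

```python
def verifier_reward(consistency, reliability, diversity,
                    weights=(1.0, 0.5, 0.5)):
    """Sketch: blend consistency, reliability, and diversity into one
    Verifier reward. The weights are assumptions for illustration."""
    w_c, w_r, w_d = weights
    return w_c * consistency + w_r * reliability + w_d * diversity

print(verifier_reward(consistency=0.9, reliability=0.8, diversity=0.6))
```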

Putting it all in a recipe: Input: A science question (labeled in Stage 1; unlabeled in Stage 2). Steps: Generate N solutions and M strategies → Judge each pair → Compute rewards (consensus for Solver; geometric+consistency for Verifier) → Update the single LLM with PPO. Output: A stronger, more reliable reasoner and a smarter internal checker.

04Experiments & Results

The tests: The authors measured how well the trained models solved tough science questions across many subjects and how effective the learned verification strategies were. They used official scoring tools to keep everything fair.

Benchmarks and models:

  • Datasets for training: Small labeled anchors from MegaScience and NuminaMath; larger unlabeled sets (18k and 30k) mixing topics like physics, biology, chemistry, medicine, math, CS, economics, plus some ScienceQA and CaseHold.
  • Base models: Qwen2.5-7B-Instruct and Qwen3-8B served as the shared Solver/Verifier policy. A strong Qwen3-235B model acted as the external judge applying strategies.
  • Evaluation benchmarks: MMLU-Pro (broad, hard multi-subject), GPQA-Diamond (graduate-level, “Google-proof” questions), UGPhysics (undergrad physics reasoning).

Competition and comparisons: They compared Sci-CoE to the untrained base models and to other strong 7–9B open models (Llama-3.1-8B, Yi-1.5-9B, Mistral-Small-Instruct, etc.).

Scoreboard with context:

  • GPQA-Diamond (Qwen3-8B): Baseline 36.87% → Sci-CoE 40.91%. That’s like jumping from a solid C+ to a firm B, where questions are extremely challenging.
  • MMLU-Pro (Qwen3-8B): Baseline 63.19% → Sci-CoE 64.34%. Think of moving from an A- to closer to an A on a giant, mixed-subject exam.
  • UGPhysics: Gains of about +1.97% for the 7B base and +1.34% for the 8B base. This is notable because UGPhysics covers multiple subfields; improvements mean better general scientific habits, not just memorized facts.

Scalability: When they increased unlabeled training data from 18k to 30k problems in Stage 2, performance kept improving without flattening out. That’s a strong sign that the method learns more from more diverse, unlabeled science questions.

Surprising and useful findings:

  • Anchored Learning is small but mighty: Even with as few as 0.4k labeled anchors, performance improved, showing that just a sprinkle of answer-key data is enough to set a sturdy direction.
  • Geometric Reward beats Naive Consensus: A simple “just maximize agreement” approach caused the Verifier to collapse into easy tricks (like format checks). The geometric reward balanced consistency, reliability, and diversity, producing a healthier, more helpful checker that lifted results across benchmarks.
  • Inference-time win (Best-of-N): Not only did average solution quality rise, but using the learned Verifier to pick the best candidate among many gave further boosts—practically valuable when you care about the final answer’s reliability.

What the plots showed (described simply): When projecting verification strategies into a 2D map, the baseline looked scattered and weak; Stage 1 gathered more reliable checks but still clumped in a narrow range; a naive consensus reward squeezed strategies into a few dense clumps (too samey); the geometric reward spread them out evenly around the circle (diverse) while keeping them near the center (reliable), and with greener colors (higher consistency).

05Discussion & Limitations

Limitations:

  • Model size: Experiments topped out at around 7–8B parameters for the shared Solver/Verifier, so we don’t yet know the full benefits at larger scales.
  • External judge: Training uses a big judge model to apply strategies, which adds compute cost and could introduce judge bias. Future work may train the judge role internally or distill it to a smaller model.
  • Domain gaps: On some UGPhysics sub-disciplines, gains weren’t uniform—suggesting that if the unlabeled set doesn’t match a topic, improvement there may lag.
  • Strategy embeddings: The geometry depends on a particular embedding model; different embeddings might shift the clusters, which could change rewards.

Required resources:

  • A capable base LLM (7–8B in the paper) and an even larger external judge LLM during training.
  • PPO RL pipeline with batched rollouts (e.g., using vLLM), clustering (K-means), and PCA.
  • A small labeled anchor set (hundreds to a few thousand problems) and a larger unlabeled pool (tens of thousands) spanning multiple sciences.

When not to use:

  • If you already have gold-standard, executable verifiers (like unit tests) for every problem—then simpler reward designs might be cheaper and cleaner.
  • If your domain demands exact numeric answers and leaves no room for multiple methods (e.g., short arithmetic drills), geometric diversity rewards add little value.
  • If compute is highly constrained (no room for a judge model or RL), a simpler supervised fine-tuning may be more practical.

Open questions:

  • Can we replace the external judge with a lighter self-judger or multi-agent committee without losing faithfulness to the strategy?
  • How best to auto-tune the diversity–reliability trade-off for different domains (e.g., chemistry vs. law)?
  • Can we adaptively grow the consensus threshold based on difficulty, or detect when strategies are overfitting to surface patterns?
  • How far does this scale with hundreds of thousands or millions of unlabeled science problems, and what new behaviors emerge?
  • Could multimodal evidence (figures, tables, equations) be integrated into the verification geometry for even stronger checks?

06Conclusion & Future Work

In three sentences: Sci-CoE trains one LLM to be both a Solver and a Verifier, starting with a tiny bit of labeled data and then growing through unsupervised co-evolution. A geometric consensus reward keeps verification strategies both reliable (near the center) and diverse (spread by angle), which yields steady, robust progress without heavy answer-key dependence. As a result, Sci-CoE shows consistent gains on tough science benchmarks and scales well with more unlabeled data.

Main achievement: Turning vague scientific checking into a stable, learnable space—so a model can teach itself better science reasoning by co-evolving solution-making and strategy-making under minimal supervision.

Future directions: Remove or shrink the external judge; extend to multimodal science (plots, spectra, diagrams); auto-calibrate diversity vs. reliability; test much larger unlabeled pools; and adapt the geometry to different domains. Also explore teaching the model to generate new, curriculum-shaped science problems to amplify growth.

Why remember this: Sci-CoE shows that when you can’t rely on answer keys, you can still grow a trustworthy scientific thinker by co-evolving solving and checking—and by shaping the space of checks with geometry, you prevent collapse and keep learning alive.

Practical Applications

  • Homework helper that not only gives a solution but also runs multiple independent checks (units, boundary cases, reverse math) to boost trust.
  • Research assistant that proposes hypotheses and then drafts diverse verification plans to stress-test the logic before experiments.
  • Engineering design review bot that checks calculations from different angles (safety constraints, physical laws, tolerance ranges).
  • Scientific peer-review aid that evaluates manuscripts’ methods with varied verification strategies to catch subtle issues.
  • Medical reasoning copilot that cross-checks diagnostic steps (consistency with symptoms, test results, guidelines) without needing full labels.
  • Lab protocol validator that reviews experimental steps and flags likely violations of conservation laws or measurement units.
  • Educational tool that teaches students how to verify their own work by showing several verification strategies per problem.
  • Quality assurance for simulations that verifies outputs against conservation checks, scaling laws, and simplified models.
  • Curriculum generator that creates problems and matching verification rubrics to guide self-study.
  • Inference-time booster for LLMs that selects the most trustworthy answer among many using the learned Verifier.
Tags: scientific reasoning, co-evolution, solver-verifier, reinforcement learning, PPO, sparse supervision, geometric consensus, strategy diversity, strategy reliability, latent space, embeddings, K-means, PCA, Best-of-N, self-play