Our First Proof submissions
Key Summary
- This paper shares how an AI tried to solve 10 tough, research-level math problems and produced full proofs, not just short answers.
- Experts think at least five of those AI-written proofs are likely correct, while others are still being checked and one was found incorrect after review.
- The challenge, called First Proof, tests whether AI can build long, careful arguments that experts can verify, which is harder than typical benchmarks.
- The team trained a new reasoning-focused model to think more rigorously for many hours at a stretch, then used light human guidance to refine and present the proofs.
- They also used back-and-forth with another AI (ChatGPT) for formatting, clarity, and some verification support.
- Past successes (like IMO-level performance and a physics result later proved) hinted this approach might work on research-grade problems.
- The process wasn't perfectly controlled; selections and retries were guided by human judgment, and evaluations relied on expert review.
- The big idea is that frontier challenges reveal whether AI can sustain long chains of logic, choose good abstractions, and survive expert scrutiny.
- If this keeps improving, AI assistants could help mathematicians and scientists reason more carefully and check complex arguments faster.
Why This Research Matters
If AI can write checkable, expert-approved proofs, it can help mathematicians discover new results faster and with fewer errors. The same skills—planning, verifying, and revising long arguments—transfer to science and engineering, where one bad step can cause big failures. This could speed up research in areas like physics and biology by helping teams test ideas and catch mistakes earlier. It can also make complex reasoning more accessible to students and smaller labs that lack large expert teams. Over time, integrating AI with formal proof tools may raise the overall standard of correctness in technical work. That means safer software, more reliable designs, and quicker progress on hard problems.
Detailed Explanation
01 Background & Problem Definition
Let’s set the stage by meeting the key ideas one by one, using our sandwich pattern so nothing feels mysterious.
🍞 Hook: You know how in court a lawyer must show step-by-step evidence, not just shout the answer? 🥬 The Concept (Mathematical Proof): A mathematical proof is a careful, step-by-step explanation that shows beyond doubt that a statement is true. How it works:
- Start from known facts (definitions and theorems).
- Move in tiny, justified steps where each step follows from the last.
- End at the exact claim you wanted to show. Why it matters: Without proofs, we might trust guesses that look right but are wrong. 🍞 Anchor: Saying “all even numbers are divisible by 2” is obvious, but a proof shows it formally: even means 2×k, so division by 2 leaves k.
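The anchor above can even be spot-checked mechanically. Here is a tiny illustrative check (ours, not from the paper): a finite test that 2×k divided by 2 always returns k. Note the difference in kind, though: a computer check of many cases is evidence, while the proof covers every case at once.

```python
# Illustrative check of the anchor example: an even number is 2*k,
# so integer division by 2 recovers k. A finite check is evidence,
# not a proof -- the proof handles all k at once.
def is_even(n: int) -> bool:
    return n % 2 == 0

def check_anchor(limit: int) -> bool:
    # For every k up to limit: 2*k is even, and (2*k) // 2 == k.
    return all(is_even(2 * k) and (2 * k) // 2 == k for k in range(limit))

print(check_anchor(10_000))  # True
```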
🍞 Hook: Imagine teaching a robot to do chores—first it must learn what a broom is and how to plan steps. 🥬 The Concept (Artificial Intelligence, AI): AI is a computer system that learns patterns and rules so it can solve problems that usually need human thinking. How it works: It trains on examples, learns internal rules, and applies them to new situations. Why it matters: Without learning and rules, it would just guess randomly. 🍞 Anchor: A translation AI reads English and outputs Spanish by learning patterns from many sentence pairs.
🍞 Hook: Think of a school’s toughest final exam where questions need full essays, not multiple choice. 🥬 The Concept (First Proof Challenge): First Proof is a research-level test to see if AI can write complete, checkable mathematical proofs for domain-specific problems. How it works: Experts design hard, specialized problems; AI produces whole arguments; experts review correctness. Why it matters: Short answers are easy to grade, but real math progress needs end-to-end reasoning. 🍞 Anchor: Instead of “What’s 2+2?”, it’s “Prove this deep geometry statement using three lemmas you must invent.”
🍞 Hook: When your teacher grades homework, they look for clear steps, not just a final number. 🥬 The Concept (Checkable Proofs): A checkable proof is written so that another person (or a tool) can verify every step. How it works: Each step cites a rule or earlier result; gaps are minimized; notation is precise. Why it matters: Without checkability, we can’t trust if the conclusion truly follows. 🍞 Anchor: A proof that highlights which theorem justifies each step lets a grader tick boxes quickly.
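The "tick boxes" idea can be sketched in a few lines of Python. This is an invented toy model, not the paper's system: each step may cite only axioms or earlier steps, so a checker can walk the proof in order and confirm every claim is grounded.

```python
# Toy sketch of a "checkable" proof structure: each step may cite
# only axioms or *earlier* steps, so a grader can tick boxes in order.
# The data model and labels here are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Step:
    label: str                                  # e.g. "S2"
    claim: str                                  # statement asserted
    cites: list = field(default_factory=list)   # labels of justifications

def check_citations(steps, axioms):
    # Verify every citation points to an axiom or a previous step.
    established = set(axioms)
    for step in steps:
        for ref in step.cites:
            if ref not in established:
                return False, f"{step.label} cites unknown '{ref}'"
        established.add(step.label)
    return True, "all steps grounded"

steps = [
    Step("S1", "n is even", cites=["hypothesis"]),
    Step("S2", "n = 2*k for some k", cites=["def-even", "S1"]),
    Step("S3", "n/2 = k", cites=["S2"]),
]
print(check_citations(steps, axioms={"hypothesis", "def-even"}))
# (True, 'all steps grounded')
```

A step that cited a later result, or nothing at all, would be flagged immediately, which is exactly what makes the proof "checkable" rather than merely persuasive.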
🍞 Hook: A jigsaw puzzle from the ocean set won’t fit pieces from the space set. 🥬 The Concept (Domain-Specific Problems): These are math problems in specialized areas (like topology or number theory) that need particular background knowledge. How it works: They use custom definitions and tools that don’t transfer directly across fields. Why it matters: Without domain know-how, you might pick the wrong tools and get stuck. 🍞 Anchor: Using calculus tricks won’t help much in a combinatorics counting puzzle with graph-specific rules.
🍞 Hook: Picture a detective linking clues into a single, unbroken story. 🥬 The Concept (Reasoning Models): These AI systems are designed to chain many logical steps, keep track of assumptions, and avoid contradictions. How it works: They plan, break problems into subgoals, evaluate progress, and revise when stuck. Why it matters: Without careful reasoning, the AI might make a tiny mistake early that ruins everything later. 🍞 Anchor: To prove a hard claim, the model might first prove three small lemmas that assemble into the final result.
🍞 Hook: A soccer player gets better with drills, scrimmages, and coaching feedback. 🥬 The Concept (AI Model Training): Training is how an AI improves—by practicing on examples, receiving guidance, and adjusting its internal parameters. How it works: Present tasks; score performance; nudge the model to reduce future mistakes; repeat. Why it matters: Without training, the model won’t generalize or improve with experience. 🍞 Anchor: After thousands of practice proofs, the model learns when to define a helper function or state a lemma.
🍞 Hook: Before publishing a book, an editor checks the facts and fixes unclear sentences. 🥬 The Concept (Expert Review): Specialists read the proof to confirm correctness, clarity, and completeness. How it works: Experts look for gaps, misused theorems, or hidden assumptions, and request clarifications. Why it matters: Without expert review, mistakes can sneak through and mislead others. 🍞 Anchor: A number theory professor flags a missing justification and asks the author to add a lemma.
The world before: Most AI math tests asked for short answers or contest-style solutions. They checked if an AI could jump to the right result or solve a tidy problem within a few steps. That’s helpful, but real research problems are different: they are long, messy, and need sustained reasoning, good choices of abstraction, and careful writing so that other experts can actually check the work.
The problem: Could AI do research-grade math—build an end-to-end proof in a specialized domain that stands up to expert scrutiny? This is much harder than guessing an answer or filling in a routine step. It requires hours of consistent, careful thinking and precise language.
Failed attempts: Earlier benchmarks often didn’t stress the longest chains of reasoning. Some systems produced correct-looking but uncheckable arguments, skipped steps, or applied the wrong tools for the domain. Others lost track of assumptions mid-proof or made a small error that broke the entire argument.
The gap: We needed a setting that (1) forces sustained reasoning, (2) demands checkability, and (3) uses problems hard enough that guessing won’t work. First Proof fills this gap by offering expert-authored, domain-specific challenges reviewed by specialists.
Real stakes: In math and science—and even in everyday tech—careful reasoning prevents costly mistakes. If AI can reliably build and check long arguments, it could help researchers find new theorems, catch subtle errors, and accelerate discovery. Beyond math, the same skills matter for planning scientific experiments, verifying software, or designing complex systems where one wrong step can break the whole plan.
02 Core Idea
Here’s the heart of this work.
🍞 Hook: Imagine trying to cross a river using stepping stones; you can only reach the other side if every stone is solid and placed just right. 🥬 The Concept (Rigor in Reasoning): The key insight is to make the AI’s thinking more rigorous—so each step is justified, linked, and stable over long stretches of time. How it works: Train the model to plan ahead, verify intermediate steps, and revise shaky parts until the whole chain is sturdy. Why it matters: One weak step can sink the entire proof; rigor makes the path trustworthy. 🍞 Anchor: The AI states a lemma, proves it carefully, cites it later, and never forgets its conditions.
Aha! moment in one sentence: Treat research-grade math proofs as long, checkable journeys and train AI specifically to stay rigorous, patient, and self-correcting across the entire trip.
Multiple analogies:
- Hiking guide: The model maps the trail (plan), checks each marker (verification), and adjusts course when fog rolls in (revision).
- Orchestra conductor: It coordinates many instruments (lemmas) so they stay in tune (consistency) and finish together (final theorem).
- Chef’s tasting menu: Each course (subproof) must be well-prepared and timed so the whole meal (proof) works; one undercooked dish ruins it.
Before vs. after:
- Before: AI often nailed short problems but stumbled on long arguments, forgot assumptions, or used the wrong field’s tools.
- After: This approach shows AI can sustain multi-hour reasoning, choose fitting abstractions, and produce proofs that experts judge likely correct on several research-level problems.
Why it works (intuition, not equations):
- Long memory of goals: By planning subgoals (lemmas), the AI keeps a stable blueprint.
- Step-by-step checking: The model pauses to assess whether a step truly follows, catching issues early.
- Domain awareness: It selects tools that match the field, avoiding dead ends.
- Iterative refinement: When experts ask for clarity, the model expands and reorganizes until gaps close. These habits reduce the chance of early mistakes snowballing into failure.
🍞 Hook: When you build a tower of blocks, you test each layer so the tower doesn’t wobble later. 🥬 The Concept (Verification): Verification is the practice of checking that each proof step is valid and the overall structure is consistent. How it works: Use explicit justifications, cross-references, and (when possible) automated checks; invite expert review to catch subtle issues. Why it matters: Without verification, confident-sounding arguments can still be wrong. 🍞 Anchor: The model cites a standard theorem; a checker (or expert) confirms the theorem applies because the conditions were met two steps earlier.
Building blocks (smaller pieces):
- Goal decomposition: Break the main theorem into lemmas.
- Tool selection: Pick field-appropriate definitions and theorems.
- Proof drafting: Write a first pass quickly to see the shape.
- Self-audit: Re-read for gaps, contradictions, or missing conditions.
- Expert feedback: Incorporate comments to improve clarity and correctness.
- Presentation polish: Format the final proof so others can check it line by line. Together, these pieces help transform raw ideas into a checkable proof.
03 Methodology
At a high level: Problem statement → Plan (subgoals/lemmas) → Draft proof → Self-check and revise → Expert feedback loop → Final proof candidate.
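The flow above can be written as a loop skeleton. Every function name below is a placeholder we invented for illustration; only the control flow mirrors the text, not any actual implementation.

```python
# Skeleton of the workflow described above. Every function is a
# stand-in for model calls or human review; only the control flow
# (plan -> draft -> check/revise rounds -> candidate) mirrors the text.
def prove(problem, max_rounds=3):
    plan = decompose(problem)            # Plan: subgoals / lemmas
    draft = write_draft(problem, plan)   # Draft proof
    for _ in range(max_rounds):
        issues = self_check(draft)               # Self-check
        issues += expert_feedback(draft)         # Expert feedback loop
        if not issues:
            break
        draft = revise(draft, issues)            # Revise and retry
    return draft                         # Final proof candidate

# Minimal stand-ins so the skeleton runs end to end.
def decompose(p): return ["lemma-1", "lemma-2"]
def write_draft(p, plan): return {"plan": plan, "gaps": ["unjustified step"]}
def self_check(d): return list(d["gaps"])
def expert_feedback(d): return []
def revise(d, issues): return {**d, "gaps": []}

print(prove("toy problem"))  # {'plan': ['lemma-1', 'lemma-2'], 'gaps': []}
```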
Step 1: Understand the problem
- What happens: The AI reads the domain-specific statement, identifies definitions, and clarifies the exact claim to prove.
- Why this step exists: Misreading the problem leads to proving the wrong thing or applying the wrong tools.
- Example: If the statement is about graphs with certain edge constraints, the AI explicitly restates those constraints and the target property to show.
Step 2: Plan by decomposition
- What happens: The AI proposes a proof outline with lemmas that would, together, imply the main result.
- Why this step exists: Complex results are rarely solved in one move; breaking them down reduces cognitive load and creates checkpoints.
- Example: To prove a global property, the plan may include (i) a structural lemma, (ii) a counting lemma, and (iii) a final assembly argument linking (i) and (ii).
Step 3: Select domain tools
- What happens: The AI picks definitions and theorems relevant to the field (e.g., compactness in topology or multiplicative functions in number theory).
- Why this step exists: Using off-domain tools wastes time and can create logical mismatches.
- Example: For a combinatorics problem, it selects the pigeonhole principle or extremal lemmas instead of calculus tricks.
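As an illustrative aside (not part of the paper's workflow), the pigeonhole principle mentioned in the example can even be verified exhaustively for small cases:

```python
# Pigeonhole principle, checked exhaustively for small cases:
# any way of placing n+1 items into n boxes puts >= 2 in some box.
from itertools import product
from collections import Counter

def some_box_has_two(assignment):
    return max(Counter(assignment).values()) >= 2

def pigeonhole_holds(n_boxes):
    # Enumerate every assignment of n_boxes+1 items to n_boxes boxes.
    return all(
        some_box_has_two(a)
        for a in product(range(n_boxes), repeat=n_boxes + 1)
    )

print(all(pigeonhole_holds(n) for n in range(1, 5)))  # True
```

Brute force stops working as n grows, of course, which is precisely why the general statement needs a proof rather than a program.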
Step 4: Draft the proof
- What happens: The AI writes a first-pass argument for each lemma and the main theorem, marking places that feel uncertain.
- Why this step exists: A draft reveals the overall structure and surfaces fragile spots early.
- Example: It sketches a counting bound, noting “needs tighter constant” as a to-do.
Step 5: Self-check (local verification)
- What happens: The AI verifies each step’s justification, checks that conditions for cited theorems are satisfied, and ensures no circular reasoning.
- Why this step exists: Tiny local errors can topple the entire proof; catching them early saves time.
- Example: Before using a theorem that requires connectivity, it confirms the object was proven connected two steps earlier.
Step 6: Global consistency check
- What happens: The AI reviews the whole chain: assumptions flow forward, lemmas are used correctly, and there are no hidden dependencies.
- Why this step exists: Even if local steps are fine, a mismatch between lemmas can break the global logic.
- Example: It confirms Lemma B’s conclusion truly matches the hypothesis of Lemma C used in the finale.
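The Lemma B/Lemma C example can be mimicked with a toy interface check (lemma names and statements invented for illustration; real proofs need semantic matching, not string matching):

```python
# Toy global-consistency check: Lemma B's conclusion must supply
# Lemma C's hypothesis. String labels keep the illustration simple;
# a real checker would match statements semantically.
lemmas = {
    "B": {"hypothesis": {"G connected"},
          "conclusion": {"G has spanning tree"}},
    "C": {"hypothesis": {"G has spanning tree"},
          "conclusion": {"main theorem"}},
}

def interfaces_match(producer, consumer):
    # Everything the consumer assumes must be among what the producer concluded.
    return lemmas[consumer]["hypothesis"] <= lemmas[producer]["conclusion"]

print(interfaces_match("B", "C"))  # True
```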
Step 7: Targeted retries and strategy shifts
- What happens: If a subproof seems brittle, the AI tries alternate strategies learned during training (e.g., strengthen a lemma, redefine a term, or switch proof techniques).
- Why this step exists: Stubborn dead-ends need creative redirection, not just polishing.
- Example: When an induction stalls, it reframes the invariant or moves to a contradiction argument.
Step 8: Clarity and presentation pass
- What happens: The AI reorganizes sections, adds missing definitions, labels claims, and standardizes notation.
- Why this step exists: Checkability depends on clarity; unclear steps are hard to verify.
- Example: It renames variables to avoid clashes and adds a summary paragraph before the finale.
Step 9: Expert feedback loop
- What happens: Human experts review and request expansions or clarifications; the AI responds by filling gaps or simplifying arguments.
- Why this step exists: Experts catch subtle domain-specific issues and ensure the argument is readable and trustworthy.
- Example: After feedback, the AI adds an appendix proving a side claim that was used implicitly.
Step 10: Auxiliary AI support
- What happens: A companion AI (e.g., ChatGPT) helps with formatting, style, and preliminary verification passes.
- Why this step exists: Presentation and surface-level checks speed up expert review and reduce miscommunication.
- Example: It converts the proof to a consistent theorem–lemma–proof structure and flags unclear references.
Step 11: Selection and submission
- What happens: From multiple attempts, the best candidate proof is chosen based on coherence, completeness, and expert feedback.
- Why this step exists: Different runs may explore different strategies; picking the strongest increases the chance of correctness.
- Example: Of three variants, the team picks the one with the tightest bound and clearest lemma dependencies.
The secret sauce
- Extended reasoning capacity: The model is trained to think longer without drifting, keeping track of assumptions for many hours.
- Rigor-focused tuning: Training emphasizes checkability, step justification, and self-correction habits.
- Guided retries: Light human nudges suggest fruitful strategies seen in earlier runs, steering exploration while keeping autonomy.
- Feedback integration: The model improves proofs after expert comments, tightening weak spots. These ingredients help the AI move from flashy guesses to sturdy, checkable arguments.
Concrete mini-walkthrough
- Input: Problem #X in a specialized field with tight constraints.
- Plan: Propose three lemmas that, combined, yield the theorem.
- Draft: Write proofs for Lemma 1 and 2; mark Lemma 3 as uncertain.
- Self-check: Spot that Lemma 2 needs an extra assumption; add and propagate it.
- Retry: Switch Lemma 3 from induction to a minimal counterexample approach.
- Feedback: Expert asks for a justification in Lemma 1; AI adds a detailed sublemma.
- Output: A formatted, line-by-line proof with references, now easier to verify.
04 Experiments & Results
The test: The team ran an internal reasoning-focused AI model on all 10 First Proof problems—hard, expert-authored, domain-specific math challenges. The key measurement was not speed or a single final number, but whether the produced proofs were correct and checkable by experts.
What was compared: There wasn’t a simple leaderboard of other AIs doing the exact same thing, because First Proof targets research-level reasoning rather than standard benchmarks. Instead, context came from human performance (an academic department with domain overlap could plausibly solve many in a week) and prior AI milestones (like IMO-level results and physics collaborations). This places the attempt in a landscape where sustained reasoning and expert verification are the main yardsticks.
Scoreboard with context:
- Likely correct proofs: Experts believe at least five proofs (problems 4, 5, 6, 9, and 10) have a high chance of being correct. That’s like having half the questions on an exam set by top professors graded not just as answered correctly but as A-level essay work.
- Under review: Several other proofs are being examined; these could move up or down as experts dig in.
- Correction made: An early belief that problem 2 was solved was later overturned after the official commentary and community analysis; it’s now considered incorrect. This shows the system and team updated beliefs with evidence.
Surprising or notable findings:
- Day-by-day gains: As the new model trained, it got visibly better, eventually reaching a point where it could likely solve multiple additional problems. That suggests that rigor-focused training can yield rapid, tangible improvements in research-grade reasoning.
- Proof polish matters: Expanding and clarifying parts of proofs—sometimes after expert prompts—made verification easier and increased confidence in correctness. Presentation isn’t just cosmetic; it directly affects checkability.
- Limited supervision worked: Even with light human guidance (like suggesting retry strategies) and auxiliary AI for formatting, the model could sustain long chains of thought and deliver plausible research-level proofs.
Meaning of the numbers: Saying “five likely correct proofs out of ten” might sound like a coin flip. But because these are research-level, many-step arguments where even small errors can ruin the entire structure, achieving that many likely-correct, expert-vetted proofs is closer to scoring an A+ on the hardest parts of an exam where most others might not finish. It shows the model isn’t just guessing; it’s building structured arguments that survive scrutiny.
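One way to see why sustained correctness is so hard (our illustrative arithmetic, with hypothetical numbers, not figures from the paper): if each of N steps is independently correct with probability p, the whole chain survives with probability p^N, which collapses quickly for long arguments.

```python
# Illustrative arithmetic (hypothetical numbers): a long proof
# survives only if every step does. With per-step reliability p and
# N independent steps, the full chain is correct with probability p**N.
def chain_survival(p: float, n_steps: int) -> float:
    return p ** n_steps

# Even 99% per-step accuracy collapses over a 200-step argument.
print(round(chain_survival(0.99, 200), 3))  # 0.134
```

Under this crude model, getting several long expert-vetted proofs right implies very high per-step reliability, which is the point of the "A+ on the hardest parts" comparison above.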
Connection to earlier results: Previous milestones—gold-medal performance on the IMO benchmark and a physics result later formally proved—hinted that the model’s reasoning scaffolding could extend to research settings. First Proof strengthens that case by shifting from contest problems to specialist, end-to-end proofs.
Takeaway: The experiments show that with the right training and workflow, an AI can step into the demanding world of research mathematics, producing proofs that real experts judge as likely correct, while also revealing where and why mistakes happen.
05 Discussion & Limitations
Limitations
- Evaluation control: The sprint was fast and not perfectly controlled. Human selection of the “best attempt,” light prompting guidance, and evolving training blur a clean, apples-to-apples comparison across problems or time.
- Reliance on expert review: Without formal proof checkers for every domain-specific step, correctness judgments depend on human experts, which can be slow and sometimes disagree.
- Fragile spots remain: One proof initially thought correct (#2) was later judged incorrect. This shows the approach can miss hidden assumptions or apply a theorem outside its valid conditions.
- Generality vs. specialization: Excelling on a set of handpicked, expert problems does not yet mean the model can solve arbitrary research questions on demand.
Required resources
- A frontier-scale reasoning model trained for long-horizon, rigor-focused thinking.
- Access to domain experts for feedback and verification.
- Auxiliary AI tools (for formatting and preliminary checks) and infrastructure for running long sessions.
- Time to iterate: multiple drafts, retries, and review cycles.
When not to use
- Fast, low-stakes questions where a short answer suffices; the overhead of rigorous proof-building is unnecessary.
- Domains lacking clear definitions or where correctness criteria are subjective; the model shines when rules are well-specified.
- Settings without expert availability; unreviewed outputs may look convincing yet still be wrong.
Open questions
- Formalization: How much of expert review can be moved into formal proof assistants so that checking becomes automatic and unambiguous?
- Reliability: What training signals most improve the model’s ability to avoid subtle, late-stage errors that pass casual inspection?
- Transfer: How well do these reasoning skills move across very different fields (e.g., from combinatorics to algebraic geometry)?
- Scalability: Can the model sustain week-long reasoning with consistent rigor, and can this be done cost-effectively?
- Collaboration protocols: What workflows best combine human mathematicians, AI proof writers, and automated checkers to reduce errors and latency?
06 Conclusion & Future Work
Three-sentence summary: This work puts an AI to the test on the First Proof challenge, where the goal is not just to answer questions but to write complete, checkable proofs for tough, specialized math problems. With training that emphasizes rigor and long-horizon reasoning—and with light human guidance and expert review—the AI produced several proofs that experts believe are likely correct, while others remain under review and one was later found to be incorrect. The results suggest AI can participate meaningfully in research-grade reasoning, though careful evaluation and formal verification remain crucial.
Main achievement: Demonstrating that an AI can sustain multi-hour, domain-aware reasoning to produce end-to-end, expert-checkable proofs on multiple research-level problems—not as isolated tricks, but as structured, revisable arguments.
Future directions: Integrate more formal proof checking to reduce reliance on human judgment; refine training signals that reward step-level justification; build better collaboration loops between humans, AI writers, and automated verifiers; and expand testing to broader, more diverse problem sets.
Why remember this: It marks a shift from “Can AI get the right answer?” to “Can AI build an argument experts trust?” If that trend continues, AI could help discover new theorems, catch subtle mistakes, and make high-level reasoning more accessible—accelerating progress not only in mathematics but across science and engineering where correctness truly matters.
Practical Applications
- Assisting mathematicians in drafting and refining proofs with clearer steps and justifications.
- Supporting peer reviewers by highlighting gaps, unstated assumptions, or misapplied theorems in submitted papers.
- Helping educators generate step-by-step, checkable solutions for advanced homework and exam problems.
- Accelerating formalization by turning human-written proofs into structures suitable for proof assistants.
- Providing research teams with alternative proof strategies when a current approach stalls.
- Pre-screening technical documents (math-heavy grant proposals or reports) for logical consistency.
- Training students to think rigorously by offering feedback on missing steps or unclear definitions.
- Auditing safety-critical reasoning in engineering plans, ensuring every claim is justified.
- Exploring new conjectures by proposing lemmas and checking whether they combine into a plausible theorem.