FireRed-OCR Technical Report
Key Summary
- FireRed-OCR turns a general vision-language model into a careful document reader that follows strict rules, so its outputs are usable in the real world.
- The big problem it fixes is structural hallucination—when models invent or break tables, formulas, or reading order even if the words look right.
- A special Data Factory balances rare layouts (like weird receipts or nested tables) using both shape clues (geometry) and meaning tags (semantics).
- Training happens in three steps: learn to see and point (pre-alignment), learn to write clean Markdown (SFT), then learn to obey format rules with rewards (GRPO).
- The GRPO stage gives points for closed tables, compilable formulas, finished tags, and accurate text, which reduces broken outputs.
- On OmniDocBench v1.5, FireRed-OCR scores 92.94%, beating strong models like DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order.
- Synthetic (rendered) pages with perfect labels teach the model rare, tricky cases like multi-page tables and deeply nested math.
- Iterating SFT and GRPO keeps the model smart about content while staying strict about structure, avoiding reward-hacking and forgetting.
- Even with only ~2B parameters, FireRed-OCR competes with or beats very large general VLMs by being specialized and disciplined.
- This approach is open-source and shows a clear recipe to turn “general” models into “structural experts” for documents.
Why This Research Matters
Documents run our world—bills, school worksheets, research papers, contracts—and most need both correct text and correct structure. FireRed-OCR makes outputs that are not only readable by people but also digestible by tools, because tables close, formulas compile, and reading order makes sense. This reduces costly manual fixing when importing data into spreadsheets, databases, or grading systems. It shows that smaller, specialized models can match or beat giant generalists when trained with the right data and rules, saving compute and money. The approach is open-source and reusable, creating a blueprint for turning general models into dependable specialists across other structured tasks. Ultimately, it helps organizations move from “OCR that kind of works” to “structured understanding you can trust.”
Detailed Explanation
01 Background & Problem Definition
You know how you can read a messy worksheet and still know which box goes with which question because you follow the page’s structure? Computers need that, too.
🍞 Hook: Imagine copying a table from a textbook. If you mix up the rows or forget the last column, your homework answer becomes useless—even if each word is spelled right.
🥬 The Concept: Structural Hallucination
- What it is: When a model outputs words that look fine but breaks the structure—like misordered table rows, missing headers, or formulas that don’t compile.
- How it works (the mistake recipe):
- The model recognizes text chunks.
- It tries to write everything out at once.
- Without strict rules, it guesses the structure, leading to broken tables or unmatched tags.
- Why it matters: Downstream tools (like spreadsheets or math compilers) can’t use broken structure, so the result is practically useless.
🍞 Anchor: The model reads a receipt but prints a table where row 3 has only 2 columns while others have 3. Your accounting tool refuses to import it. That’s structural hallucination.
The world before: Traditional OCR pipelines were like teams: one part finds text boxes, another reads characters. They were pixel-precise but often missed the story order (like which column to read first). Newer end-to-end models learned meaning better, but they often broke the rules of structure—great at “what,” sloppy at “how it must be formatted.” Meanwhile, public datasets skewed toward simple pages (like novels), not tricky layouts (like invoices with nested tables). Also, annotations disagreed on styles (Markdown vs HTML), confusing models during training.
The problem: Turn a general Vision-Language Model (VLM) that’s good at understanding images and text into a rule-following document expert that keeps structure perfectly intact.
Failed attempts:
- Relying only on bigger models: They still hallucinate structure because they’re not taught strict formatting rules.
- Random data sampling: It overfeeds common, easy pages and underfeeds rare, hard layouts.
- Pure supervised training: It can teach style but not enforce it; models still occasionally forget to close tables or match columns.
The gap: We need both (a) data that evenly covers rare, twisty layouts and (b) training that rewards obeying structure rules—not just predicting words.
What this paper adds:
- A Geometry + Semantics Data Factory that purposefully balances hard layouts and unifies all annotations into one clean Markdown style.
- A three-stage progressive training plan: first ground vision to structure, then standardize output, then lock in rules with reinforcement learning rewards.
Real stakes (why you should care):
- Bills, receipts, and invoices must import cleanly into accounting tools.
- Scientific PDFs must keep formulas and tables correct for search and editing.
- Government forms and legal contracts require exact reading order and intact sections.
- Education content (worksheets, exams) needs compilable formulas and sturdy tables for grading.
🍞 Hook: You know how Lego instructions show exactly where each piece goes, step by step?
🥬 The Concept: Geometry + Semantics Data Factory
- What it is: A pipeline that organizes training pages by their shapes (layout geometry) and meanings (semantics) so the model practices on a fair mix—including rare, hard types.
- How it works:
- Cluster pages by visual layout (columns, tables, boxes).
- Tag pages by language, source (scan vs photo), and genre (receipt, contract, paper).
- Sample more from rare clusters (long-tail layouts).
- Re-annotate everything into one clean Markdown style.
- Add synthetic pages for ultra-rare cases (like multi-page tables, nested formulas).
- Filter with rules and an LLM judge; refine hard cases using a better teacher model.
- Why it matters: Without this balance and consistency, models overfit to easy pages and keep breaking on tricky structures.
🍞 Anchor: Think of a practice binder that mixes easy essays, Chinese newspapers with vertical text, messy scans of receipts, and math worksheets—each labeled the same way. That’s the Data Factory.
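The balancing step above can be sketched as stratified sampling over (geometry, semantics) pairs. This is a minimal illustration under assumed field names (`geometry`, `semantics`) and an inverse-frequency boosting rule; the report's actual pipeline is not specified at this level of detail.

```python
import random
from collections import defaultdict

def stratified_sample(pages, n_samples, rare_boost=3.0, seed=0):
    """Sample pages so rare (geometry, semantics) strata are over-represented.

    Each page is a dict like {"id": ..., "geometry": "two_column",
    "semantics": "receipt"} -- illustrative field names, not the paper's schema.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for page in pages:
        strata[(page["geometry"], page["semantics"])].append(page)

    # Inverse-frequency weights: small strata (rare layouts) get boosted,
    # softened by the rare_boost exponent so common strata still appear.
    weights = {key: (len(pages) / len(group)) ** (1 / rare_boost)
               for key, group in strata.items()}
    keys = list(strata)
    key_weights = [weights[k] for k in keys]

    sampled = []
    for _ in range(n_samples):
        key = rng.choices(keys, weights=key_weights, k=1)[0]
        sampled.append(rng.choice(strata[key]))
    return sampled
```

With a 90/10 split between easy pages and rare nested-table receipts, the rare stratum ends up far above its 10% base rate in the training mix.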
This sets the scene: general VLMs understood content but ignored unbreakable structure rules. FireRed-OCR fixes that with better data and a training path that moves from seeing pixels to producing perfect, rule-following Markdown.
02 Core Idea
Aha! In one sentence: Turn a general VLM into a structure expert by pairing a balanced, layout-aware data engine with a three-step training curriculum that ends with rule-enforced rewards.
Three analogies for the same idea:
- Sports training: First you learn basic moves (seeing and pointing to regions), then you practice full plays with a coach (clean Markdown outputs), and finally referees enforce the rules in scrimmages (rewards for closed tables and valid formulas).
- Baking: Start by prepping ingredients (detect text boxes), bake the cake using a standard recipe (unified Markdown), then use a checklist to pass quality inspection (RL rewards catching missing tags or crooked tables).
- Driving school: Learn to steer and park (pre-alignment), pass the written test on road signs (SFT formatting), then drive with a safety monitor that beeps if you break rules (GRPO rewards for structural compliance).
🍞 Hook: You know how teachers grade not only your answer, but also whether you showed your work neatly and followed the format?
🥬 The Concept: Three-Stage Progressive Training
- What it is: A learning journey that goes from seeing details to writing clean structure to strictly obeying format rules.
- How it works:
- Multi-task Pre-alignment: Learn detection, region reading, and basic layout-to-Markdown to ground vision in structure.
- Specialized SFT: Train on high-quality, unified Markdown so the model standardizes headers, lists, tables, and formulas.
- Format-Constrained GRPO: Reinforcement learning with rewards that penalize broken syntax and reward clean, accurate structure.
- Why it matters: Skipping steps leads to brittle behavior—good words, broken containers. The steps ensure the model first sees clearly, then writes cleanly, then obeys strictly.
🍞 Anchor: It’s like learning music: scales (pre-alignment), then sheet music reading (SFT), then performing with a metronome that buzzes when you go off-beat (GRPO).
Before vs After:
- Before: General VLMs guess structure and sometimes mess up tables or formulas.
- After: FireRed-OCR outputs consistent Markdown with correct reading order, rectangular tables, matching headers, and compilable formulas.
Why it works (intuition):
- Balanced data means the model meets (and survives) rare layouts during training.
- Pre-alignment ties words to places, reducing spatial confusion.
- SFT locks in a single, predictable syntax style so the model doesn’t waffle.
- GRPO uses immediate feedback (rewards/penalties) to discourage broken outputs.
- Iterating SFT and GRPO keeps content meaningful while format stays strict.
Building blocks (what’s inside the idea):
- Data Factory with dual indexing (geometry + semantics), stratified sampling, unified Markdown re-annotation, synthetic generators, rule filters, LLM audits, and expert distillation for the toughest cases.
- Multi-task Pre-alignment to connect pixels to regions and early layout structure.
- Specialized SFT to standardize long-form, full-page Markdown outputs.
- Format-Constrained GRPO to enforce rules for formulas, closures, tables, and text accuracy.
- Balanced mixture strategy during RL to avoid one task (like tables) hurting another (like plain text).
- Iterative SFT↔GRPO loop to prevent reward hacking and forgetting.
🍞 Hook: Think of a video game where you score points only if your castle walls are fully closed and your towers match a blueprint.
🥬 The Concept: Reinforcement Learning (RL)
- What it is: A way for models to learn by trying, then getting points for good moves and penalties for bad ones.
- How it works:
- The model generates several candidate outputs.
- A rule-checker scores them (good if tables are closed, bad if formulas don’t compile).
- The model updates itself to prefer higher-scoring outputs next time.
- Why it matters: Supervised learning copies answers; RL enforces rules by rewarding correct structure.
🍞 Anchor: The model writes three versions of a table; the only one with equal columns per row gets the points, so the model learns to make all rows align next time.
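The "prefer higher-scoring outputs" step can be sketched as group-relative advantages: score each candidate, then normalize against the group so only better-than-average outputs get a positive learning signal. This is a generic GRPO-style sketch, not the report's exact update rule.

```python
def group_relative_advantages(scores, eps=1e-8):
    """GRPO-style advantages for one group of candidate outputs.

    Each candidate's score is centered on the group mean and scaled by the
    group's standard deviation; a positive advantage means "reinforce this".
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    std = var ** 0.5
    return [(s - mean) / (std + eps) for s in scores]
```

So in the three-table anchor above, the one well-formed candidate gets a positive advantage and the two broken ones get negative ones, shifting the model toward the aligned table.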
🍞 Hook: You know how board games have to follow exact rules, or the game falls apart?
🥬 The Concept: Format-Constrained GRPO
- What it is: An efficient RL method that compares a small group of model outputs to each other and rewards the ones that obey structure rules.
- How it works:
- Generate a few answers for the same page.
- Score each answer with rule-checkers: formula validity, tag closure, table rectangularity, text accuracy.
- Boost the model toward the better-scoring ones.
- Why it matters: It makes the model allergic to broken formatting, reducing structural hallucinations.
🍞 Anchor: If three students hand in homework, the one with correctly closed brackets, matching table columns, and accurate text gets the gold star—so next time, everyone imitates that style.
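Two of the rule-checkers named above (rectangular tables, matched delimiters) are simple enough to sketch directly. These are rough illustrative approximations, not the report's actual validators.

```python
def table_is_rectangular(md_table: str) -> bool:
    """True if every row of a Markdown pipe table has the same column count."""
    rows = [line for line in md_table.strip().splitlines() if line.strip()]
    counts = [line.strip().strip("|").count("|") for line in rows]
    return len(set(counts)) == 1 if counts else False

def brackets_balanced(latex: str) -> bool:
    """True if (, [, { all close in the right order (a rough LaTeX sanity check)."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in latex:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack
```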
03 Methodology
At a high level: Input (a document image) → Stage 1 (see and point: detection, region OCR, basic layout) → Stage 2 (write cleanly: unified Markdown) → Stage 3 (obey rules: RL with format rewards) → Output (pixel-precise, structurally valid Markdown).
Step 0: Data Factory (the training fuel)
- What happens: Pages are clustered by layout shape, tagged by language/source/genre, balanced to include rare layouts, re-annotated into one Markdown style, augmented with synthetic pages, filtered by rules and an LLM judge, and finally refined by a stronger teacher on hard cases.
- Why this step exists: Random data misses rare structures; mixed styles confuse the model; noisy labels teach bad habits.
- Example: A batch includes Chinese vertical-text newspapers, math-heavy worksheets with nested fractions, receipts photographed at an angle, and financial reports with multi-row headers—each with the same consistent Markdown style.
Step 1: Multi-task Pre-alignment (grounding vision)
- What happens (friend explanation):
- The model learns to both locate and read text (like pointing to a paragraph and saying what it says).
- It also gets early practice turning layouts into simple Markdown structures.
- Region prompts focus the model on small crops to sharpen local detail reading.
- Why this step exists: If the model can’t tie words to the right places, it will jumble the structure later.
- Example with data: Input is a form image; the model outputs bounding boxes with the words inside, plus a simple Markdown summary: headers, bullet lists for fields, and a small table.
🍞 Hook: You know how a treasure map helps you find the exact spot before you dig?
🥬 The Concept: Multi-task Pre-alignment
- What it is: Training the model to link visual spots to the text they contain and to sketch the basic document structure.
- How it works:
- Practice detection + reading together (find a box, read its text).
- Practice region OCR with crops or coordinates.
- Practice initial layout-to-Markdown mapping.
- Why it matters: It locks in the connection between where text is and what it says, reducing later mix-ups.
🍞 Anchor: The model sees a two-column page and learns to read left column first, then right, because it practiced mapping positions to order.
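The two-column reading-order intuition in the anchor above can be sketched as a sort: assign each box to a column by its x-center, then read columns left to right and boxes top to bottom. Real layout analysis is far richer; this toy assumes a clean split at the page midpoint.

```python
def two_column_reading_order(boxes, page_width):
    """Order (x, y, w, h) boxes: left column top-to-bottom, then right column.

    Assumes a clean two-column page split at the horizontal midpoint --
    a simplification of what layout-aware pre-alignment must learn.
    """
    mid = page_width / 2
    left = [b for b in boxes if b[0] + b[2] / 2 < mid]
    right = [b for b in boxes if b[0] + b[2] / 2 >= mid]
    return sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
```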
Step 2: Specialized Supervised Fine-Tuning (SFT) (standardizing output)
- What happens: Train on a curated set of high-quality, unified Markdown documents so the model uses one clean style for headers, lists, tables, and formulas.
- Why this step exists: Mixed styles make outputs unpredictable; standardized Markdown keeps everything consistent and parseable.
- Example with data: A scientific paper page becomes Markdown with exact header levels, a properly aligned table (same number of columns per row), and formulas written in the standard style used across the whole dataset.
🍞 Hook: Think of rewriting messy notes into a neat final draft with the same headings and bullet styles every time.
🥬 The Concept: Specialized SFT
- What it is: Coaching the model to produce one neat, agreed-upon Markdown style for full pages.
- How it works:
- Feed high-quality pairs: image → gold Markdown.
- Emphasize hierarchy (headers, lists), standard styling (bold/italic), and tables.
- Include many languages and complex layouts to build robustness.
- Why it matters: Uniform outputs are easier for tools to read and for the model to learn reliably.
🍞 Anchor: The same form from different sources always becomes the same Markdown pattern—so databases can import them without surprises.
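One part of style unification is simply detecting labels that mix HTML structure into what should be pure Markdown. The check below is a crude stand-in for that audit (the real pipeline rewrites the label rather than just flagging it); the tag list is an assumption for illustration.

```python
import re

def uses_unified_markdown(label: str) -> bool:
    """Return True if a training label avoids HTML structural tags.

    A crude stand-in for the re-annotation audit: it only detects
    mixed-style labels, whereas the real step rewrites them into
    the single agreed-upon Markdown style.
    """
    html_structure = re.search(r"</?(table|tr|td|th|ul|ol|li|h[1-6])\b",
                               label, re.IGNORECASE)
    return html_structure is None
```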
Step 3: Format-Constrained GRPO (enforcing rules)
- What happens: The model generates multiple candidate outputs; rule-checkers score them for formula validity, matched tags, rectangular tables, and content accuracy; the model shifts toward higher-scoring answers.
- Why this step exists: Supervision trains style but doesn’t punish broken rules. RL adds real consequences to bad structure.
- Example with data: A math worksheet with fractions and limits: only the candidate that keeps all brackets matched and tables aligned—and whose text matches the content—gets rewarded.
🍞 Hook: You know how a spellchecker underlines mistakes, but a strict grader also takes points off? That second part is GRPO.
🥬 The Concept: Format-Constrained GRPO
- What it is: A group-based RL method that prefers candidates obeying structure rules.
- How it works:
- Produce a small group of outputs for the same prompt.
- Score each on formula compiles, tag closures, consistent table columns, and text similarity to reference.
- Update the model to reproduce the best-behaving outputs more often.
- Why it matters: It turns fragile formatting into a solid habit.
🍞 Anchor: On an invoice, the only answer that keeps 5 columns across every row and closes the table markers wins the points—so next time, that’s what the model does.
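Putting the checks together, a composite reward for one candidate might look like the sketch below. The weights and the use of `difflib` for text similarity are assumptions for illustration, not the report's actual reward design; the structural flags would come from validators like the ones described above.

```python
import difflib

def format_reward(candidate: str, reference: str,
                  tables_ok: bool, formulas_ok: bool, tags_ok: bool) -> float:
    """Combine structural pass/fail checks with a text-accuracy term.

    Structural flags (rectangular tables, compilable formulas, closed tags)
    are assumed to come from separate rule-checkers; weights are illustrative.
    """
    structure = 0.2 * tables_ok + 0.2 * formulas_ok + 0.2 * tags_ok
    text_sim = difflib.SequenceMatcher(None, candidate, reference).ratio()
    return structure + 0.4 * text_sim
```

A candidate with perfect text but a broken table scores strictly lower than one that gets both right, which is exactly the pressure that makes the model "allergic" to broken formatting.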
Secret sauce (what’s especially clever):
- Dual-indexed data balancing (geometry + semantics) so the model constantly sees rare, tricky structures.
- Unified Markdown re-annotation to eliminate style confusion from mixed datasets.
- Synthetic renderers to supply perfect labels for ultra-rare cases like multi-page tables and heavy-nesting formulas.
- A two-layer quality filter (rules + LLM judge) and “hard-case distillation” using stronger teacher models to fix the toughest samples.
- Iterative SFT↔GRPO cycles that keep content meaningful while tightening structure discipline.
- Balanced RL mixtures that avoid one modality (tables) sabotaging another (plain text).
04 Experiments & Results
The test (what was measured and why):
- OmniDocBench v1.5: A broad test of document parsing that checks text accuracy, reading order, tables (TEDS), and formulas (CDM). This matters because real documents mix all of these.
- FireRedBench: A tough internal set with weird layouts (distorted scans, dense multi-columns, logic diagrams) to test robustness.
- OCRBench (Text): Focused on how cleanly text is read.
- PubTabNet (TEDS): Focused on table structure.
The competition (who it beat):
- Pipelines: PaddleOCR-VL-1.5, MinerU2.5, GLM-OCR.
- General VLMs: GPT-5.2, Gemini 3.0 Pro, Qwen3-VL-235B, Qwen3.5-397B.
- E2E OCR models: DeepSeek-OCR 2, dots.ocr, OCRVerse.
The scoreboard with context:
- OmniDocBench v1.5 Overall: FireRed-OCR scores 92.94%. Think of a class where most top students get around 88–91; FireRed-OCR gets a solid 93—a clear A.
- Text Edit: Best (lowest) error at 0.032 among E2E models, even surpassing some pipeline champions. That means very few character mistakes.
- Table TEDS: 90.31—like keeping every table row and column lined up across the page. It strongly beats many much larger VLMs.
- Reading Order: Top-tier (lowest edit), showing the model truly understands layout logic, not just words.
- OCRBench (Text): 93.5, higher than several massive general models.
- FireRedBench: 74.62—competitive with strong pipelines and notably more robust than many E2E baselines when layouts get nasty.
Surprising findings:
- Progressive annotation refinement: Starting with coarser labels, then moving to finer ones, works better than using only the finest labels from the start. It’s like learning big ideas first, then details—avoids getting stuck.
- Grounding constraints spill over: Applying format-constrained RL to grounding tasks (like coordinates) also improves the main page-generation task—discipline in “where” improves understanding of “what.”
- Balanced mixtures beat naive combos: Mixing RL data evenly across text, tables, and formulas performs better than just adding more of everything, because it prevents different tasks from fighting each other.
What this means practically:
- Structure stability: Closed tables, matched tags, and compilable formulas reduce the “it looks fine but won’t import” problem.
- Parameter efficiency: A ~2B-parameter specialist can beat or match huge generalists when the training is targeted and the data is balanced.
- Real-world reliability: From receipts to research papers, outputs are far more likely to be directly usable by tools and databases without manual fixing.
05 Discussion & Limitations
Limitations (honest take):
- Data dependence: The model’s strength comes from balanced, unified data. Very novel layouts that differ wildly from training may still trip it up.
- Ultra-long documents: Extremely long, multi-page documents with many cross-references can stress memory and formatting consistency.
- Domain-specific quirks: Niche notations or rare scripts not covered by tags or synthetic templates might degrade quality until more examples are added.
- Formula corner cases: Exotic math packages or unusual spacing conventions may still cause occasional compile-like issues without further normalization rules.
Required resources:
- Good GPUs for multi-stage training, especially RL with group sampling.
- A data pipeline capable of clustering, tagging, re-annotating to unified Markdown, rendering synthetic pages, and running quality filters.
- Optional access to a stronger “teacher” model for distillation of the hardest samples.
When NOT to use it:
- If you only need plain text and don’t care about structure (no tables, no formulas, no reading order), a simple OCR engine might be faster and cheaper.
- If your domain has a unique schema (not Markdown-like) and strict proprietary formatting, you may need customization or a different target format.
- If latency is critical on tiny devices with minimal compute, a lightweight pipeline might be more practical.
Open questions:
- Can we generalize structure rules across brand-new document genres without re-training (few-shot or rule transfer)?
- How far can iterative SFT↔GRPO cycles go before diminishing returns, and what’s the optimal schedule?
- Can we automatically learn new reward functions from human preferences to capture subtle style norms?
- What’s the best way to handle multi-document packs (attachments, appendices) with consistent cross-page references?
- How to robustly support extremely low-resource scripts and rare math symbols without heavy manual template work?
06 Conclusion & Future Work
Three-sentence summary: FireRed-OCR transforms a general VLM into an industrial-strength OCR specialist by pairing a Geometry + Semantics Data Factory with a three-stage progressive training strategy. The final GRPO stage uses format-constrained rewards so the model becomes allergic to broken tables, tags, and formulas. The result is state-of-the-art accuracy and structure stability on tough benchmarks, all in an efficient ~2B-parameter model.
Main achievement: Showing that disciplined data curation plus structure-enforcing RL can beat or match massive general VLMs on complex document understanding, delivering outputs that tools can actually consume.
Future directions:
- Expand the Data Factory to more languages, scripts, and niche document types; add smarter synthetic generators.
- Learn richer reward functions from human preferences to capture finer formatting tastes without handcrafting every rule.
- Improve long-document handling and cross-page consistency, including references, footnotes, and multi-page tables.
- Streamline on-device or low-latency variants for production at scale.
Why remember this: FireRed-OCR isn’t just a model—it’s a recipe for turning general models into structural experts. It proves that the right curriculum and rule-backed rewards can make smaller systems both accurate and reliable for real-world documents, from classroom worksheets to financial reports.
Practical Applications
- Automate invoice and receipt capture with tables and totals imported cleanly into accounting systems.
- Digitize research PDFs so formulas compile and tables align, enabling accurate search and analysis.
- Process government forms and legal contracts with reliable reading order and intact section structure.
- Batch-convert school worksheets and exams into standardized Markdown that grading tools can parse.
- Extract robust tables from financial reports, preserving row/column spans for downstream analytics.
- Archive historical newspapers with correct reading order, including vertical text in CJK languages.
- Transcribe handwritten notes or forms with consistent Markdown structure for databases.
- Build domain-specific document pipelines (medical records, lab reports) with unified formatting for EHRs.
- Create training corpora from synthetic documents to cover rare layouts (multi-page tables, nested math).
- Enable low-latency, on-prem OCR that still respects strict structure without massive models.