
DODO: Discrete OCR Diffusion Models

Beginner
Sean Man, Roy Ganz, Roi Ronen et al. (2/18/2026)
arXiv

Key Summary

  • OCR is like reading a page exactly as it is, and that strictness makes it perfect for fast, parallel generation.
  • Old OCR models wrote one token at a time (autoregressive), which is slow for long documents.
  • Masked diffusion can write many tokens at once, but it used to stumble on exact order and length, which OCR cannot forgive.
  • DODO fixes this by writing in chunks (blocks) that are anchored to what’s already confirmed, keeping order and length in line.
  • Inside each block, tokens are filled in parallel using discrete diffusion, often locking in 10+ tokens per step.
  • A faster variant, DODO fast, uses a KV-cache so finished blocks don’t get recomputed, tripling speed over autoregressive baselines.
  • On OmniDocBench, DODO reached a low Normalized Edit Distance (0.066), close to top specialized OCR systems.
  • Throughput reached about 63 tokens per second with DODO fast, about 3× the autoregressive baseline.
  • Standard full-sequence diffusion failed on dense OCR, but block diffusion succeeded because it prevents length and alignment drift.
  • This work shows diffusion can be both accurate and fast for rigid, exact-match tasks like OCR.

Why This Research Matters

Fast and accurate OCR powers everything from digitizing libraries to automating bills and forms. When systems like DODO read pages in parallel, apps feel snappier and companies save money on compute. Better throughput also means more accessible tools for people who rely on screen readers because text gets ready sooner and with fewer errors. Governments can process records faster, researchers can mine papers at scale, and students can turn class notes into searchable text instantly. By matching diffusion’s speed with OCR’s need for exactness, DODO makes high-volume document understanding more practical in the real world.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): Imagine you’re copying a page from a textbook. You can’t change any words, you can’t skip lines, and everything must be in the exact same order. That’s what computers have to do in OCR.

🄬 Filling (The Actual Concept): — What it is: Optical Character Recognition (OCR) is how computers read text from images exactly as printed. — How it works (recipe):

  1. Look at the picture of the page.
  2. Spot where the letters and lines are.
  3. Turn each shape into the correct letters and symbols.
  4. Output the text in the right order. — Why it matters: If OCR gets even one symbol out of place, the result can be wrong or unusable, especially for forms, tables, and code.

šŸž Bottom Bread (Anchor): Scanning your school schedule should give the exact class names and times—no missed periods or swapped rows.

šŸž Top Bread (Hook): You know how a tour guide points at a landmark while explaining it? That’s like seeing and speaking at the same time.

🄬 Filling (The Actual Concept): — What it is: Vision-Language Models (VLMs) are AI systems that look at images and produce or understand words about them. — How it works:

  1. A vision part turns the image into features.
  2. A language part turns features into words.
  3. Together, they match picture parts to token sequences. — Why it matters: Without VLMs, OCR would miss layout context (like where a table starts and ends) and mix things up.

šŸž Bottom Bread (Anchor): A VLM can see a two-column page and know to read left column first, then right.

šŸž Top Bread (Hook): Think of writing a sentence by saying one word, waiting, then saying the next, and so on.

🄬 Filling (The Actual Concept): — What it is: Autoregressive decoding predicts one token at a time, each depending on the previous ones. — How it works:

  1. Output the first token.
  2. Use it to guess the second.
  3. Repeat until the sentence ends. — Why it matters: It’s reliable for order, but slow for long documents because it needs one forward pass per token.

šŸž Bottom Bread (Anchor): Reading a 10-page PDF word-by-word takes ages; any pause slows the whole thing.

šŸž Top Bread (Hook): Imagine a worksheet with many blanks; if you’re sure of several answers, you fill them all at once.

🄬 Filling (The Actual Concept): — What it is: Masked Diffusion Models (MDMs) learn to fill in missing tokens by repeatedly unmasking confident positions. — How it works:

  1. Start with every token hidden (masked).
  2. At each step, predict likely tokens for hidden spots.
  3. Reveal the confident ones, keep uncertain ones masked.
  4. Repeat until all tokens are filled. — Why it matters: This can be fast because many tokens are decided in parallel, but only works well if each token can be decided independently given the context.

šŸž Bottom Bread (Anchor): If an image clearly shows ā€œTotal: $23.45ā€, the digits can be filled together with high confidence.

šŸž Top Bread (Hook): You know how a jigsaw puzzle is easier when each piece you place doesn’t force you to change faraway pieces?

🄬 Filling (The Actual Concept): — What it is: Conditional independence means tokens can be decided separately once the image and context are known. — How it works:

  1. Use the image to strongly constrain each token.
  2. Predict multiple masked tokens at once.
  3. Avoid conflicts because the input determines a single correct answer. — Why it matters: If independence holds, we can decode many tokens in parallel without creating nonsense.

šŸž Bottom Bread (Anchor): Reading a typed date like ā€œ2026-02-18ā€ from a crisp scan—each character is determined by the pixels.

šŸž Top Bread (Hook): Imagine writing with ink that dries instantly—you can’t erase what you wrote earlier.

🄬 Filling (The Actual Concept): — What it is: Carry-over unmasking means once a token is revealed, it won’t be revised in later diffusion steps. — How it works:

  1. Commit a token.
  2. Treat it as fixed context.
  3. Use it to predict the rest. — Why it matters: If an early token is wrong or misaligned, later steps can’t fix the structure; in OCR that can ruin the whole line.

šŸž Bottom Bread (Anchor): If you prematurely write [EOS] (the end token), the model may stop and cut off the rest of the page.

The world before DODO: Most OCR-capable VLMs used autoregressive decoding: accurate but slow for long pages. Standard masked diffusion promised parallel speed, but hit two big problems on rigid OCR: (1) Length mismatch—guessing too-short or too-long sequences causes truncations or padding hallucinations, and (2) Positional anchoring—placing a chunk at the wrong offset and being unable to shift it later because committed tokens can’t be edited.

The problem: OCR is rigid (there’s one correct answer), so small structural errors are fatal. Parallel diffusion without safeguards often fractures outputs when it mis-estimates length or absolute positions.

Failed attempts: Full-sequence masked diffusion for dense OCR struggled to converge; even with perfect (oracle) length information or late inference-time blocking, structural drift persisted because models weren’t trained to respect local anchors.

The gap: We need a way to keep the speed of parallel decoding but add the safety of causal order, so positions and length don’t drift.

What DODO adds: Block discrete diffusion. It splits the page’s token sequence into big blocks, decodes each block in parallel internally, but moves through blocks left-to-right, anchoring alignment at every block boundary. This preserves order and allows dynamic stopping without global length guesses.

Real stakes: Faster, reliable OCR means quicker digitizing of books, invoices, forms, and laws; better accessibility tools for screen readers; cheaper cloud costs for bulk scanning; and snappier apps on your phone that can transcribe documents in seconds, not minutes.

02Core Idea

šŸž Top Bread (Hook): Imagine cleaning a long hallway by teams: each team cleans its own section at the same time, and once a section is clean, you lock it so no dust can blow back. You finish fast and stay organized.

🄬 Filling (The Actual Concept): — What it is: DODO’s key idea is to decode OCR in big, left-to-right blocks using discrete diffusion inside each block, so many tokens are decided in parallel while block boundaries keep alignment and length under control. — How it works (high level):

  1. Split the token sequence into contiguous blocks (like 256 tokens each).
  2. For the current block, run masked diffusion: repeatedly reveal confident tokens in parallel.
  3. Commit the block, anchor it as the unchangeable prefix.
  4. Move to the next block, reusing the committed context (optionally via KV-caching), until [EOS]. — Why it matters: Without blocks, global parallel decoding can misplace or truncate text; with blocks, DODO stays fast and structurally correct.

šŸž Bottom Bread (Anchor): A two-column PDF is transcribed as a stream of tokens; DODO fills big chunks at once, keeps the stream aligned, and stops right on time at the end of each column.

Three analogies for the same idea:

  1. Parade floats: Floats (blocks) roll one after the other, but people on each float (tokens) work in parallel to decorate. The parade stays in order.
  2. Pizza assembly line: Each station (block) adds many toppings (tokens) at once, but the pizza moves left-to-right so cheese doesn’t end up under the crust.
  3. Comic book printing: Each page (block) is laid out in parallel, then pages are bound in the right order so the story reads correctly.

Before vs After: — Before: Autoregressive OCR was orderly but slow: one token per step. Full diffusion was fast but fragile: any length or position slip could break the page. — After: Block diffusion is both speedy and sturdy: parallel inside blocks, causal across blocks.

Why it works (intuition, no equations): — OCR is low-entropy (nearly one correct answer). That makes token predictions almost independent given the image. So parallel filling is natural. — The trouble is not the token content—it’s the structure (length and offsets). Blocks add short-range causality, pinning down where text belongs as we go. — This keeps diffusion’s parallelism while stopping long-range drift.

Building blocks of the idea:

  1. Parallel Decoding (inside blocks)
    • You know how: Teamwork gets many chores done at once.
    • What: Reveal many masked tokens simultaneously when confident.
    • Why: Big steps cut total passes dramatically.
  2. Block Discrete Diffusion
    • You know how: Divide-and-conquer makes long tasks manageable.
    • What: Decode 256-token windows via diffusion, one window after another.
    • Why: Prevents global misalignment and supports dynamic stopping.
  3. Causal Consistency (anchoring)
    • You know how: Build a train by locking each car to the previous one.
    • What: Once a block is done, it’s a fixed prefix the next block must follow.
    • Why: Stops tokens from drifting or overlapping across distant positions.
  4. KV-Caching (in DODO fast)
    • You know how: Keep notes so you don’t re-solve the same math problem.
    • What: Save key/value attention states for committed blocks.
    • Why: Reuse past computations and triple throughput.
  5. Confidence-driven unmasking
    • You know how: Answer the easy quiz questions first; leave tricky ones for later.
    • What: Reveal only tokens with high probability (e.g., p ≄ 0.98).
    • Why: Avoids locking in mistakes; speed comes from handling the obvious parts in big batches.

Anchor example: On a 148-token page, DODO finished in about 15 steps (ā‰ˆ10 tokens per step), with clear words early and tiny punctuation decided later—fast and exact.

03Methodology

At a high level: Document image → Visual features → Token canvas split into blocks → In-block masked diffusion (confidence thresholding) → Commit block (anchor) → Optional KV-cache reuse → Repeat for next block → Stop at [EOS] → Final transcript.

Step-by-step with the Sandwich pattern for each key component:

  1. OCR token canvas šŸž Hook: Imagine laying out blank lines where each character will go. 🄬 Concept: The model works on a sequence of tokens representing the page text.
  • How it works:
    1. Serialize the page into a linear text format (plain text/HTML/LaTeX as needed).
    2. Tokenize that string into vocabulary IDs.
    3. Treat generation as filling a masked sequence.
  • Why it matters: Without a clear canvas, we can’t measure where tokens belong or when to stop. šŸž Anchor: The title "Dose-Response Relationship..." becomes a series of tokens [Dose, -, Response, ...] placed in order.
  2. Block Discrete Diffusion šŸž Hook: You clean your room in zones: desk area, shelf area, floor area. 🄬 Concept: Split the long token sequence into contiguous blocks (e.g., 256 tokens) and decode one block at a time.
  • How it works:
    1. Partition the sequence into blocks: x(1), x(2), ..., x(B).
    2. For block b, run diffusion steps that iteratively unmask confident tokens.
    3. Commit the finished block, then proceed to block b+1.
  • Why it matters: Without blocks, parallel decoding can misplace large chunks or mis-estimate total length; blocks anchor progress. šŸž Anchor: Paragraph 1 is a block; once finalized, paragraph 2 starts exactly where paragraph 1 ends—no collisions.
  3. In-block masked diffusion with confidence thresholding šŸž Hook: When taking a test, you first fill answers you’re sure about and skip others until later. 🄬 Concept: Within a block, reveal only high-confidence tokens each step (e.g., p ≄ 0.98), leaving uncertain tokens masked.
  • How it works:
    1. Start the block fully masked.
    2. At each step, compute probabilities for tokens.
    3. Unmask all positions above the confidence threshold.
    4. Repeat until the block has no masks or you meet a stopping rule.
  • Why it matters: Committing only sure tokens avoids early mistakes that can’t be undone. šŸž Anchor: Clear words like ā€œPublished online:ā€ lock in quickly; tiny punctuation or rare symbols are decided in later steps.
  4. Causal consistency (prefix anchoring) šŸž Hook: Building a Lego tower: once a layer is clicked in, you stack the next on top. 🄬 Concept: After a block is finished, its tokens become a fixed prefix for the next block.
  • How it works:
    1. Treat previous blocks as immutable context.
    2. The next block’s attention can see (and must align to) the committed prefix.
    3. Stop the whole sequence when [EOS] appears at a block boundary.
  • Why it matters: Without this, blocks could slide around, breaking order. šŸž Anchor: If block 1 ends with a section header, block 2 must start with that section’s first paragraph, not drift into a table title.
  5. Two attention variants: DODO vs DODO fast — DODO (bidirectional inside the full sequence per step) šŸž Hook: Imagine teammates who can both talk to you and listen to you while you work, so everyone updates together. 🄬 Concept: Full bidirectional attention recomputes prefix representations each step.
  • How it works:
    1. Even fixed prefix tokens can attend to the current block.
    2. Representations refresh every step for global consistency.
    3. No KV-cache (prefix changes each step), but accuracy is strong.
  • Why it matters: Dynamic re-encoding helps with tricky layouts. šŸž Anchor: While transcribing a figure caption, earlier tokens subtly adjust representations to fit the new context.

— DODO fast (block-causal, exact KV-cache) šŸž Hook: Now imagine teammates who’ve finished their part and leave you neat notes—you don’t ask them again. 🄬 Concept: Block-causal attention freezes prefix states, enabling exact KV-caching.

  • How it works:
    1. The active block can see all previous blocks, but previous blocks can’t attend to the active one.
    2. Save the keys/values (KV) for the prefix once and reuse them.
    3. Compute only the active block at each step—big speedup.
  • Why it matters: Without exact caching, recomputing long prefixes would be slow. šŸž Anchor: For a 10-page document, you never re-encode already-decoded pages; you only process the current window.
  6. KV-caching šŸž Hook: Keep a bookmark in a long novel so you don’t reread from page one every time. 🄬 Concept: Store attention key/value tensors for finished blocks to skip recomputation.
  • How it works:
    1. After committing a block, cache its KV.
    2. For the next block, reuse cached KV as fixed context.
    3. Only process new tokens.
  • Why it matters: Saves compute and boosts tokens-per-second dramatically. šŸž Anchor: DODO fast reaches ~63 tokens/sec versus ~21 tokens/sec for autoregressive on the same backbone.
  7. Training setup (essentials) šŸž Hook: Practice with many pages so you get good at filling big chunks safely. 🄬 Concept: Train on OCR-heavy data with block-aware masks.
  • How it works:
    1. Backbone: Qwen2.5-VL-3B.
    2. Data: olmOCR-mix-1025 (~270K document-text pairs).
    3. Max length: 8192 tokens; block size up to 256 for stability and speed.
    4. Discrete diffusion objective with complementary masking and stratified time steps.
  • Why it matters: Training with blocks teaches the model to rely on local anchors; training full-sequence diffusion didn’t generalize. šŸž Anchor: Ablations show vanilla full-canvas diffusion had high edit distance, even with oracle length.
  8. Metrics and stopping šŸž Hook: Grading a spelling test needs a fair score and a stopwatch. 🄬 Concept: Normalized Edit Distance (NED) for accuracy; Tokens Per Second (TPS) for speed.
  • How it works:
    1. NED: Fewer edits needed to match ground truth means better quality (lower is better).
    2. TPS: More tokens per second means faster inference.
    3. Block-level [EOS] allows precise stopping without global length guessing.
  • Why it matters: OCR needs both correctness and low latency. šŸž Anchor: On OmniDocBench, DODO hit NED 0.066 and DODO fast reached ~63 TPS.
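The NED score can be computed with a standard Levenshtein dynamic program. This sketch normalizes by the longer string's length; normalization conventions vary across papers, so treat that choice as an assumption rather than the benchmark's exact formula.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev_row[j] + 1,                 # deletion
                           row[j - 1] + 1,                  # insertion
                           prev_row[j - 1] + (ca != cb)))   # substitution
        prev_row = row
    return prev_row[-1]

def ned(pred, truth):
    """Normalized Edit Distance: 0.0 is a perfect transcript (lower is better)."""
    if not pred and not truth:
        return 0.0
    return edit_distance(pred, truth) / max(len(pred), len(truth))

print(ned("Total: $23.45", "Total: $23.45"))               # 0.0
print(round(ned("Tota1: $23.45", "Total: $23.45"), 3))     # one substitution -> 0.077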

The secret sauce:

  • Match the nature of OCR (deterministic, low ambiguity) with diffusion’s parallel power.
  • Add block-wise causal anchors to prevent structural drift.
  • Use confidence thresholding to commit only sure tokens, and exact KV-caching (in DODO fast) to triple throughput.
  • Scale block size (256) thanks to OCR’s clarity, unlocking large parallel gains without breaking alignment.
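The exact-caching ingredient above can be sketched with a toy prefix cache. A real KV-cache stores per-layer attention key/value tensors, but the reuse pattern is the same: each committed block is encoded once and never again. The `encode` function here is an invented stand-in for that per-block computation.

```python
class PrefixCache:
    """Toy stand-in for exact KV-caching: encode each committed block once."""

    def __init__(self, encode):
        self.encode = encode   # expensive per-block computation (stand-in)
        self.store = {}        # block index -> cached "KV" state
        self.calls = 0         # how many times we actually ran `encode`

    def get(self, blocks):
        states = []
        for i, blk in enumerate(blocks):
            if i not in self.store:        # only new blocks are encoded
                self.calls += 1
                self.store[i] = self.encode(blk)
            states.append(self.store[i])
        return states

cache = PrefixCache(encode=lambda blk: sum(blk))
cache.get([[1, 2]])                  # encodes block 0
cache.get([[1, 2], [3, 4]])          # reuses block 0, encodes block 1
print(cache.calls)  # 2
```

This only stays exact under block-causal attention: because committed blocks never attend to the active one, their states are genuinely frozen, which is why DODO fast trains with that mask while the bidirectional variant cannot cache safely.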

04Experiments & Results

The test: Evaluate how accurately and how quickly models transcribe dense documents.

  • Accuracy metric: Normalized Edit Distance (NED) on benchmarks OmniDocBench (layout-heavy) and Fox-Page-EN (pure text).
  • Speed metric: Tokens Per Second (TPS) during inference.

The competition:

  • Specialized OCR systems: dots.ocr, DeepSeek-OCR, MonkeyOCR, Mistral OCR, etc.
  • General autoregressive VLMs: Qwen2.5-VL family (same backbone as DODO’s base).
  • Diffusion VLMs: Dimple, LaViDa, LLaDA-V.

Scoreboard with context:

  • DODO (3B) on OmniDocBench: NED 0.066. That’s like getting a solid A when other diffusion models are failing the test (Dimple ~0.856, LLaDA-V ~0.524).
  • DODO is competitive with strong specialized OCR tools and surpasses its own AR backbone (Qwen2.5-VL-3B at 0.184 NED on OmniDocBench).
  • On Fox-Page-EN (text-heavy), DODO reaches NED 0.041; DODO fast is 0.059—good for a high-speed variant.
  • Throughput: DODO ā‰ˆ 22.9 TPS (already matching or slightly beating cached AR ā‰ˆ 21.0 TPS); DODO fast ā‰ˆ 63.2 TPS (about 3× AR and far ahead of other diffusion VLMs).
  • Takeaway: DODO combines near-SOTA accuracy with best-in-class diffusion speed.

Surprising findings:

  1. Full-sequence diffusion fails on dense OCR even with oracle sequence length or inference-time blocking. The problem is structural drift, not just length guessing.
  2. DODO (without caching) can outperform a cached AR baseline because parallel decoding slashes the number of sequential steps needed.
  3. There’s a sweet spot in block size: around 256 tokens for the bidirectional variant. Larger blocks (512, 1024) reintroduce anchoring issues; smaller blocks reduce parallel gains.
  4. Approximate KV-caching with bidirectional attention collapses accuracy; exact KV-caching needs block-causal training (DODO fast) to work.

What the numbers mean in plain language:

  • Accuracy: Lower NED means the transcript almost exactly matches the ground truth. DODO’s 0.066 is far closer to perfect than prior diffusion VLMs on tough layouts.
  • Speed: If AR must take 100 steps for 100 tokens, DODO might take ~10–15 steps for the same output by locking in many tokens per step.

Why confidence thresholding matters:

  • A p = 0.98 threshold kept accuracy high while still giving big speedups. Less strict settings were faster but made too many typos for practical OCR.

Visuals (from the paper):

  • DODO resolves big, certain regions early and leaves tiny punctuation for later steps. Heatmaps of token commitment show broad, parallel progress with fine details refined at the end.

Overall: The experiments show that OCR’s ā€œone true answerā€ nature matches perfectly with block diffusion—parallel where it’s safe, causal where it counts.

05Discussion & Limitations

Limitations:

  • Speed–accuracy trade-off: DODO fast is much quicker but has higher edit distance than the bidirectional DODO because its prefix is frozen.
  • Sensitivity to structure: Results depend on choosing the right block size and attention mask. Too big and you risk internal misalignment; too small and you lose parallel speed.
  • Revisability: Carry-over unmasking still means no edits to committed tokens. While blocks reduce risk, catastrophic early errors inside a block can persist until the next boundary.
  • Domain dependence: The approach shines when the output is nearly deterministic (OCR). It may not transfer as well to creative or ambiguous tasks like captioning.

Required resources:

  • Training: Multi-GPU (e.g., 8× A100 40GB), long-context training (up to 8192 tokens), diffusion objectives, and OCR-focused datasets.
  • Inference: For DODO fast, exact KV-cache handling and block-causal masks; for DODO, more compute per step since the prefix is recomputed.

When NOT to use:

  • Open-ended generation (multiple valid answers), like storytelling or free-form image captions: independence breaks, parallel fills may conflict.
  • Very short texts where AR latency is negligible: the engineering overhead of block diffusion may not pay off.
  • Extremely noisy scans with ambiguous glyphs: if confidence stays low, parallel gains shrink and error risk rises.

Open questions:

  • Can we blend DODO and DODO fast—allow some controlled prefix refresh while keeping most caching benefits?
  • Can smarter samplers (beyond plain confidence thresholding) push speed without sacrificing accuracy—e.g., adaptive thresholds per token type (digits vs punctuation)?
  • Can we add lightweight, within-block revision steps (micro-rollback) to fix rare early mistakes without losing diffusion’s ELBO benefits?
  • How well does the approach generalize to multilingual OCR, handwriting, or math-heavy LaTeX beyond current datasets?
  • Can layout-aware serialization (e.g., small HTML tags) further stabilize alignment without bloating token length?

Bottom line: DODO nails the structure-speed balance for OCR, but the best recipe (block size, mask style, sampler) still needs careful tuning across domains.

06Conclusion & Future Work

Three-sentence summary:

  • DODO turns OCR into a fast, reliable block-by-block diffusion process, decoding many tokens in parallel while keeping alignment locked at each boundary.
  • This solves the biggest weakness of full-sequence diffusion for OCR—length and position drift—without falling back to slow, token-by-token autoregression.
  • The result is near-state-of-the-art accuracy with up to 3× higher throughput than strong autoregressive baselines.

Main achievement:

  • Proving that discrete, block-anchored diffusion can be a practical, high-throughput alternative to autoregression for rigid, exact-match tasks like OCR.

Future directions:

  • Hybrid attention that refreshes crucial parts of the prefix on demand while retaining most KV-caching speed.
  • Task-tailored samplers and dynamic block sizing that adapt to local difficulty (e.g., shrink blocks near tables or formulas).
  • Extension to multilingual, handwriting, and formula-rich documents with specialized tokenization and serialization.

Why remember this:

  • DODO shows a new path: when the world has one right answer (like exact text on a page), diffusion can be both fast and correct—if you give it blockwise safety rails.
  • It reframes diffusion from a creative tool to a precision instrument for deterministic vision-language tasks.
  • For anyone building document AI, this is a blueprint for turning parallelism into real-world latency wins without sacrificing fidelity.

Practical Applications

  • High-speed scanning of multi-page PDFs into accurate text for search and archiving.
  • Automated invoice and receipt processing with lower latency in accounting workflows.
  • Rapid e-discovery: transcribe legal documents quickly for keyword search and review.
  • Digitizing textbooks and research papers for accessible formats (screen readers, Braille).
  • On-device mobile scanning apps that transcribe documents faster with less battery drain.
  • Form processing in call centers or banks where exact field values must be read correctly.
  • Bulk digitization of historical archives where throughput and accuracy cut costs.
  • Real-time transcription of technical manuals with tables and formulas for field engineers.
  • Faster data extraction for RPA pipelines that depend on OCR upstream.
  • Preprocessing step for multimodal reasoning systems that require exact text from images.
#OCR #vision-language models #discrete diffusion #masked diffusion models #block diffusion #autoregressive decoding #parallel decoding #KV-caching #causal consistency #document transcription #Normalized Edit Distance #Tokens Per Second #confidence thresholding #block-causal attention #Qwen2.5-VL