ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images
Key Summary
- ExStrucTiny is a new test (benchmark) that checks if AI can pull many connected facts from all kinds of documents and neatly put them into JSON, even when the question style and schema change.
- It blends ideas from key-entity extraction, relation extraction, and visual question answering to better match real office tasks like processing forms, slides, reports, and web pages.
- Every answer must include the exact text, the page number, and a bounding box of where that text came from, so models must both read and point.
- The dataset has 304 query-answer pairs across 110 multi-page documents, with three query types: closed with plain text, closed with a schema, and on-demand (vague) requests.
- To build it, the authors combined careful human-made examples with many synthetic ones from a strong VLM, then had experts fix and validate them.
- They also invented a fair scoring method that uses a small text-only LLM to align different JSON shapes before computing accuracy, so models aren't punished for harmless formatting differences.
- Closed-source models currently win by a big margin on text extraction (about 18+ ANLS points over the best open model) and stay strong even when many values are requested.
- All models struggle to precisely point to answer locations (low bounding-box IoU), showing a gap between "getting the right text" and "proving where it came from."
- Models perform worse on harder query types (schema-heavy and on-demand), on reformulated questions with fewer word matches, and when answers are missing (unanswerable queries).
- Visual information clearly helps: using OCR text alone drops performance by about 10% ANLS compared to using the document images.
Why This Research Matters
Many everyday processes, such as paying bills, approving loans, onboarding patients, or tracking shipments, depend on quickly and accurately reading mixed documents. ExStrucTiny checks whether AI can adapt to different question styles and changing schemas, which mirrors how real users ask for information. By requiring exact locations (page and boxes), it supports trust, auditing, and compliance needs where proof matters. The benchmark's tough cases (low word-overlap, missing answers, multi-entity requests) discourage shortcuts and push true understanding. Findings reveal where current models fall short, especially in grounding, so engineers know what to fix. As models improve on ExStrucTiny, businesses can safely automate more of their document workflows, saving time and reducing errors.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine your class has a giant folder full of different papers (permission slips, lunch menus, report cards, and posters) and your teacher asks you to fill a spreadsheet with names, dates, and totals from all of them. Doing this by hand would take forever and you could still make mistakes.
🥬 The Concept (Structured Information Extraction): What it is: It's teaching computers to find important bits (like names, dates, amounts) in documents and put them into a tidy structure (like JSON) so other programs can use them. How it works (recipe):
- Look at the document (image + text + layout).
- Find the parts the user asked for (the "entities").
- Copy the exact text and also record where it was found (page and box).
- Place all of that into the right spots in a structured answer. Why it matters: Without structure, computers can't easily search, check, or combine the data; it's like dumping puzzle pieces in a bag without building the picture. 🍞 Anchor: A company receives scanned invoices and needs "invoice number," "vendor," and "total." Structured extraction fills those cells automatically.
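The recipe above can be sketched as a tiny JSON-building routine. This is an illustrative sketch only: the field names, the invoice values, and the 0-1000 normalized bbox convention are assumptions for the example, not the benchmark's exact format.

```python
import json

# Build one "extraction leaf": exact text, 0-based page index, and a
# bounding box [x0, y0, x1, y1] (assumed normalized to 0-1000 here).
def make_leaf(text, page, bbox):
    return {"text": text, "page": page, "bbox": bbox}

# Hypothetical structured output for the invoice anchor example.
extraction = {
    "invoice number": make_leaf("INV-2024-0093", 0, [120, 80, 310, 110]),
    "vendor": make_leaf("Acme Supplies Ltd.", 0, [120, 130, 420, 160]),
    "total": make_leaf("$1,284.50", 1, [700, 880, 860, 910]),
}

print(json.dumps(extraction, indent=2))
```

Downstream programs can now read each cell directly, and each value carries its own provenance.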
🍞 Hook: You know how some worksheets always ask the same few things, like your name and date, no matter the subject?
🥬 The Concept (Closed IE and Narrow Ontologies): What it is: Closed information extraction asks for a fixed set of entities (like always "name," "date," "amount"). How it works: Models train on one document type with a short list of target fields; they tag and extract those exact fields. Why it matters: It works great when you always see the same form, but it struggles when the form or the target fields change. 🍞 Anchor: A receipt dataset might always ask for "store name" and "total," which is perfect for receipts but not for slide decks.
🍞 Hook: Have you ever tried to answer a question about a picture and realized the question can be simple ("What color?") or tricky ("Which item is most expensive?")?
🥬 The Concept (VQA vs. Real-World Extraction): What it is: Visual Question Answering (VQA) asks a question about an image and expects an answer, often a short text span. How it works: The model reads the image and text, matches the question words to likely spots, and returns the answer. Why it matters: Many VQA tasks use simple, single answers and often share lots of words with the document, so string matching can sometimes cheat; real business tasks need multi-field, structured outputs. 🍞 Anchor: "What's the invoice number?" is easy; "List all line items with names, quantities, and prices" is the real-world challenge.
🍞 Hook: Suppose your teacher sometimes gives you a blank table to fill (schema), sometimes writes a sentence, and sometimes just says, "Find all info about the authors."
🥬 The Concept (Open, Closed, and On-Demand Queries): What it is: Three ways people ask for info.
- Closed (plain text): "Give me X and Y."
- Closed (with schema): "Fill this JSON with X and Y."
- On-demand: "Get all info about Z" (you must decide which child fields matter). How it works: The model must read both the request style and the document, then adapt the output. Why it matters: Real users don't always know exact field names. A good system must flex with any schema or vague request. 🍞 Anchor: "Extract signer ID and role" (closed), '[{"signer name":"","signer role":""}]' (schema), "All details about the signers" (on-demand).
🍞 Hook: Think of a backpack that can carry books (text) and a water bottle (images) at once.
🥬 The Concept (Vision-Language Models, VLMs): What it is: VLMs understand both pictures and words together. How it works: They encode images and text, connect them with attention, then generate or extract answers. Why it matters: Documents are visual (layout, tables, charts) and textual. Ignoring either side loses key clues. 🍞 Anchor: A VLM can read a chart's legend (visual) and labels (text) to correctly extract the value of the red bar.
Before this paper: Most datasets focused on simple, fixed lists of things or single short answers, often from one document type. That was fine for narrow tasks but didn't test flexible, multi-entity, structured extraction across many document styles. The problem: General-purpose VLMs need to handle various documents and changing schemas, but we lacked a fair, realistic test to measure this. Failed attempts: KEE datasets used small, fixed ontologies; many VQA sets asked single, easy questions with high word overlap, so models could answer by matching strings rather than truly understanding; table-only tasks ignored full document context. The gap: No benchmark required models to: 1) adapt to user-provided schemas, 2) extract many related fields at once, 3) include exact locations, and 4) survive low word-overlap and missing-answer cases across diverse document types. Real stakes: In banks, hospitals, and schools, lots of workflows depend on correctly reading mixed documents. If the AI can't adapt, humans must fix things by hand, slowing everything down and risking errors.
02 Core Idea
🍞 Hook: You know how a universal remote can control lots of different TVs and speakers because it adapts to each device's buttons? Imagine a "universal test" that checks whether an AI can adapt to any document and any set of fields you ask for.
🥬 The Concept (The Aha!): What it is: ExStrucTiny is a benchmark that unifies closed, schema-based, and on-demand extraction on real document images and scores models on both what they extract and where they found it. How it works:
- Provide diverse documents (forms, reports, slides, web pages) and three query styles (closed plain text, closed schema, on-demand).
- Require answers in JSON with exact text, page, and bounding boxes for every extracted value.
- Use a smart "schema mapper" (a small text-only LLM) to align differently-shaped JSON outputs before scoring, so fair comparison doesn't depend on naming or nesting.
- Evaluate text accuracy, structure similarity, and grounding (page and box). Why it matters: Without this, we could wrongly judge models just because they used a different but correct JSON shape, or because the question was flexible and they didn't adapt. This benchmark checks real skills businesses need. 🍞 Anchor: A user asks, "Extract all details about the signers." The model must discover the child fields (like name, role, date), pull the exact strings, show where they came from, and organize them in a neat list of objects.
Multiple analogies:
- Swiss Army Knife Test: Not just "Can you cut?" but "Can you cut, open, twist, and file?" ExStrucTiny tests many extraction skills at once.
- Treasure Map with GPS: Don't just bring the treasure (text). Show the GPS coordinates (page + box) to prove where you found it.
- Build-Your-Own Shelf: Sometimes we hand you the shelf blueprint (schema); sometimes we just say "store author stuff" and you must decide which compartments (fields) to add.
Before vs After:
- Before: Datasets often had single answers, fixed fields, and high word-overlap.
- After: ExStrucTiny demands multi-entity, schema-flexible, low-overlap, and sometimes unanswerable queries, across varied document types, with location evidence.
Why it works (intuition):
- Flexible queries + required JSON leaves (text, page, bbox) force models to both understand and ground answers.
- The LLM schema-mapper removes grading unfairness from different-but-equivalent JSON shapes.
- Low lexical overlap and unanswerable cases prevent "string matching" shortcuts and check true comprehension.
Building blocks (each explained with Sandwich):
- 🍞 Hook: Like labeling each Lego piece you use in a build. 🥬 The Concept (Extraction Leaves): What it is: Each final value comes with its text, page, and bounding box. How: For every field, store {"text", "page", "bbox"}. Why: So we know exactly what you used and where it came from. 🍞 Anchor: "Total: $123.45," page 2, box [120, 540, 280, 575].
- 🍞 Hook: Different people organize binders differently. 🥬 The Concept (Schema Mapping LLM): What it is: A small LLM aligns your JSON shape to the gold shape. How: Flatten keys, match by values and meaning, then compute scores. Why: Prevents penalizing correct answers just for using different key names. 🍞 Anchor: "buyer.name" can map to "customer.full_name" if the value matches.
- 🍞 Hook: A good report card doesn't just say "good"; it shows subject grades. 🥬 The Concept (Multi-part Scoring): What it is: Separate scores for text similarity, structure match, page correctness, and box overlap. How: Compute ANLS for text, tree-edit for structure, page accuracy, and IoU/proximity for boxes. Why: One number can hide weaknesses; multiple scores show where models need help. 🍞 Anchor: A model could get the right number but mark the wrong page: text score high, page score low.
03 Methodology
At a high level: Document images → Query (closed, schema, or on-demand) → Model outputs JSON with extraction leaves → Schema mapper aligns predicted vs. gold → Metrics computed (text, structure, page, bbox).
Step 1: Task setup (queries and answers)
- What happens: Each example has multi-page images and one query of three types: closed with plain text, closed with a schema, or on-demand (vague parent field). All answers must be JSON, and every final value must be an extraction leaf with {"text", "page", "bbox"}.
- Why this step: Forces consistent, structured results across many styles of asking.
- Example: Query: '[{"signer name":"","signer role":""}]'. Answer: a list of signer objects, each field holding a leaf with exact text, page index, and normalized bbox.
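Concretely, the schema query and a filled answer might look like the sketch below. The signer values and bbox coordinates are invented for illustration; only the query skeleton comes from the example above.

```python
import json

# A closed-with-schema query: an empty JSON skeleton the model must fill.
query = '[{"signer name":"","signer role":""}]'

# A hypothetical filled answer: each field holds an extraction leaf with
# the exact text, 0-based page index, and a normalized bounding box.
answer = [
    {
        "signer name": {"text": "J. Rivera", "page": 2, "bbox": [100, 700, 240, 730]},
        "signer role": {"text": "Treasurer", "page": 2, "bbox": [260, 700, 380, 730]},
    },
]

# The answer mirrors the query's structure: a list of objects, one per signer.
assert isinstance(json.loads(query), list)
print(json.dumps(answer, indent=2))
```

If the document listed three signers, the answer list would simply contain three such objects.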
Step 2: Manual seed data
- What happens: Experts annotate 102 high-quality examples across four sources: forms (FUNSD), financial reports (TAT-DQA), slide decks (SlideVQA), and web pages (VisualMRC). They design hard queries with multiple entities, low word-overlap, missing fields, cross-page values, and tricky layouts (like charts, checkboxes).
- Why: Sets the gold standard for difficulty and realism; provides strong few-shot examples.
- Example: "Extract cost center ID, department, and any events (type and date)." Some fields might be missing; answers include null leaves where data doesn't exist.
Step 3: Synthetic expansion with a strong VLM
- What happens: Using Gemini-2.5-Flash-Thinking, the team generates many more QAs, including extra-hard ones. They add chain-of-thought in the prompt for better structure-following and use higher temperature for diversity. Then they add targeted augmentations:
- Reformulations: rename entities to reduce word overlap.
- Unanswerables: request fields that don't exist (fully or partially).
- Why: Scales up variety and difficulty, mirroring real-life queries.
- Example: Change "study name" to "report topic," or add "project sponsor" when none exists.
Step 4: Human validation with editing
- What happens: Experts validate 202 synthetic QAs, fixing queries, text values, pages, and boxes, and ensuring every leaf is correctly formatted. On average, 25.5 edits per QA; only 2 QAs were rejected.
- Why: Ensures correctness and standards compliance.
- Example: If a box was slightly off, a validator tightens it; if a value didnât match the document exactly, they correct the string.
Step 5: Final dataset composition
- What happens: Combine 102 manual + 202 validated synthetic = 304 QAs over 110 documents. Keep a mix of difficulties: ~55% basic, ~25% reformulated, ~15% partially unanswerable, ~5% fully unanswerable.
- Why: Balanced, realistic test-bed for generalist extraction.
- Example: A slide deck query might ask for all authorsâ details (names, titles, emails) across multiple pages.
Step 6: Fair evaluation via schema mapping
- What happens: Models often produce valid but differently-structured JSON. To grade fairly, ExStrucTiny flattens gold and prediction trees and uses a small reasoning LLM (gpt-oss-20b) to map gold keys to predicted keys, with near-perfect mapping F1 (~0.976 in tests).
- Why: Prevents "format fights" from hiding true extraction quality.
- Example: "0.person.name" can match "authors.0.full_name" if the strings match closely.
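The flattening step can be sketched as a small recursive walk (the LLM matching itself is not shown). The dotted-path syntax with numeric list indices follows the example above; treating it as the benchmark's exact convention is an assumption.

```python
def flatten(node, prefix=""):
    """Flatten nested JSON into {dotted key path: value} pairs.
    List positions become numeric path components, e.g. 'authors.0.full_name'."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = ((str(i), v) for i, v in enumerate(node))
    else:
        return {prefix.rstrip("."): node}  # leaf value
    out = {}
    for key, value in items:
        out.update(flatten(value, f"{prefix}{key}."))
    return out

# Gold and prediction use different shapes but hold the same value.
gold = [{"person": {"name": "Ada Lovelace"}}]
pred = {"authors": [{"full_name": "Ada Lovelace"}]}
print(flatten(gold))  # {'0.person.name': 'Ada Lovelace'}
print(flatten(pred))  # {'authors.0.full_name': 'Ada Lovelace'}
```

After flattening, the mapper only has to pair up key paths whose values agree, which is a much easier job than comparing arbitrary trees.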
Step 7: Metrics
- What happens: Compute:
- Text similarity (ANLS) per matched leaf.
- Page accuracy.
- Box overlap (IoU) and proximity.
- Structure similarity (tree-edit distance).
- Why: Breaks the problem into "read correctly," "found the right place," and "organized it well."
- Example: If a model copies the right total but from the wrong page, text score is high but page score is low.
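The text metric can be sketched as follows. ANLS is standardly defined as 1 minus the normalized Levenshtein distance, zeroed below a similarity threshold; the 0.5 threshold and lowercasing here follow common DocVQA-style practice and are assumptions about this benchmark's exact settings.

```python
def levenshtein(a, b):
    """Classic edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(pred, gold, tau=0.5):
    """ANLS: 1 - normalized edit distance, zeroed below threshold tau."""
    if not pred and not gold:
        return 1.0
    nl = levenshtein(pred.lower(), gold.lower()) / max(len(pred), len(gold))
    sim = 1.0 - nl
    return sim if sim >= tau else 0.0

print(anls("$210,910", "$210,910"))  # 1.0
```

Because ANLS is character-based, "$210,910" vs. "$210,919" still scores high, which is exactly the number-awareness gap discussed in the limitations section.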
The secret sauce:
- Requiring extraction leaves (text + page + bbox) prevents shortcutting and enables grounding checks.
- The schema-mapper LLM makes evaluation robust to naming and layout differences.
- The dataset design (multi-entity, low overlap, unanswerables, multiple doc types) pushes beyond string-matching into true understanding.
Concept mini-lessons (Sandwich style):
- 🍞 Hook: Like returning a library book with its exact shelf address. 🥬 The Concept (Answer Localization): What: Prove where each answer comes from (page and box). How: Include page index and normalized bbox for every value. Why: Trust and verification matter for real workflows. 🍞 Anchor: Total amount = "$210,910," page 3, bbox [410, 612, 580, 640].
- 🍞 Hook: Sometimes you and a friend organize notes differently but mean the same thing. 🥬 The Concept (Tree-Edit Distance): What: A way to compare two JSON shapes. How: Count the minimal edits to turn one structure into the other. Why: Rewards matching organization, not just matching words. 🍞 Anchor: Two nested lists of signers vs. a flat list with index tags can still be structurally close.
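True tree-edit distance takes a dedicated algorithm (e.g., Zhang-Shasha). As a rough, much simpler stand-in, and not what the benchmark actually computes, structural closeness can be approximated by the overlap of key paths with list positions ignored:

```python
def paths(node, prefix=""):
    """Collect structural key paths, ignoring list positions entirely."""
    if isinstance(node, dict):
        out = set()
        for key, value in node.items():
            out |= paths(value, f"{prefix}{key}.")
        return out or {prefix.rstrip(".")}
    if isinstance(node, list):
        out = set()
        for value in node:
            out |= paths(value, prefix)
        return out
    return {prefix.rstrip(".")}  # leaf: the path that reaches it

def structure_similarity(a, b):
    """Jaccard overlap of key paths: 1.0 means identical skeletons."""
    pa, pb = paths(a), paths(b)
    return len(pa & pb) / len(pa | pb) if pa | pb else 1.0

nested = {"signers": [{"name": "A"}, {"name": "B"}]}
flat = {"signers": [{"name": "C"}]}
print(structure_similarity(nested, flat))  # 1.0 (same skeleton)
```

Unlike real tree-edit distance, this ignores how many edits separate the shapes; it only checks which field paths exist, which is enough to illustrate why two differently organized signer lists can still score as structurally close.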
04 Experiments & Results
The test: Evaluate many open- and closed-source VLMs of different sizes on ExStrucTiny with three few-shot examples (one per query type). Measure text extraction (ANLS), structure similarity, and grounding (page, bbox).
The competition: Open models (e.g., Qwen2.5-VL, Gemma-3, Pixtral, Mistral-Small, Kimi-VL) vs. closed models (Gemini-2.5 family). Inference used consistent settings and the schema-mapper for fair scoring.
Scoreboard with context:
- Text extraction (ANLS): Closed models lead. The best closed model (Gemini-2.5-Pro) averages ~79.5% ANLS overall and reaches ~81.2% on the manual subset; the best open model (Qwen2.5-VL-72B-FP8) averages ~61.4% ANLS. That's like an A- vs. a solid C.
- Query type difficulty: Closed-with-schema and on-demand queries score lower than closed plain text across almost all models, because schema queries request many more fields and on-demand queries force the model to infer the right child fields.
- Size helps: Within model families, bigger models do better (e.g., Qwen2.5-VL-3B → 72B shows large gains).
- Extraction length hurts open models: As the number of required values climbs (50+), open modelsâ scores drop sharply, while closed models stay steadier.
- Manual vs. synthetic: Manual QAs are ~13.6% harder on average; no evidence of favoritism toward Gemini-2.5-Flash despite generating synthetic items, likely thanks to rigorous human validation.
- Reformulations and unanswerables: On synthetic data, ANLS is ~62.4 for basic, ~56.6 for unanswerable, and ~45.7 for reformulated queries, showing sensitivity to wording changes and to correctly returning null for missing fields.
- Visual vs. text-only: A text-only baseline (OCR text without images) scores ~10% ANLS lower than the image-based run, proving layout/visual cues matter.
Grounding results:
- Page accuracy: Best ~84.3% (closed).
- Bounding-box IoU: Low across the board (best ~14.4%), meaning models often get the text right but not the exact rectangle.
- Box proximity: Best ~74.8%, suggesting models get "nearby" but not "perfectly overlapping." Interpretation: There's a gap between extracting correct text and precisely pointing to it, which is crucial for audits and compliance.
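The two box metrics can be sketched as follows. IoU is the standard intersection-over-union; the center-distance proximity shown here is one plausible variant, since the benchmark's exact proximity definition is not given in this summary.

```python
def box_area(b):
    """Area of an [x0, y0, x1, y1] box (0 if degenerate)."""
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union else 0.0

def center_proximity(box_a, box_b, scale=1000):
    """1 minus the distance between box centers, normalized by the
    coordinate scale (assumed 0-1000 boxes). Illustrative variant."""
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dist = ((cax - cbx) ** 2 + (cay - cby) ** 2) ** 0.5
    return max(0.0, 1.0 - dist / scale)

pred, gold = [100, 100, 200, 200], [110, 100, 210, 200]
print(round(iou(pred, gold), 3))  # 0.818: slightly shifted box
```

This illustrates the reported gap: a predicted box shifted by a few pixels can score high on proximity while its IoU drops quickly, and a box that lands near, but not on, the answer scores near-zero IoU.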
Structure and validity:
- Valid JSON and leaves: Larger models tend to produce more valid extraction leaves; some smaller ones output JSON that parses but contains too few correctly formed leaves.
- Schema mapping recall: Closed models recall more requested entities (Gemini-2.5-Pro/Flash ~88%/87%). Big open models improve with scale.
- Structure similarity: Several large open models approach closed models on tree similarity, showing that organizing outputs well is learnable with scale.
Context breakdown:
- Hardest contexts: Charts and dense free text. Many models register their lowest scores there, aligning with known challenges in chart understanding and distraction from surrounding content.
Surprises:
- Even when text is correct, grounding lags, so "answer found" does not mean "evidence precisely located."
- Reformulating entity names (lower word overlap) hurts a lot, signaling over-reliance on keyword matching.
- Unanswerable detection remains tough; models still try to "hallucinate" answers instead of returning null leaves.
05 Discussion & Limitations
Limitations:
- ANLS is character-based and not ideal for numbers and dates (about 26% of values). A model can be "off by one digit" but still useful, or vice versa, yet ANLS doesn't capture that nuance.
- English-only: The benchmark doesn't yet test multilingual extraction.
- Schema-mapper dependency: Using a text-only LLM to align schemas is accurate but slower than pure code and not 100% perfect.
Required resources:
- A VLM with multi-page image handling and JSON generation.
- GPU memory for inference at scale (the authors used four NVIDIA L40S GPUs).
- The schema-mapper LLM (e.g., gpt-oss-20b) for evaluation.
When NOT to use:
- If you only need a few fixed fields from one static form: traditional KEE datasets and specialist models may be simpler and faster.
- If perfect pixel-level bounding boxes are mandatory today: current models' IoU is low.
- If your use case is multilingual right now: ExStrucTiny is currently English-only.
Open questions:
- Better grounding: How do we tie text decoding and box prediction so "what" and "where" improve together?
- Metric design: Can we add number/date-aware scoring and partial-credit rules that reflect business utility?
- Robustness: How can we make models resilient to paraphrases (low word overlap) and confidently return null for missing fields?
- Efficiency: Can we speed up schema alignment without losing fairness?
- Coverage: What happens with more domains (legal, medical), more layouts (handwriting), and more languages?
06 Conclusion & Future Work
Three-sentence summary: ExStrucTiny is a realistic benchmark that asks models to extract many related facts from varied document images and to return them in flexible JSON formats with exact locations. It fairly scores both "what you found" and "where you found it," even when models choose different but equivalent JSON structures. Results show closed models currently lead, bigger models help, and everyone struggles with precise grounding, paraphrases, and unanswerables. Main achievement: A unified, schema-variable, grounding-aware benchmark plus a fair evaluation framework (with schema mapping) that mirrors real business extraction needs. Future directions: Add multilingual data, number/date-aware metrics, stronger grounding training, and broader document types (e.g., handwritten text, stamps, signatures). Explore faster, programmatic schema alignment and richer few-shot guidance. Why remember this: ExStrucTiny moves the field from toy questions toward the messy, structured, and verifiable extractions that real workflows demand, pushing models to understand, adapt, and prove their answers.
Practical Applications
- Invoice processing: Extract vendor, invoice number, due date, and totals with page-and-box evidence for audits.
- Form intake: Pull patient or customer details (including checkboxes) into structured records with nulls for missing fields.
- Report mining: Gather all occurrences of metrics (e.g., revenue by quarter) from long financial reports and link them to their exact pages.
- Slide summarization for compliance: List authors, titles, and affiliations from multi-slide decks, with grounding for each value.
- Webpage capture: Extract product specifications from screenshots, preserving the field structure and locations.
- Quality control: Flag unanswerable fields explicitly (null leaves) to avoid hallucinations in automated pipelines.
- Workflow routing: Use schema-flexible extraction to populate different downstream APIs that expect different JSON shapes.
- Analytics preparation: Aggregate multi-entity extractions (e.g., all line items) for dashboards, keeping provenance for traceability.
- RPA integration: Drive robotic process automation steps using structured outputs that include where to click (approximate boxes).
- Dataset curation: Generate and validate synthetic but realistic QAs to extend internal testing for new document types.