
Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Intermediate
Hanna Yukhymenko, Anton Alexandrov, Martin Vechev · 2/25/2026
arXiv

Key Summary

  • The paper builds an automated pipeline that translates AI benchmarks and datasets into many languages while keeping questions and answers correctly connected.
  • It fixes common translation problems like meaning drift and answer leaks caused by grammar (for example, gender endings giving away the right choice).
  • The pipeline uses test-time compute ideas—Self-Check, Best-of-N, USI (combine strengths), and T-RANK (multi-round ranking)—to pick or improve the best translation.
  • T-RANK reduces judging bias by rotating translation candidates across positions so the model doesn’t always favor the first one.
  • Across Ukrainian, Romanian, Slovak, Lithuanian, Bulgarian, Turkish, Greek, and Estonian, the new translations beat popular baselines like Global-MMLU, Okapi, and MuBench.
  • Quality checks with COMET and an LLM-as-a-judge show higher translation quality and more trustworthy evaluation scores.
  • Better translations change model scores on benchmarks (often by several percentage points), proving that benchmark quality strongly affects how we judge models.
  • USI tends to shine on shorter, simpler texts, while T-RANK is especially good for tricky benchmark questions that can lose structure during translation.
  • The framework is configurable (dataset vs benchmark mode) and supports open-weight or closed models, balancing cost and quality.
  • All code and improved benchmarks are released to encourage fairer, reproducible multilingual AI.

Why This Research Matters

Accurate translations make AI tests fair across languages, so we truly compare models on skill, not luck. When we keep questions and answers aligned, we prevent hidden hints that inflate scores and mislead research. This framework helps communities with mid-resource languages get reliable tools and evaluations, improving inclusivity in AI. It reduces the need for costly manual translation while still catching subtle grammar and terminology issues. Developers can ship multilingual features with more confidence because benchmark scores better reflect real performance. Educators, governments, and companies can trust that AI systems work equally well in Ukrainian, Greek, Turkish, and beyond.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine you and your friends are building a giant quiz that works in many languages. If the questions and answers get mixed up during translation, players might score points for the wrong reasons.

🥬 Filling (The Actual Concept) — Machine Translation:

  • What it is: Machine translation lets computers turn text from one language into another.
  • How it works: 1) Read the source sentence, 2) Predict the meaning, 3) Generate the target sentence that matches the meaning and style, 4) Check for grammar and terms.
  • Why it matters: If translations are sloppy, the meaning can drift, and people (or models) are judged unfairly.

🍞 Bottom Bread (Anchor): Turning “The cat sat on the mat” into Spanish as “El gato se sentó en la alfombra” keeps the same idea and sounds natural.

🍞 Top Bread (Hook): You know how tests in school must be fair? If one version of the test secretly hints at the answer, that’s not fair at all.

🥬 Filling — Benchmark Evaluation:

  • What it is: Benchmark evaluation is how we test AI fairly with the same questions across languages.
  • How it works: 1) Prepare the same task in each language, 2) Give it to models, 3) Compare scores, 4) Decide what’s better.
  • Why it matters: If translations change the difficulty or reveal answers, the scores lie about how smart the model really is.

🍞 Bottom Bread (Anchor): If the English quiz says “Choose the right ending” but the translated version gives away the answer due to grammar, a model can ace it without real reasoning.

The world before: Many multilingual benchmarks were translated with older tools or in a way that broke the original structure. Often, questions and answers were translated separately. In languages with rich grammar (like Ukrainian, Greek, or Turkish), that can cause trouble: a single word ending might reveal which answer is right. This is called answer leakage. Other times, small choices—like picking the wrong scientific term—shift the meaning (semantic drift). As a result, models could look smart just because the translation helped them cheat, or look weak because the translation made things harder than in English.

🍞 Top Bread (Hook): Imagine hearing a story in pieces—first the ending, then the middle, then the beginning. It’s easy to get confused!

🥬 Filling — Preserving QA Structure:

  • What it is: Keeping the question and all answer options together during translation so their grammar and meaning fit each other.
  • How it works: 1) Translate question and answers in one prompt, 2) Keep roles and forms aligned (like gender and case), 3) Double-check that none of the answers leak clues accidentally.
  • Why it matters: Without this, grammar mismatches can either hide the right answer or make it pop out.

🍞 Bottom Bread (Anchor): In a fill-in-the-blank sentence, using a verb form that only matches one noun option can reveal the answer. Good translation avoids that.

Failed attempts and gaps: Projects like MuBench, Global-MMLU, and Okapi made big strides, but they often relied on: (1) classic machine translation tools that don’t handle instruction-style prompts, (2) older LLMs with weaker multilingual skills, (3) translating parts separately, and (4) limited or uneven human checks. There was no easy, flexible way to choose methods based on language difficulty or benchmark type, and little focus on preserving the exact testing structure.

Real stakes: If we judge models with broken translations, we pick the wrong models, ship the wrong features, or draw wrong research conclusions. That affects students using AI tutors, patients reading translated instructions, and communities whose languages deserve accurate representation. The paper’s goal is to make translated benchmarks both scalable and trustworthy, especially for mid-resource Eastern and Southern European languages where grammar can change meaning in subtle ways.

🍞 Top Bread (Hook): Think of a chef trying several recipes before choosing the best one for a cooking contest.

🥬 Filling — Test-time Compute Scaling (general idea):

  • What it is: Using extra “thinking” at answer time by generating multiple candidates and picking or improving the best.
  • How it works: 1) Sample multiple translations, 2) Compare or combine them, 3) Output the winner, 4) Optionally refine further.
  • Why it matters: Without extra thinking, a single shot might miss key nuances, especially in tricky grammar.

🍞 Bottom Bread (Anchor): Asking three friends to write their best version of a paragraph, then merging their best parts, usually beats taking the first draft from just one friend.
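The sample-then-select pattern can be sketched in a few lines of Python. Everything here is an illustrative stand-in, not the paper's implementation: `best_of_n`, the toy `drafts`, and the length-based scorer simply mimic an LLM sampler and a quality judge so the sketch runs offline.

```python
import random

def best_of_n(sample, score, n=5, seed=0):
    """Test-time compute sketch: draw n candidates and keep the best.

    `sample` and `score` stand in for an LLM sampler and a quality
    judge; here they are plain Python callables."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: cycle through three drafts, score by completeness (length).
drafts = [
    "El gato",
    "El gato se sentó",
    "El gato se sentó en la alfombra",
]
pool = iter(drafts * 2)
best = best_of_n(lambda rng: next(pool), len, n=6)
# `best` is the longest (most complete) draft
```

In a real pipeline the sampler would be an LLM called at a higher temperature and the scorer an LLM judge; the selection logic stays this simple.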

This paper introduces a configurable, automated pipeline that treats translation as a careful competition and collaboration among candidate translations. It preserves question–answer structure, adapts prompts to language quirks, and uses judging steps to rank, fuse, and refine results. The framework balances cost and time while raising translation quality—leading to fairer, more accurate multilingual model evaluation.

02Core Idea

🍞 Top Bread (Hook): Imagine you have five drawings of the same scene. Instead of picking the first one you see, you compare all of them, choose the best parts, and even improve the winner.

🥬 Filling — The Aha! Moment:

  • What it is: Treat translation like a game where multiple candidate translations compete, a judge ranks or fuses them, and the final answer is refined—while always keeping questions and answers together.
  • How it works: 1) Generate several translations at a higher creativity setting, 2) Either rank them (T-RANK) or merge their strengths (USI), 3) Refine the winner, 4) Preserve QA structure and language-specific grammar.
  • Why it matters: Without this, subtle grammar issues cause answer leakage or meaning drift, making benchmark scores untrustworthy.

🍞 Bottom Bread (Anchor): Translating a multiple-choice science question into Ukrainian with all options in one go keeps the grammar aligned, so none of the options accidentally give away the answer.

Three analogies:

  1. Team of editors: Many editors write versions; one editor picks the clearest lines from each and polishes the final story (USI).
  2. Talent show: Performers compete across several rounds, changing their stage order each time to avoid favoritism; the top act wins and gets coached before the final performance (T-RANK).
  3. Safety checklist: For planes, one checklist isn’t enough—you cross-check multiple instruments and inspectors (Best-of-N and Self-Check) to avoid hidden mistakes.

Before vs. after:

  • Before: One-off translations with little checking led to grammar mismatches, wrong terms, and answer leakage in benchmarks.
  • After: Multi-candidate sampling, ranking/fusion, and explicit QA-structure preservation yield translations that are more faithful, natural, and fair—raising benchmark reliability.

Why it works (intuition):

  • Diversity beats luck: More candidates at a higher temperature uncover better phrasings and reveal hidden pitfalls.
  • Comparison sharpens judgment: Ranking or combining forces the model to reason about trade-offs (fluency vs. faithfulness, terminology vs. style).
  • Structure locks fairness: Keeping question and answers together aligns grammar and prevents leaks.
  • Bias control: Rotating candidate positions reduces the judge’s habit of preferring earlier items.

Building blocks (with sandwich explanations):

🍞 Hook: You know how you write a draft, then read it aloud to catch mistakes? 🥬 Self-Check (SC):

  • What it is: The model translates once, then checks and corrects itself.
  • How it works: 1) Translate, 2) Start a fresh review, 3) Fix issues, 4) Output.
  • Why it matters: It’s cheap and fast for easy texts; without it, small errors slip through.
    🍞 Anchor: A short news headline often needs just one careful pass to be great.

🍞 Hook: Choosing the best cookie from a whole tray is smarter than grabbing the first one. 🥬 Best-of-N:

  • What it is: Generate N translations and pick the highest-scoring one.
  • How it works: 1) Sample N versions, 2) Score each on clarity, completeness, and grammar, 3) Pick the best.
  • Why it matters: It boosts quality over single-shot, but simple scoring can miss subtle issues.
    🍞 Anchor: Trying five taglines and keeping the one that sounds clearest.

🍞 Hook: When friends combine their best sentence ideas, the paragraph becomes impressive. 🥬 Universal Self-Improvement (USI):

  • What it is: The judge fuses the best pieces from multiple candidates into one improved translation.
  • How it works: 1) Sample N versions, 2) Judge merges the best phrases/terms, 3) Output the refined blend.
  • Why it matters: It keeps strengths and drops weaknesses; without it, you might miss the perfect mix.
    🍞 Anchor: Merging strong scientific terms from one version with the smooth grammar of another.
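The fusion idea can be caricatured in Python. Note the heavy simplification: in the paper, USI is a single LLM judge call that merges whole candidates, so `usi_fuse` with its per-segment maximum is only a toy approximation of "keep the best piece from each version."

```python
def usi_fuse(candidates, score):
    """Toy USI sketch: each candidate is a list of aligned segments.

    Keep the highest-scoring version of each segment across candidates,
    then join them. A real fusion step is one LLM call, not a max()."""
    fused = [max(versions, key=score) for versions in zip(*candidates)]
    return " ".join(fused)

# Two candidate translations, each split into two aligned segments;
# length is again a stand-in for a real quality score.
merged = usi_fuse(
    [["a bad start", "great end."], ["great start,", "bad end"]],
    len,
)
```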

🍞 Hook: In a fair contest, each dancer performs in different spots in the lineup so judges don’t favor the first. 🥬 Translation Ranking (T-RANK):

  • What it is: A multi-round ranking where candidates rotate positions to reduce bias; the top one is optionally refined at the end.
  • How it works: 1) Sample N versions, 2) Rank them by criteria (quality, domain terms, completeness, grammar), 3) Rotate order and re-rank across rounds, 4) Refine the winner.
  • Why it matters: Without rotation, the judge may prefer early candidates and miss later gems.
    🍞 Anchor: Cycling five translations through positions 1–5 so each gets a fair look before crowning a winner.
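The rotation-and-rank loop can be sketched as follows. The function `t_rank` and the rank-summing rule are my reading of the multi-round idea, not the paper's exact algorithm; `judge_rank` stands in for an LLM judge.

```python
def t_rank(candidates, judge_rank):
    """Multi-round ranking sketch: rotate the candidate lineup each round
    so no candidate always appears first, sum ranks across rounds, and
    return the candidate with the lowest total.

    `judge_rank(ordered)` stands in for an LLM judge; it must return one
    rank (0 = best) per item in `ordered`."""
    n = len(candidates)
    totals = {c: 0 for c in candidates}
    for r in range(n):  # one round per starting position
        ordered = candidates[r:] + candidates[:r]  # rotate the lineup
        for cand, rank in zip(ordered, judge_rank(ordered)):
            totals[cand] += rank
    return min(totals, key=totals.get)  # lowest cumulative rank wins
```

Because every candidate occupies every position exactly once, a judge that leans toward early positions spreads that bias evenly across all candidates instead of always favoring the same one.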

🍞 Hook: Speaking to each friend in the way they understand best helps them help you. 🥬 Language-specific prompt engineering:

  • What it is: Tailoring instructions and few-shot examples to a language’s tricky grammar or domain terms.
  • How it works: 1) Show do’s/don’ts (e.g., don’t translate code), 2) Give in-language examples, 3) Emphasize gender/case alignment, 4) Ask checks for leakage.
  • Why it matters: Without it, translations sound odd or leak answers in languages with rich morphology.
    🍞 Anchor: In Ukrainian or Greek, choosing the right noun case keeps options aligned with the question.
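A language-specific prompt builder might look like this. The rule strings are paraphrases of the kinds of instructions the paper describes; `build_prompt` and its wording are hypothetical, not the actual prompts used.

```python
def build_prompt(lang, question, options):
    """Illustrative prompt builder for structure-preserving translation.

    The per-language rules below paraphrase the paper's ideas (case and
    gender alignment, leaving code untouched); they are not its prompts."""
    rules = {
        "uk": "Match gender and case across all options; do not translate code.",
        "el": "Keep noun cases consistent between the question and the options.",
    }
    numbered = "\n".join(f"{i}) {o}" for i, o in enumerate(options, 1))
    return (
        f"Translate the question and ALL options together into {lang}.\n"
        f"Rule: {rules.get(lang, 'Preserve meaning and QA structure.')}\n"
        f"Check that no option is grammatically singled out.\n\n"
        f"Question: {question}\nOptions:\n{numbered}"
    )

prompt = build_prompt("uk", "Which is an aquatic organism?", ["frog", "cat"])
```

Translating the question and options in one prompt is what keeps morphology aligned; per-language rules then target the specific ways that alignment can break.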

Put together, these pieces form a pipeline that preserves benchmark integrity while improving translation quality—leading to fairer multilingual evaluations.

03Methodology

High-level recipe: Input (English text) → Candidate generation (SC/Best-of-N/USI/T-RANK) → Judgment (score, fuse, or rank) → Optional refinement → Output (faithful, grammar-aligned translation keeping QA structure).

Two configuration modes:

  • Dataset mode: Flat fields, simpler structure, ideal for shorter texts and when context is straightforward.
  • Benchmark mode: Special handling to keep question and all answer choices together; designed to prevent answer leakage and semantic drift.

Step-by-step (with why and examples):

  1. Choose mode (dataset or benchmark)
  • What happens: You select how the framework treats your input.
  • Why it exists: Benchmarks have tight links between question and answers; separating them causes grammar misalignment.
  • Example: A Winogrande fill-in-the-blank is translated as one unit so verb forms don’t reveal the answer.
  2. Prepare prompts (general + language-specific)
  • What happens: The system builds instructions to the model, including rules (e.g., do not translate code), evaluation criteria, and language-specific checks.
  • Why it exists: Clear instructions guide accurate, consistent translations across entries and languages.
  • Example: For Ukrainian, prompts remind the model to match gender and case, avoiding “водяний організм” (watery) when “водний організм” (aquatic) is correct.
  3. Select method by difficulty and cost
  • What happens: Pick SC for easy text, Best-of-N for a cheap boost, USI for combining strengths, or T-RANK for rigorous comparisons on tricky benchmarks.
  • Why it exists: Different texts and languages need different levels of compute at test time.
  • Example: Short WMT sentences → USI. Complex science MCQs → T-RANK.
  4. Generate candidates at higher temperature
  • What happens: The model samples multiple translations (or multiple prompts) to increase diversity.
  • Why it exists: Diversity surfaces better phrasings and exposes hidden pitfalls.
  • Example: Five candidate versions of a biology question produce variants that differ in terminology and word order.
  5. Judge and select
  • SC: Translate once, then self-review and correct.
  • Best-of-N: Score each candidate 1–10 on quality, completeness, domain accuracy, and grammar; pick the max.
  • USI: Merge the best fragments from all candidates into one improved version.
  • T-RANK: Rank candidates, rotate their positions across rounds to reduce positional bias, then choose the top one and refine.
  • Why it exists: Careful judging turns diversity into quality; rotation reduces blind spots.
  • Example: A judge spots that candidate 3 preserves “lifespan” (not “life cycle”), while candidate 2 nails parental-care phrasing; USI fuses both.
  6. Preserve QA structure and align grammar
  • What happens: Output keeps question and all answer options together, checking that morphology (case, gender, number) stays consistent.
  • Why it exists: Prevents answer leakage and maintains fairness.
  • Example: The Ukrainian verb is written with masked endings so it doesn’t match only one noun option.
  7. Optional refinement
  • What happens: The chosen best translation is polished for fluency and domain correctness.
  • Why it exists: Final pass removes tiny glitches that can affect clarity.
  • Example: Adjusting “життєвий цикл” to “тривалість життя” when the meaning is lifespan, not life cycle.
  8. Output and logging
  • What happens: Save the translation, chosen method, candidate scores/ranks, and judgment notes for reproducibility.
  • Why it exists: Transparent records help audits and future improvements.
  • Example: A CSV/JSON with the final text and the selection trail.
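The logging step above might produce a record like the one sketched here; `TranslationLog` and its field names are invented for illustration, not the framework's actual schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TranslationLog:
    """Sketch of a per-entry audit record (illustrative field names)."""
    source: str
    final: str
    method: str           # which pipeline method produced the output
    candidate_scores: list  # judge scores or ranks for each candidate
    judge_notes: str = ""

log = TranslationLog(
    source="The cat sat on the mat.",
    final="El gato se sentó en la alfombra.",
    method="USI",
    candidate_scores=[7, 9, 8],
)
# One JSON line per entry keeps the selection trail auditable.
line = json.dumps(asdict(log), ensure_ascii=False)
```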

The four methods in action (sandwich reminders):

🍞 Hook: Quick double-checking catches typos before you hand in homework.
🥬 Self-Check (SC): One translation plus a self-review; great for easy text; risks over-correction if the model hallucinates errors.
🍞 Anchor: Works fine on simple news headlines.

🍞 Hook: Tasting several cupcakes before picking the winner is smarter than biting only the first.
🥬 Best-of-N: Multiple candidates scored by criteria; cheap and language-agnostic; can be biased or miss subtle issues.
🍞 Anchor: Choose the clearest tagline among five.

🍞 Hook: Combining your friends’ best sentences makes a stellar essay.
🥬 USI: Fusion of candidate strengths into one coherent output; cost ≈ N + 1 calls; excels on short/medium texts, handles language quirks.
🍞 Anchor: Merge the best scientific term from one version with the smooth grammar of another.

🍞 Hook: A fair talent show rotates the lineup so first-in-line doesn’t always win.
🥬 T-RANK: Multi-round ranking with rotation; cumulative reasoning surfaces subtle errors; final refinement improves the top choice; cost ≈ 2N + 1 calls.
🍞 Anchor: Five candidates each appear once at positions 1–5; the final winner is then polished.
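As a quick arithmetic check of those call budgets: only USI ≈ N + 1 and T-RANK ≈ 2N + 1 are stated in the text; the SC count (translate + review) and the Best-of-N count (N samples + one judging call) are my own reading.

```python
def llm_calls(method, n):
    """Back-of-envelope LLM-call budget per entry.

    USI (N + 1) and T-RANK (2N + 1) follow the text; SC = 2 and
    Best-of-N = N + 1 are assumptions for comparison."""
    return {
        "SC": 2,                # one translation + one self-review
        "Best-of-N": n + 1,     # n samples + one judging call (assumed)
        "USI": n + 1,           # n samples + one fusion call
        "T-RANK": 2 * n + 1,    # n samples + ~n ranking rounds + refinement
    }[method]
```

With five candidates, USI costs about 6 calls per entry while T-RANK costs about 11, which is why the cheaper methods stay attractive for easy text.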

Concrete example (simplified):

  • English: “We see a girl run and perform a high jump and make it over the bar. Which ending makes the most sense?” Options A–D.
  • Mode: Benchmark mode is chosen, keeping the question and all options together.
  • Candidate generation: Produce 5 translations, each handling sport phrasing differently.
  • Judge: T-RANK ranks them on clarity, completeness, correct sport terms (e.g., “перелазить через планку” vs awkward phrasing), and grammar agreement.
  • Rotation: Repeat ranking with shifted orders to reduce position bias.
  • Selection and refine: Choose the top candidate and fix small phrasing issues.
  • Output: The final question and options align naturally in Ukrainian; no option is grammatically singled out.

Secret sauce:

  • Multi-candidate diversity + structure-aware judging yields robust translations.
  • Rotation mitigates positional bias, encouraging careful review of later candidates.
  • QA-structure preservation prevents leakage.
  • Language-aware prompts and checks address morphology and terminology head-on.
  • The pipeline is configurable to balance cost and quality per language and task.

04Experiments & Results

🍞 Top Bread (Hook): If you change the questions on a test, even a little, the scores can change a lot. So we must check that our translations are really good.

🥬 Filling — LLM-as-a-judge and COMET:

  • What it is: Two ways to measure translation quality—an LLM judge compares human-like qualities, and COMET is an automatic metric that aligns well with human ratings.
  • How it works: LLM judge picks the better of two translations; COMET compares a translation to a human reference (or rates it without a reference in QE mode).
  • Why it matters: Without strong checks, we can’t tell if we fixed the real problems or just moved them around.
    🍞 Bottom Bread (Anchor): A teacher (LLM judge) explains why essay A beats essay B, while a rubric (COMET) gives a score.
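Pairwise LLM judging has the same positional-bias problem as ranking, and the standard mitigation is to ask twice with the order swapped. The sketch below is a common debiasing pattern, not necessarily the paper's exact protocol; `judge` stands in for an LLM comparison call.

```python
def debiased_pairwise(judge, a, b):
    """Ask the judge twice with the pair swapped; count a win only when
    both orderings agree, otherwise call it a tie.

    `judge(x, y)` stands in for an LLM judge: it returns 0 if the
    first-shown translation is better, 1 if the second is."""
    first = judge(a, b)        # a shown first
    second = 1 - judge(b, a)   # b shown first; map back to (a, b) indexing
    if first == second:
        return ("a", "b")[first]
    return "tie"
```

A judge that blindly favors whichever translation appears first will contradict itself across the two orderings and land on "tie," so its bias cannot manufacture wins.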

What was tested and why:

  • Datasets: FLORES and WMT24++ for general MT quality across many domains and languages.
  • Benchmarks: MMLU, HellaSwag, ARC, Winogrande translated into eight Eastern/Southern European languages.
  • Goals: 1) See if USI and T-RANK beat simpler baselines, 2) Check if better translations change benchmark scores for models, 3) Validate improvements with both metrics and LLM-as-a-judge.

Competitors:

  • Baselines from Global-MMLU, Okapi, and MuBench (plus earlier human/professional translations in some cases).
  • Classic pipelines and older LLMs used in prior work.

Scoreboard highlights (contextualized):

  • On WMT24++ and FLORES (EN→UK with GPT-4o-mini), USI and T-RANK outperform SC and Best-of-N in most settings. On FLORES, USI reached the top COMET among the tested methods, showing strong fusion benefits for short/medium texts.
  • LLM-as-a-judge comparisons for MMLU show significant wins for the new translations over Global-MMLU in Ukrainian, Romanian, and Lithuanian (e.g., thousands more wins than losses), indicating real quality gains.
  • Evaluating mid-sized models (Gemma 3, Qwen 3, Llama 3.1) on the improved benchmarks yields higher scores across languages and tasks. Think of it like moving from a noisy microphone to a clear one—the singer (model) didn’t change, but you can now hear the true performance.

Make the numbers meaningful:

  • A few COMET points often mean the difference between awkward and natural phrasing.
  • Average benchmark gains, for example, +3.42% on Winogrande and noticeable bumps on ARC, HellaSwag, and MMLU, are like raising a class average from a B- to a B+ because the test questions finally make sense the same way in every language.
  • Per-language average improvements span roughly +1.37% to +3.89%—evidence that languages with rich morphology benefit from structure-aware translation.

Surprising findings:

  • USI can edge out T-RANK on short, general-domain sentences (fusion shines on bite-sized text).
  • T-RANK tends to do better on complex benchmark items where reasoning can be derailed by grammar mismatches.
  • Even with rotation, a small residual positional bias remains, reminding us that judging isn’t perfect and multiple checks are helpful.
  • Some professionally translated sets still outperform automated ones in specific cases, showing that human expertise remains valuable—but the gap is closing fast.

Bottom line: Across metrics and judges, and across many languages, the pipeline’s methods deliver more faithful, natural translations that preserve test structure. That leads to fairer, more accurate multilingual model evaluation.

05Discussion & Limitations

Limitations:

  • LLM judges can be biased (e.g., positional bias); even with rotation, some preference may remain. Automated metrics like COMET, while strong, don’t perfectly match human judgment in all domains.
  • The framework currently applies a chosen method uniformly per entry; it doesn’t yet auto-detect difficulty to switch methods on the fly (e.g., SC → USI → T-RANK).
  • Many experiments use closed models; broader open-weight evaluations would clarify how much the methods help weaker translators.
  • Most testing focused on Eastern/Southern European languages; generalization to very low-resource or typologically distant languages needs more study.

Required resources:

  • Access to one or more LLMs for generation and judging, with enough quota to sample multiple candidates.
  • Storage/logging for candidates, ranks, and final outputs.
  • Optional: Reference sets (for COMET) and compute for metric evaluation.

When not to use:

  • Tiny, trivial strings where SC already produces near-perfect results and extra compute brings little gain.
  • Domains with strict style guides that forbid any fusion or rephrasing beyond literal matching without human oversight.
  • Extremely code-heavy content where the risk of accidental code translation is unacceptable unless prompts and guards are very strict.

Open questions:

  • Can we learn to predict translation difficulty and choose the cheapest effective method per input automatically?
  • How do different judges (open vs closed, multilingual strengths) affect rankings/fusion quality?
  • Can we design even better bias-reduction strategies beyond rotation?
  • What are the best domain-specific prompts for medicine, law, or programming to avoid term drift and code mistranslation?
  • How well do these methods scale to truly low-resource or morphologically extreme languages without parallel data?

06Conclusion & Future Work

Three-sentence summary: The paper introduces a configurable, automated pipeline that translates benchmarks and datasets while preserving question–answer structure and language nuances. It uses test-time compute methods—Self-Check, Best-of-N, USI fusion, and T-RANK multi-round ranking—to improve quality and reduce judge bias. Evaluations with COMET and LLM-as-a-judge, plus better downstream model scores, show the translations are more accurate and trustworthy than popular baselines.

Main achievement: Showing that translation quality—and especially preserving QA structure—significantly changes benchmark outcomes, and that USI/T-RANK provide a practical, scalable path to higher-quality multilingual resources.

Future directions: Adaptive method selection per input difficulty; stronger, possibly open-weight judges; richer bias controls; and expanding beyond Europe to truly low-resource languages. Integrating dedicated quality models and domain-specific prompt libraries could further raise consistency.

Why remember this: Good translations make fair tests. By turning translation into a smart competition-and-refinement process, this work upgrades the reliability of multilingual AI evaluation—so we can compare models across languages with confidence and include more communities in the story.

Practical Applications

  • Translate educational benchmarks into local languages while preserving fairness for students.
  • Localize internal QA datasets for multilingual product evaluation without manual overhead.
  • Audit and repair existing multilingual benchmarks that suffer from answer leakage or semantic drift.
  • Generate high-quality training data in mid-resource languages for instruction tuning and evaluation.
  • Translate government forms and tests with structure-preserving checks to avoid unintended hints.
  • Improve multilingual customer support knowledge bases by fusing the best candidate translations.
  • Assist medical or legal content localization with domain-aware terminology checks and fusion.
  • Support open-weight model developers by providing cleaner multilingual benchmarks for training and testing.
  • Automate translation quality checks using LLM-as-a-judge plus COMET for scalable QA.
  • Build adaptive translation workflows that choose SC, Best-of-N, USI, or T-RANK based on text difficulty.
Tags: machine translation · multilingual benchmarks · test-time compute scaling · Universal Self-Improvement · Translation Ranking · Best-of-N sampling · LLM-as-a-judge · COMET · quality estimation · answer leakage · semantic drift · prompt engineering · QA structure preservation · positional bias mitigation