
No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data

Intermediate
Dmitry Karpov · 2/4/2026
arXiv

Key Summary

  • The paper tries several different ways to translate five low-resource Turkic languages, instead of forcing one method to fit all.
  • For Bashkir and Kazakh, fine-tuning a multilingual translator (NLLB-200) with extra lightweight adapters (LoRA/DORA) on lots of synthetic data worked best.
  • For Chuvash, where the base translator had almost no practice, retrieval-augmented prompting with a large language model (DeepSeek-V3.2) worked best.
  • For Tatar, a strong zero-shot large model already did very well, and retrieval sometimes made it worse.
  • For Kyrgyz, zero-shot prompting (no extra training) matched or beat other tricks.
  • They built and released a big multi-language dataset (YaTURK-7lang) and model weights so others can reuse their work.
  • The key idea is to pick the right tool per language: fine-tune when there’s some data and similarity to related languages, retrieve examples when there’s almost none, and fall back to zero-shot when it’s already strong.
  • They carefully filtered synthetic data to avoid leaking test examples and showed that pseudolabeling can help in low-resource settings.
  • Trying to combine outputs (stacking) with automatic scores didn’t reliably beat the best single system, showing that evaluation for low-resource MT is tricky.

Why This Research Matters

Millions of people speak Bashkir, Kazakh, Kyrgyz, Tatar, and Chuvash, but they lack the giant datasets that power today’s translation apps. This work shows a practical way to bring high-quality translation to these communities by flexibly mixing fine-tuning, synthetic data, retrieval, and zero-shot methods. Better translation means more access to health, education, government services, and local news in people’s own languages. It helps preserve cultural identity, since stories and information can flow in and out of these languages more easily. The released dataset and model weights let local teams adapt and improve systems without starting from scratch. In emergencies or public announcements, clearer translations can literally improve safety and response times. Over time, this approach can scale to many other underserved languages worldwide.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re trying to help five friends read letters from faraway cousins, but each cousin writes in a language that barely has any textbooks or dictionaries. It’s hard to teach reading when there aren’t many examples to learn from.

🥬 The Concept (Low-Resource Language Translation): What it is: Translating languages that don’t have much training data for AI. How it works: 1) Gather all the real bilingual sentences you can find, 2) Create extra practice sentences (synthetic data) to fill the gaps, 3) Pick models and tricks that learn the most from very little data. Why it matters: Without these steps, the translator guesses too much and makes silly mistakes, like mixing words or inventing grammar.

🍞 Anchor: If you only have a tiny phrasebook for Bashkir, your AI might misunderstand common words; with smart extra practice, it starts getting them right.

The World Before: Big languages like English, Russian, and French have millions of sentence pairs to train on, so machine translation (MT) works quite well. But many languages—like Bashkir, Kazakh, Kyrgyz, Tatar, and Chuvash—don’t have big, tidy datasets. A popular model called NLLB-200 could handle many languages, but its coverage and practice on some of these Turkic languages were uneven, especially for Chuvash.

🍞 Hook: You know how you learn best when a teacher shows similar examples to the problem you’re solving? That’s because examples give you a pattern to follow.

🥬 The Concept (NLLB-200-Distilled-600M): What it is: A compact, multilingual translation model trained on many languages. How it works: 1) It reads the source sentence, 2) converts it into a shared language space, 3) generates the target sentence using patterns it learned across languages. Why it matters: When a language has some related cousins in the model, it can borrow helpful patterns—like shared words or similar grammar.

🍞 Anchor: Because Kazakh and Bashkir are related and present in NLLB’s training, the model can generalize better across them than for a language like Chuvash with less prior data.

The Problem: The competition asked for translations in five language pairs where data is scarce. The team found that: 1) some languages had modest amounts of parallel text; 2) others, like Chuvash, had very little in the base model’s pretraining; and 3) zero-shot (no extra training) results varied a lot across languages.

Failed Attempts and Roadblocks: A single strategy didn’t win everywhere. Plain fine-tuning helped Bashkir and Kazakh but not Chuvash. Retrieval-augmented prompting helped Chuvash but hurt Tatar and Kyrgyz in some settings. Combining outputs (stacking) using automatic scores sometimes made things worse. They also couldn’t use some promising datasets that weren’t available at the time, and compute limits prevented longer, fancier training.

🍞 Hook: Think of synthetic data like practice worksheets a teacher makes when there aren’t enough problems in the textbook.

🥬 The Concept (Synthetic Data Generation): What it is: Automatically created translations used as extra training examples. How it works: 1) Use a strong online translator (Yandex.Translate), 2) translate source sentences into Russian when needed, 3) then into the target Turkic language, 4) filter anything that looks like test set overlap. Why it matters: Without synthetic data, the model can’t see enough examples to learn reliable patterns.

🍞 Anchor: When Bashkir and Kazakh didn’t have enough examples, synthetic data boosted the training set into the millions of pairs, and quality jumped.
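The leakage check described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the helper names (`normalize`, `filter_leaked`) and the normalization rule are our assumptions.

```python
# Sketch: dropping synthetic pairs whose source overlaps the test set,
# to avoid train/test leakage. Names and normalization are illustrative.

def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so near-duplicates still match."""
    return " ".join(s.lower().split())

def filter_leaked(synthetic_pairs, test_sources):
    """Keep only synthetic pairs whose source is absent from the test set."""
    blocked = {normalize(src) for src in test_sources}
    return [(src, tgt) for src, tgt in synthetic_pairs
            if normalize(src) not in blocked]

synthetic = [("Какая погода?", "Ауа райы қандай?"),
             ("Добрый день", "Қайырлы күн")]
test_srcs = ["какая  погода?"]  # overlaps the first pair after normalization
clean = filter_leaked(synthetic, test_srcs)
```

A real pipeline would likely also catch near-duplicates (e.g. by fuzzy matching), but exact normalized overlap is the core safeguard.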

The Gap: People needed a recipe that picks the right method for each language’s reality: sometimes fine-tune, sometimes retrieve examples for prompting, sometimes trust a strong zero-shot model. There was no one-size-fits-all.

Real Stakes: Better translations mean kids can read stories in their language, patients can understand medical guidance, farmers can access weather alerts, and families can chat across language lines. For these communities, every accuracy point is a real difference in daily life.

🍞 Hook: If a classmate who’s slightly better at math shows you their method, you might pick it up faster.

🥬 The Concept (Data Pseudolabeling): What it is: Using a strong model’s guesses as labels to train another model. How it works: 1) A teacher model translates unlabeled sentences, 2) we treat those outputs as training labels, 3) a student model learns from them. Why it matters: Without pseudolabels, we ignore lots of helpful unlabeled text that could teach structure and vocabulary.

🍞 Anchor: Using Yandex translations as training labels helped the fine-tuned model learn smoother Bashkir and Kazakh patterns.

🍞 Hook: When your memory is fuzzy, seeing a similar example can spark the right answer.

🥬 The Concept (ANNOY-Based Indexing and Retrieval-Augmented Prompting): What it is: A fast way to find similar example sentences and show them to a big language model before it translates. How it works: 1) Turn every sentence into a vector, 2) use ANNOY to quickly find the closest vectors for a new sentence, 3) paste those pairs into the prompt, 4) ask the LLM to translate using those hints. Why it matters: Without examples, an LLM might forget rare words or local grammar; with examples, it copies the right style.

🍞 Anchor: For Chuvash, feeding thousands of similar pairs into DeepSeek-V3.2 let it mimic the right patterns and jump far ahead of zero-shot.

02Core Idea

🍞 Hook: You know how a toolbox has a hammer, a screwdriver, and a wrench—because not every problem is a nail? Translation for low-resource languages is the same way.

🥬 The Concept (Core Idea): What it is: Choose the right method per language—fine-tune with lightweight adapters and synthetic data when possible, or use retrieval-augmented prompting when pretraining is weak—and don’t be afraid to go zero-shot if it’s already strong. How it works: 1) Gather original parallel data and generate synthetic pairs, 2) for languages with some coverage (Bashkir, Kazakh) fine-tune NLLB-200 with LoRA/DORA adapters, 3) for languages with little coverage (Chuvash), build a fast index to fetch similar examples and prompt a strong LLM, 4) for cases like Kyrgyz (and sometimes Tatar), use or compare with zero-shot LLMs. Why it matters: A single fixed plan wastes effort and misses accuracy; a flexible plan squeezes the most quality out of scarce data.

🍞 Anchor: Bashkir and Kazakh improved most with fine-tuning; Chuvash leap-frogged with retrieval + prompting; Kyrgyz did best as zero-shot.

Multiple Analogies:

  1. Sports team: Some games need speed (prompting), some need strength (fine-tuning), and sometimes your goalie (zero-shot) is already doing great—so you let them play as-is.
  2. Cooking: When there’s no fresh ingredient (data), you use substitutes (synthetic data). If the oven (pretrained model) already runs hot for a dish (language), you just time it right (light adapters). If not, you follow a similar recipe (retrieved examples) while you cook.
  3. Maps: If roads are clear (good pretraining), you cruise straight (zero-shot). If roads are bumpy but passable (some data), you tune your suspension (LoRA) and keep going. If the area isn’t mapped (very low-resource), you follow landmarks (retrieval examples) to find your way.

Before vs After:

  • Before: People often tried one-fits-all—either pure fine-tuning or pure prompting—and hoped it worked for every language.
  • After: We match strategy to language reality: fine-tune where cousins and data help; retrieve examples for truly data-starved cases; trust zero-shot when it’s already excellent.

Why It Works (Intuition):

  • Similar languages share grammar and words, so multi-task fine-tuning transfers knowledge: learning Bashkir patterns can help Kazakh.
  • Synthetic data enlarges the playground, giving the model more chances to see rare words and constructions.
  • Retrieval-augmented prompts hand the LLM concrete, local patterns right when it needs them, which compensates for weak pretraining.
  • Zero-shot is strong when a frontier model has already learned enough patterns across many languages.

Building Blocks:

  • Data assembly: collect and clean all available pairs; add synthetic data carefully with leakage checks.
  • Lightweight adaptation: LoRA/DORA adapters on NLLB-200 reuse most knowledge while nudging the model toward each target language.
  • Retrieval engine: ANNOY index over sentence embeddings to grab the closest examples fast.
  • Prompt design: a strict, short instruction and many examples to guide the LLM for languages like Chuvash.
  • Evaluation: use chrF++ to judge character-level overlap and fluency; compare to zero-shot baselines; try stacking cautiously.

🍞 Anchor: Think of a coach who picks training drills per player: sprinters get sprints, jumpers get jumps—and the whole team wins more often.

03Methodology

At a high level: Input sentence → Decide strategy per language → (A) Fine-tune NLLB-200 with LoRA/DORA and generate → or (B) Retrieve examples with ANNOY and prompt an LLM → Output translation.

Step A: Fine-tuning the Translator (Best for Bashkir, Kazakh) 🍞 Hook: Imagine adding snap-on parts to a bike so it rides better on your local roads without rebuilding the whole bike.

🥬 The Concept (LoRA/DORA Adapters on NLLB-200): What it is: Small plug-in layers that gently steer a big translator toward your target language. How it works: 1) Start with NLLB-200-distilled-600M, 2) add a few special adapter layers (LoRA/DORA) to key parts (like attention and feed-forward), 3) train these small parts on your data while keeping the rest mostly fixed, 4) generate translations with beam search. Why it matters: Full fine-tuning is heavy and risky; small adapters learn fast from limited data and share knowledge across related languages.

🍞 Anchor: With adapters trained on combined Turkic data, Bashkir and Kazakh scores rose into the high 40s on chrF++.
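The adapter math behind LoRA can be shown without any framework. This toy sketch computes h = Wx + (alpha/r)·BAx with tiny hand-written matrices; shapes, values, and function names are illustrative, not the paper's configuration.

```python
# Minimal LoRA forward pass in pure Python: the frozen weight W is kept,
# and a small trainable low-rank pair (A, B) adds a scaled correction.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # low-rank trainable path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight (identity, for clarity)
A = [[0.1, 0.0], [0.0, 0.1]]  # r x d_in projection (r = 2)
B = [[0.0, 0.0], [0.0, 0.0]]  # d_out x r, zero-initialized as in LoRA
x = [1.0, 2.0]

# Because B starts at zero, the adapted model initially matches the base.
out = lora_forward(W, A, B, x)
```

Only A and B are trained, which is why adapters are cheap: for a 600M-parameter model, they touch a tiny fraction of the weights.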

Recipe Details:

  1. Build the training mix: Combine all real parallel data with synthetic translations produced by Yandex.Translate; filter anything overlapping with tests; add language-pair prefix tokens (like <prefix_rus_bash>) so the model knows the direction.
  2. Two modes:
    • Multi-task then adapters: First train across all languages together for 1 epoch; then train LoRA/DORA adapters per language. This encouraged knowledge transfer between similar languages.
    • Single-language fine-tune: Train only on one pair. This was simpler but usually weaker than multi-task+adapters.
  3. Training settings (simplified): sequence length ~128 tokens; AdamW optimizer (8-bit variant to save memory); learning rates around 2e-4 to 5e-4; a couple of epochs; beam search (5 beams) at inference; repetition penalty to avoid loops.
  4. Why each step exists:
    • Prefix tokens: Without them, the model can mix up languages and produce the wrong script.
    • Multi-task phase: Without cross-language practice, you miss helpful similarities (like shared Turkic morphology).
    • Adapters: Without adapters, you either overfit with full fine-tuning or don’t adapt enough.

Example: A Russian→Kazakh sentence about weather appears rarely in real data. Synthetic pairs add more weather phrases; multi-task training borrows structure from Bashkir; the adapter locks in Kazakh-specific endings so the final sentence sounds right.
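The prefix-token step from the recipe above is simple to sketch. The token format follows the paper's `<prefix_rus_bash>` example; the helper name and data are ours.

```python
# Sketch: tagging each training pair with a direction prefix so the
# multi-task model knows which target language to produce.

def with_prefix(pair, src_lang, tgt_lang):
    src, tgt = pair
    token = f"<prefix_{src_lang}_{tgt_lang}>"
    return (f"{token} {src}", tgt)

batch = [("Сегодня холодно", "Бөгөн һыуыҡ")]
tagged = [with_prefix(p, "rus", "bash") for p in batch]
```

Without such tags, a model trained on several Turkic pairs at once could emit the right meaning in the wrong language or script.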

Secret Sauce A: Combine multi-task knowledge transfer with synthetic data and small adapters so each language gets just enough personalized attention without losing shared skills.

Step B: Retrieval-Augmented Prompting (Best for Chuvash; mixed for Tatar/Kyrgyz) 🍞 Hook: When you can’t remember how to spell a tricky word, seeing similar words jogs your memory.

🥬 The Concept (ANNOY Retrieval + LLM Prompting): What it is: Fetch similar source sentences and their translations from a big index, then show them to a large model so it can imitate the right patterns. How it works: 1) Encode every source sentence with a sentence-embedding model (e.g., gte-small or MiniLM), 2) build an ANNOY index using cosine similarity, 3) for a new sentence, pull the top-N most similar examples, 4) place them into a strict prompt that says “Return only the translation,” 5) generate with a capable LLM (like DeepSeek-V3.2). Why it matters: If the base translator never learned Chuvash well, concrete examples guide the LLM to the right vocabulary and grammar at the moment of need.

🍞 Anchor: For English→Chuvash, feeding thousands of nearest examples to DeepSeek-V3.2 raised chrF++ to about 39.5 on the test set.

Recipe Details:

  1. Build the index: Use embedding dimension ~384; cosine similarity; ~100 trees in ANNOY. For English↔Chuvash, use gte-small; for others, MiniLM multilingual.
  2. Choose TOP_N: For Chuvash, a very large TOP_N (around 7000) worked; search_k scaled with trees and TOP_N to improve recall.
  3. Prompt pattern: “Translate the following phrase into target_lang. RETURN ONLY TRANSLATION… Here are similar examples: src1->tgt1 … Translation into target_lang:”. For zero-shot, skip the examples section.
  4. Models and settings: DeepSeek-V3.2 (reasoning mode, temp ~0.7), DeepSeek-R1/N1, MiMo-V2, Gemma3 (temp 0). Very long prompts were truncated as needed.
  5. Why each step exists:
    • Embeddings + ANNOY: Without fast, decent similarity search, examples won’t be close enough to help.
    • Strict prompt: Without it, LLMs might chat instead of translating.
    • Large TOP_N (for Chuvash): Without enough examples, the model misses rare words and orthography.

Example: If the query sentence mentions a local festival name rarely seen in training, the index likely retrieves a past sentence with the same name and its correct translation, which the LLM then copies appropriately.
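The retrieve-then-prompt loop can be sketched end to end. For illustration we use brute-force cosine similarity over toy vectors in place of an ANNOY index over gte-small/MiniLM embeddings; all function names and the exact prompt wording (beyond the "RETURN ONLY TRANSLATION" instruction quoted above) are our assumptions.

```python
# Sketch of retrieval-augmented prompting: find the nearest stored pairs,
# then paste them into a strict translation prompt for an LLM.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, index, top_n=2):
    """index: list of (embedding, src_sentence, tgt_sentence) triples."""
    ranked = sorted(index, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    return [(src, tgt) for _, src, tgt in ranked[:top_n]]

def build_prompt(query, examples, target_lang="Chuvash"):
    lines = [f"Translate the following phrase into {target_lang}. "
             "RETURN ONLY TRANSLATION.",
             "Here are similar examples:"]
    lines += [f"{s} -> {t}" for s, t in examples]
    lines += [f"Phrase: {query}", f"Translation into {target_lang}:"]
    return "\n".join(lines)

index = [([1.0, 0.0], "hello", "салам"),
         ([0.0, 1.0], "goodbye", "сывă пул")]
examples = retrieve([0.9, 0.1], index, top_n=1)
prompt = build_prompt("hello there", examples)
```

At the paper's scale (TOP_N around 7000 for Chuvash), an approximate index like ANNOY replaces the brute-force sort, trading a little recall for large speedups.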

Secret Sauce B: Dial the example count and embedding choice per language; Chuvash needed lots of context, while Tatar often did better with fewer or even no examples when the zero-shot model was already strong.

Step C: Stacking and Selection (Tried, not dominant) 🍞 Hook: If three friends suggest different wordings for a sentence, picking the “best” one automatically isn’t always simple.

🥬 The Concept (Stacking with Semantic Similarity or Perplexity): What it is: Given multiple candidate translations, pick the one that “looks best” by an automatic score. How it works: 1) Compute similarity between source and candidates using a bilingual encoder (e.g., LaBSE) or check how probable a candidate is under a language model (perplexity), 2) choose the top-scoring candidate. Why it matters: If it worked consistently, you’d always get the best of all worlds without manual review.

🍞 Anchor: In practice, LaBSE selection slightly hurt Kazakh/Kyrgyz validation scores, and perplexity selection did not clearly help Tatar, showing that automatic pickers aren’t yet reliable here.
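The stacking step reduces to "score each candidate, keep the argmax." A minimal sketch, with a toy length-ratio scorer standing in for LaBSE similarity or LM perplexity (both names appear above; the scorer here is purely illustrative):

```python
# Sketch of stacking by automatic selection: given several system outputs,
# pick the one the scorer likes best. The scorer is a toy stand-in.

def toy_score(source: str, candidate: str) -> float:
    """Higher when candidate length is close to the source's (toy proxy)."""
    ratio = len(candidate) / max(len(source), 1)
    return -abs(1.0 - ratio)

def select_best(source, candidates, score=toy_score):
    return max(candidates, key=lambda c: score(source, c))

cands = ["ok", "a candidate near source length", "x" * 200]
best = select_best("a sentence of medium length!", cands)
```

The paper's finding is exactly that this argmax is only as good as its scorer: a weak or mismatched scorer can pick a worse candidate than the single best system would have produced.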

Putting It Together:

  • Decide per language:
    • Bashkir/Kazakh: multi-task fine-tuning + LoRA/DORA adapters on a mix of real + synthetic data.
    • Chuvash: retrieval-augmented prompting with many nearest examples into a strong LLM.
    • Tatar/Kyrgyz: compare zero-shot vs. retrieval; keep zero-shot when it wins.
  • Safeguards: filter any synthetic pair overlapping with test sets; keep prompts strict; avoid over-long contexts when they hurt.
  • Output: translate with beams or LLM decoding; evaluate with chrF++.
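The per-language decision logic above can be written as a tiny dispatch function. The mapping mirrors the paper's findings; the function name and return labels are illustrative.

```python
# Sketch: choose a translation strategy per language, following the
# paper's "no one-size-fits-all" playbook.

def choose_strategy(language: str) -> str:
    finetune = {"bashkir", "kazakh"}   # some coverage + related cousins
    retrieval = {"chuvash"}            # almost no pretraining coverage
    lang = language.lower()
    if lang in finetune:
        return "finetune_lora"
    if lang in retrieval:
        return "retrieval_prompting"
    return "zero_shot"                 # e.g. Kyrgyz, and often Tatar
```

In practice the "decision" was made empirically per language on validation scores, not by a fixed rule, so new languages would need the same comparison pass.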

04Experiments & Results

🍞 Hook: When you take a quiz, your percentage score means little unless you know how others did. A 78% could be amazing if everyone else got 60%.

🥬 The Concept (chrF++ Metric): What it is: A translation score that checks how many character chunks and word pieces match a reference translation. How it works: 1) Split text into small character and word fragments, 2) compare overlaps between the model output and the gold translation, 3) combine these overlaps into a single score. Why it matters: Without a fair yardstick, we can’t tell if a change is truly better or just different.

🍞 Anchor: Saying “chrF++ 49.7” is like saying “an A grade when the class average is much lower”—it’s strong performance for low-resource MT.
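For intuition, here is a deliberately simplified character-n-gram F-score. Real chrF++ also mixes in word n-grams, averages over several n-gram orders, and weights recall more heavily (beta = 2); this toy version uses character bigrams and balanced F1 only.

```python
# Toy character-bigram F-score in the spirit of chrF++ (simplified:
# no word n-grams, single n-gram order, balanced F1 instead of F2).
from collections import Counter

def char_ngrams(text: str, n: int = 2) -> Counter:
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def toy_chrf(hypothesis: str, reference: str, n: int = 2) -> float:
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    if not hyp or not ref:
        return 0.0
    overlap = sum((hyp & ref).values())   # clipped n-gram matches
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Character-level matching is why chrF++ suits morphologically rich Turkic languages: a translation with the right stem but a slightly wrong suffix still earns partial credit.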

The Test: The team reported official validation/test scores from a competition server. They trained with original plus synthetic data (carefully filtered) for fine-tuning, and built retrieval indexes for prompting trials. They evaluated different strategies per language: fine-tuning with adapters, zero-shot prompting, retrieval-augmented prompting, and stacking.

The Competition: Baselines included multiple strong LLMs: DeepSeek-R1, DeepSeek-V3.2, MiMo-V2, and Gemma3. They also compared single-language fine-tuning, multi-task + adapters, and a stacking approach.

The Scoreboard (with context):

  • Kazakh (Russian→Kazakh): LoRA fine-tuning on synthetic+original data achieved around 49.9 chrF++ on validation and 49.7 on test—like getting an A in a tough class.
  • Bashkir (Russian→Bashkir): Fine-tuning scored around 46.9 on test—solid, boosted by synthetic data and knowledge transfer.
  • Chuvash (English→Chuvash): Zero-shot was weak, but retrieval-augmented prompting with DeepSeek-V3.2 reached about 39.5 on test—a big jump given the model’s poor pretraining coverage.
  • Tatar (English→Tatar): A strong zero-shot from DeepSeek-V3.2 hit about 43.7 on validation, and final submissions scored about 41.6 on test. Adding too many examples sometimes lowered performance, likely due to noise or distraction.
  • Kyrgyz (Russian→Kyrgyz): Zero-shot MiMo-V2 did best (~46.6 on validation; ~45.6 on test), and retrieval didn’t reliably help—showing that when zero-shot is strong, extra hints can sometimes confuse.

Surprising Findings:

  • More isn’t always better: Very large context windows that helped Chuvash sometimes hurt Bashkir/Kazakh, dropping scores notably.
  • Stacking didn’t shine: Picking among multiple system outputs using LaBSE similarity or perplexity sometimes performed worse than the best single system.
  • Pretraining coverage matters: NLLB had little Chuvash exposure, making fine-tuning underperform there but leaving room for retrieval + LLM prompting to succeed.

Why Results Differed by Language:

  • Shared roots: Bashkir and Kazakh benefited from multi-task training and synthetic data because their structures overlap and the base model had some prior.
  • Sparse exposure: Chuvash lacked pretraining, so explicit examples in prompts gave the LLM the patterns it was missing.
  • Already strong zero-shot: For Kyrgyz (and often Tatar), a top LLM handled many patterns out of the box; extra examples sometimes added noise.

Takeaway Numbers:

  • Kazakh: ~49.7 chrF++ (LoRA fine-tuning; released weights)
  • Bashkir: ~46.9 chrF++ (fine-tuning; released weights)
  • Chuvash: ~39.5 chrF++ (retrieval + DeepSeek-V3.2)
  • Tatar: ~41.6 chrF++ (best submission among zero-shot/retrieval)
  • Kyrgyz: ~45.6 chrF++ (zero-shot best)

05Discussion & Limitations

Limitations:

  • Compute budget: Heavier or longer fine-tuning (or larger adapters) might have boosted results further but wasn’t feasible.
  • Synthetic noise: Automatic translations can carry errors or stylistic oddities that the model may learn.
  • Coverage gaps: Languages not well represented in pretraining (like Chuvash) remain hard for classic fine-tuning.
  • Retrieval tuning: The best TOP_N and embedding model differ per language; mis-tuning can hurt performance.
  • Evaluation fragility: Automatic pickers (stacking by similarity or perplexity) were unreliable; even chrF++ can miss nuances like dialect or register.

Required Resources:

  • For fine-tuning: A GPU capable of training 600M-parameter models with 8-bit optimizers; storage for multi-million-pair datasets.
  • For prompting: Access to strong LLM APIs; building and serving ANNOY indexes; careful prompt construction and truncation control.

When NOT to Use:

  • Don’t rely on retrieval-augmented prompting when your zero-shot baseline is already strong and examples are noisy—it can distract the model.
  • Don’t fine-tune from scratch when you lack even minimal synthetic/real pairs—you might overfit or learn artifacts.
  • Don’t stack with naive automatic pickers unless validated per language; it may degrade quality.

Open Questions:

  • Can better, language-aware retrieval (e.g., bilingual embeddings tuned for each pair) beat current ANNOY settings?
  • Would small, in-language pretraining (e.g., monolingual LMs for Chuvash) plus adapters outperform giant zero-shot LLMs?
  • Can more robust quality estimators or human-in-the-loop selection make stacking helpful?
  • How far can improved synthetic pipelines (back-translation, multi-teacher ensembles) push low-resource MT without human labels?

06Conclusion & Future Work

3-Sentence Summary: This work shows that there is no single best method for low-resource Turkic MT—each language benefits from a different mix of fine-tuning, retrieval-augmented prompting, and zero-shot translation. For Bashkir and Kazakh, multi-task fine-tuning with LoRA/DORA on synthetic+original data delivered top scores, while for Chuvash, large-context retrieval plus a strong LLM worked best; Tatar and Kyrgyz often favored zero-shot. The team also shares a large dataset and model weights to help others build on these results.

Main Achievement: A practical, language-by-language playbook that picks the right approach—fine-tune, retrieve, or zero-shot—demonstrating clear wins in multiple pairs and releasing the resources for the community.

Future Directions: Explore language-specific pretraining for the most under-resourced languages; design smarter retrieval tuned per pair; strengthen synthetic pipelines with back-translation and multi-teacher ensembles; develop better automatic selection and evaluation methods for stacking. Also investigate compact, locally deployable models adapted with adapters for real-time on-device translation.

Why Remember This: It replaces the myth of “one-size-fits-all” with a proven recipe: adapt the method to the language. With careful data creation, lightweight adapters, and retrieval when needed, we can unlock useful translation for communities that tech often overlooks.

Practical Applications

  • Build a Bashkir or Kazakh translator by fine-tuning NLLB-200 with LoRA/DORA on mixed real + synthetic data.
  • For Chuvash (or similarly under-represented languages), deploy retrieval-augmented prompting using ANNOY and a strong LLM.
  • When a zero-shot LLM is already strong (e.g., Kyrgyz), prefer zero-shot to avoid noise from large example contexts.
  • Use strict prompts (“return only translation”) and filter test overlaps from synthetic data to prevent leakage.
  • Tune TOP_N and embedding models per language; large TOP_N may help very low-resource cases but hurt stronger ones.
  • Evaluate with chrF++ and sanity-check with native speakers to catch errors not visible to automatic metrics.
  • Release adapters and datasets to enable local teams to iterate quickly without heavy compute.
  • Try multi-task pre-adaptation across related languages before training per-language adapters.
  • Avoid naive stacking; if combining systems, validate selection methods carefully on a development set.
  • Maintain language-specific lexicons (names, loanwords) and consider light post-editing rules to polish outputs.
Tags: low-resource machine translation · Turkic languages · NLLB-200 · LoRA · DORA · synthetic data · pseudolabeling · retrieval-augmented prompting · ANNOY index · sentence embeddings · zero-shot translation · chrF++ · knowledge transfer · multilingual fine-tuning · DeepSeek-V3.2
