Arcee Trinity Large Technical Report
Key Summary
- Trinity is a family of open language models that are huge on the inside but only wake up a few 'experts' for each word, so they are fast and affordable to run.
- The biggest model, Trinity Large, has 400 billion total parameters but uses only about 13 billion per token thanks to a sparse Mixture-of-Experts design.
- A new balancing trick called SMEBU gently and steadily evens out how much work each expert does, keeping training stable with no scary loss spikes.
- The model looks at text both up close and far away using interleaved local and global attention, helping it handle super long documents efficiently.
- Gated attention acts like a volume knob that turns down noisy parts and turns up useful parts, which improves long-context understanding and stability.
- A custom 200k-token tokenizer and special number-splitting help the model compress text better and do math more reliably.
- A clever data loader called RSDB shuffles long documents in a fair way, lowering step-to-step noise and preventing instability during training.
- Trinity Large learned from 17 trillion tokens and achieved strong scores on coding, math, knowledge, and reasoning benchmarks.
- With FP8 inference and the architecture choices, Trinity Large delivers high throughput on modern GPUs while keeping quality high.
- All models are open-weight, so organizations can audit, host, and adapt them with clear data and control.
Why This Research Matters
Trinity shows how to build very large, open models that are fast and affordable to run, even on huge documents. This makes it practical to deploy AI for real workflows like legal review, software engineering, and scientific research without sky-high costs. The new SMEBU method keeps expert usage balanced, so training stays stable and predictable, saving time and compute. Interleaved attention and smart tokenization improve both speed and understanding, especially for long contexts and math. Open weights mean organizations can host the models privately, audit data, and meet compliance needs. The approach is a blueprint other teams can reuse to push sparsity and long-context capability further. In short, it helps turn frontier-scale AI from a lab demo into a dependable everyday tool.
Detailed Explanation
01 Background & Problem Definition
You know how a school has many teachers, and each one is best at a different subject? If every teacher had to answer every question from every student all day, the school would be slow and crowded. But if the math teacher answers math questions and the art teacher answers art questions, things go faster and students learn more. That is the big idea behind this research.
🍞 Hook: You know how you only ask the right expert for help—like a nurse for health questions and a librarian for book ideas—so you get fast, accurate help? 🥬 The Concept (Sparse Mixture-of-Experts): It’s a model where many small specialists (experts) exist, but only a few are used for each word. How it works: (1) The model reads a token, (2) a router picks a few best-fit experts, (3) only those experts process the token, (4) their answers are mixed into one result. Why it matters: Using all experts all the time is too slow and expensive; picking only a few keeps speed high without losing brainpower. 🍞 Anchor: When the model sees code, it sends the token to code-savvy experts; for history, it calls history-savvy experts.
The world before: Big language models were mostly “dense,” meaning every part of the model worked on every word. That made them powerful but costly and slow, especially when reading or writing very long texts. People started using Mixture-of-Experts (MoE) to get the best of both worlds—giant total capacity with only a small active part per token. At the same time, users wanted models to be fast at inference, understand super long contexts, follow instructions, write and fix code, and run inside companies with clear data, licenses, and locations.
The problem: Sparse MoE models can be touchy to train. If the router keeps sending too many tokens to just a few experts, those experts get overloaded and others get ignored. This can cause “collapsed experts,” unstable training, and flatlined learning. Also, reading hundreds of thousands of tokens is expensive unless the attention system is very efficient.
Failed attempts: A common MoE balancing method updates each expert’s bias in jumpy, yes/no steps (like cranking a knob all the way left or right). It helps for a while but often can’t settle perfectly, which can make training wobble and even fail later. On the attention side, pure global attention everywhere is accurate but slow and memory-heavy for long texts, while purely local attention can miss far-away connections.
The gap: The field needed (1) a smoother, steadier way to balance experts so training stays calm, and (2) an attention setup that mixes local detail with global context efficiently. It also needed better data handling so long documents don’t create lumpy, noisy training steps.
🍞 Hook: Imagine reading a comic book sometimes one panel at a time (local) and sometimes the whole page (global) to understand the story. 🥬 The Concept (Interleaved Local and Global Attention): The model alternates layers that look nearby (local) and layers that can look anywhere (global). How it works: (1) Three local layers scan nearby tokens with a sliding window, (2) one global layer takes a big-picture view without position limits, (3) repeat. Why it matters: This keeps long-context processing fast while still catching far-away clues. 🍞 Anchor: When summarizing a textbook chapter, local layers notice sentence details, and global layers connect ideas across many pages.
🍞 Hook: Think of a volume knob that turns down background chatter and turns up the helpful voice. 🥬 The Concept (Gated Attention): The attention output is multiplied by a learned gate that scales signal strength. How it works: (1) The model computes attention, (2) a gate computes importance scores, (3) attention is elementwise scaled by the gate, (4) the result is projected. Why it matters: Without gating, models can over-focus on unhelpful tokens (attention sinks), causing instability and weaker long-context skills. 🍞 Anchor: When asked “What’s the proof idea?”, the gate turns up key steps and turns down repeated filler words.
Real stakes: Faster, cheaper models that stay smart over long documents unlock practical uses—like legal reviews, giant codebase understanding, long research papers, and always-on agents. Open weights let companies run models privately, audit data, and meet rules. With better stability and throughput, you can serve more users and tasks without breaking the bank.
In short, the paper’s new steady expert-balancing method, plus the zoom-in/zoom-out attention and stability tricks, makes very large, open, efficient models that are actually trainable at scale and fast at inference.
02 Core Idea
Aha! If you gently guide which experts get used (smooth, bounded updates with a little memory of the past) and combine near-sight and far-sight attention, you can train a gigantic, sparse model that stays stable and runs fast.
Three analogies:
- School counselor: Students (tokens) visit just the right counselors (experts). A calm appointment system (SMEBU) evens the line so no counselor is swamped.
- City traffic: Green lights (router) let cars flow to the right roads (experts). SMEBU is like smart lights that adjust smoothly, not jerkily, preventing jams.
- Orchestra: Local attention is like section rehearsals; global attention is the full symphony. The gate is the conductor’s hand, turning some sections up or down.
Before vs after:
- Before: Expert balancing nudges were on/off and jumpy, often leading to oscillations and collapses later in training. Pure global attention was powerful but heavy for long contexts; pure local missed distant links.
- After: SMEBU applies soft-clamped, momentum-smoothed updates that converge calmly. Interleaving local and global layers cuts compute while preserving far-reaching understanding. Gating squashes attention sinks. Result: clean, stable loss curves and very long-context skills.
🍞 Hook: Imagine adjusting a faucet so the water flows just right—not off, not blasting. 🥬 The Concept (SMEBU): A soft-clamped momentum update that balances expert loads smoothly. How it works: (1) Measure how much each expert is over/under-used relative to average, (2) squash this signal with tanh to keep updates bounded, (3) center and add momentum to smooth noise, (4) update expert biases gradually. Why it matters: Without SMEBU, sign-only updates stay too coarse and can’t settle, causing late-stage instability. 🍞 Anchor: If one pizza chef is too busy and another is idle, SMEBU gently shifts orders over time so both work evenly, never overcorrecting.
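The four steps above can be sketched in a few lines. This is a minimal illustration of the described recipe, not the paper's exact formula: the hyperparameter names and values (`lr`, `beta`, `scale`) are assumptions for the sketch.

```python
import numpy as np

def smebu_update(bias, momentum, expert_counts, lr=0.01, beta=0.9, scale=1.0):
    """One SMEBU-style bias update (sketch). Over-used experts get their
    routing bias nudged down, under-used experts up -- gently."""
    mean = expert_counts.mean()
    # 1) signed load imbalance relative to the average load
    error = (mean - expert_counts) / (mean + 1e-9)
    # 2) soft-clamp with tanh so no single step can overshoot
    signal = np.tanh(scale * error)
    # 3) re-center, then smooth with momentum to average out step-to-step noise
    signal = signal - signal.mean()
    momentum = beta * momentum + (1 - beta) * signal
    # 4) small, bounded bias update
    bias = bias + lr * momentum
    return bias, momentum
```

Because tanh bounds the signal and momentum averages out noise, no single step can swing an expert's bias very far, which is exactly the "gentle faucet" behavior described above, in contrast to jumpy sign-only updates.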
🍞 Hook: Picture a dimmer switch that can make lights a little brighter or dimmer, not only on or off. 🥬 The Concept (Sigmoid Routing): Uses sigmoid-based scores to choose top experts while gating outputs with pure router scores. How it works: (1) Compute a smooth score per expert with a sigmoid, (2) pick top-K experts using scores plus a bias, (3) gate outputs using normalized scores without bias, (4) update biases separately (with SMEBU). Why it matters: Softer scores mean stabler choices and less fragile routing. 🍞 Anchor: It’s like ranking helpers by gentle confidence scores, picking a few top helpers, and then weighting how much each helper’s advice counts.
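The key asymmetry, bias included when choosing experts, bias excluded when weighting their outputs, can be sketched like this (a minimal illustration, not the paper's exact implementation):

```python
import numpy as np

def route(logits, bias, k=4):
    """Sigmoid routing sketch: top-K selection uses score + bias (the
    balancing knob), but output gating uses the pure sigmoid scores,
    renormalized over the chosen experts."""
    scores = 1.0 / (1.0 + np.exp(-logits))     # smooth per-expert scores
    topk = np.argsort(scores + bias)[-k:]      # selection includes the bias
    gates = scores[topk] / scores[topk].sum()  # gating ignores the bias
    return topk, gates
```

Keeping the bias out of the gate means SMEBU can steer traffic without distorting how much each chosen expert's answer actually counts.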
🍞 Hook: You know how athletes use steady training plans to improve faster with fewer injuries? 🥬 The Concept (Muon Optimizer): A training rule that improves sample efficiency and supports larger batch sizes for hidden layers. How it works: (1) Apply Muon to hidden layers with a width-aware learning-rate adjustment, (2) keep AdamW for embeddings and output, (3) train longer with fewer hiccups. Why it matters: Without Muon, you may need smaller batches and more tokens to reach the same quality. 🍞 Anchor: Muon is a smarter workout coach that helps the model get more gains from the same practice time.
🍞 Hook: Think of putting a seatbelt on both before and after a roller-coaster loop for extra safety. 🥬 The Concept (Depth-Scaled Sandwich Norm): Normalize both before and after each sublayer, scaling the second norm by model depth. How it works: (1) RMSNorm the input to a block, (2) run the sublayer (attention or MoE), (3) RMSNorm the result with a gain that grows with depth, (4) add residual. Why it matters: Without it, activations can drift or spike, making deep models less stable. 🍞 Anchor: It keeps each block’s output in a comfy range so training remains smooth, even when stacks get very tall.
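The before-and-after normalization can be sketched as below; the exact depth-scaling rule is not spelled out here, so the `1 + 0.1 * layer_idx` gain is an illustrative assumption, not the paper's formula.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """RMSNorm: rescale so the root-mean-square of each vector is ~1."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sandwich_block(x, sublayer, layer_idx):
    """Depth-scaled sandwich-norm sketch: normalize the input, run the
    sublayer (attention or MoE), normalize the output with a gain that
    grows with depth, then add the residual."""
    h = sublayer(rmsnorm(x))                    # pre-norm + sublayer
    h = (1.0 + 0.1 * layer_idx) * rmsnorm(h)    # depth-scaled post-norm
    return x + h                                # residual add
```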
🍞 Hook: Imagine marking each word’s place by turning a tiny dial so the model remembers order. 🥬 The Concept (Rotary Positional Embeddings, RoPE): A way to encode token order by rotating query/key vectors. How it works: (1) Apply a rotation that depends on position, (2) attention sees relative positions naturally, (3) works especially well in local windows. Why it matters: Without position info, word order gets scrambled. 🍞 Anchor: In a recipe, “add eggs then sugar” is different from “sugar then eggs”—RoPE helps the model feel that difference.
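A numeric sketch of the rotation, following the common RoPE convention (pairing adjacent dimensions, base frequency 10000), which may differ in detail from Trinity's configuration:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary embedding sketch: rotate each pair of dimensions by an angle
    proportional to position, so attention dot products depend on relative
    offsets between tokens. x: (seq, dim), positions: (seq,)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)     # per-pair rotation speeds
    angles = positions[:, None] * freqs[None, :]  # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each step is a pure rotation, vector lengths are unchanged; only the angles, and hence relative-position information, are encoded.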
🍞 Hook: Imagine reading with a small moving window that slides over lines so you don’t waste time looking at the whole book at once. 🥬 The Concept (Sliding Window Attention): Each local attention layer only looks at nearby tokens. How it works: (1) Pick a window size, (2) limit keys/values to tokens inside the window, (3) slide the window forward as you read. Why it matters: Without it, memory and compute explode on very long texts. 🍞 Anchor: While scanning a long document, you focus on a few sentences at a time, then slide forward.
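The two layer types differ only in their attention mask, which can be sketched as boolean matrices (illustrative only; efficient kernels never materialize the full mask):

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Causal attention mask sketch: window=None gives a global layer,
    an integer gives a sliding-window local layer."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    mask = k <= q                    # causal: only look backwards
    if window is not None:
        mask &= (q - k) < window     # local: only the last `window` tokens
    return mask
```

A local layer's cost grows linearly with sequence length, while the occasional global layer keeps the long-range links the windows would miss.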
Put together, these pieces let Trinity Large be massively capable (400B parameters total) while activating only about 13B per token, staying stable in training, and performing strongly on long contexts and hard benchmarks.
03 Methodology
At a high level: Raw text → Tokenizer (200k BPE with smart number and script handling) → Transformer blocks with interleaved local/global attention, gated attention, and GQA/QK-Norm → Sparse MoE layers with sigmoid routing and SMEBU balancing → Language head → Output tokens. Training uses Muon (hidden layers) + AdamW (embeddings/output), a three-phase data mix, and a special data buffer (RSDB) to keep batches steady. Then context extension to 256K (up to 512K tested), light instruction tuning, and a short RL stage.
Step 1. Tokenization and pretokens
- What happens: A custom 200k-token BPE vocabulary is trained with careful pretokenization: split digit strings into place-aligned chunks (e.g., 1|234|567), separate scripts like CJK/Thai/Korean, handle punctuation well, and fall back to bytes for full coverage.
- Why: Better compression and arithmetic performance; fewer tokens to read makes everything faster.
- Example: “The price is 1234567!” becomes tokens like [“ The”, “ price”, “ is”, “ 1”, “234”, “567”, “!”]. The three-digit grouping helps the model learn place values.
🍞 Hook: Like sorting LEGO pieces by size and color before building so you can snap faster. 🥬 The Concept (Tokenizer with number-aware BPE): A system that breaks text into reusable pieces and handles long numbers carefully. How it works: (1) Isolate digits and split them into 3-digit groups, (2) keep scripts separate, (3) learn frequent merges, (4) use byte fallback when needed. Why it matters: Without it, the model wastes effort on clumsy splits and struggles with math. 🍞 Anchor: “20262026” won’t freeze the tokenizer; it gets chunked safely and quickly.
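The place-aligned three-digit grouping from the example can be sketched with a tiny pretokenizer. The `|` separator is just to show split points; a real tokenizer would emit separate pretokens instead.

```python
import re

def split_digits(text):
    """Split each digit run into place-aligned 3-digit groups, aligned from
    the right like thousands separators (sketch of the report's
    number-aware pretokenization)."""
    def group(m):
        s = m.group(0)
        head = len(s) % 3 or 3  # length of the leading (possibly short) group
        parts = [s[:head]] + [s[head + i:head + i + 3]
                              for i in range(0, len(s) - head, 3)]
        return "|".join(parts)
    return re.sub(r"\d+", group, text)

print(split_digits("The price is 1234567!"))  # The price is 1|234|567!
```

Right alignment matters: "1234567" splits as 1|234|567, matching how place value works, so the model sees consistent chunks for ones, thousands, and millions.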
Step 2. Attention core
- What happens: Layers are arranged 3 local (with RoPE and Sliding Window Attention) then 1 global (NoPE), repeating. Queries/keys are RMS-normalized (QK-Norm) for stability. Grouped-Query Attention (GQA) reduces KV cache by sharing keys/values across multiple query heads. Gated attention scales outputs to reduce attention sinks.
- Why: Long documents require efficiency (local windows), but also need far-away links (global). QK-Norm keeps logits tame; GQA speeds up inference.
- Example: Reading a 200K-token legal file, local layers scan nearby clauses; global layers connect references across sections; the gate turns down repetitive boilerplate.
🍞 Hook: Sharing binoculars in a group so not everyone needs their own pair. 🥬 The Concept (Grouped-Query Attention, GQA): Multiple query heads share the same key/value heads. How it works: (1) Fewer KV heads store context, (2) many query heads read from them, (3) this shrinks the KV cache. Why it matters: Without GQA, memory blows up at long context. 🍞 Anchor: Several readers consult the same index cards instead of each keeping identical stacks.
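The sharing pattern can be sketched in one line: the small set of KV heads is expanded so each group of query heads reads the same keys/values, while the KV cache only ever stores the small set.

```python
import numpy as np

def gqa_expand(kv, n_q_heads):
    """GQA sketch: kv has shape (n_kv_heads, seq, dim). Each KV head is
    shared by a consecutive group of query heads, so the cache holds
    n_kv_heads copies instead of n_q_heads."""
    n_kv_heads = kv.shape[0]
    group = n_q_heads // n_kv_heads
    return np.repeat(kv, group, axis=0)  # (n_q_heads, seq, dim) for attention
```

With, say, 8 query heads sharing 2 KV heads, the KV cache shrinks 4x, which is the difference between fitting a 200K-token context in memory or not.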
🍞 Hook: Like wiping fog off your glasses so you don’t squint. 🥬 The Concept (QK-Norm): Normalizes queries and keys before dot-product attention. How it works: (1) RMSNorm on Q and K, (2) compute attention, (3) safer logit ranges. Why it matters: Without it, attention scores can spike and destabilize training. 🍞 Anchor: It keeps attention scores in a friendly zone, avoiding blow-ups.
Step 3. Sparse Mixture-of-Experts with stable routing
- What happens: Each MoE layer has routed experts (only K activate per token) plus an always-on shared expert. Routing uses sigmoid scores; top-K is chosen using score+bias, but output gating uses pure scores. Expert biases are updated with SMEBU for smooth, bounded balancing. A tiny, per-sequence aux loss further evens loads.
- Why: This preserves huge capacity but keeps per-token compute small and stable.
- Example: In Trinity Large, 256 routed experts exist; only 4 activate per token, with fairly large experts to boost throughput.
Step 4. Data pipeline and Random Sequential Document Buffer (RSDB)
- What happens: Training uses three phases, shifting toward more code/math/STEM and higher-quality samples. To avoid unbalanced minibatches (especially with very long docs), the RSDB stores full tokenized documents and fills sequences by randomly sampling from different docs’ read heads.
- Why: Without RSDB, some steps get accidentally dominated by long, same-domain texts, increasing noise, gradient spikes, and instability.
- Example: With RSDB, Batch Heterogeneity (max microbatch loss minus mean) drops sharply; the loss line gets smoother.
🍞 Hook: Making a fruit salad by taking a few bites from many fruits instead of finishing one giant watermelon first. 🥬 The Concept (RSDB): A buffer that holds whole documents and fills training sequences by sampling small chunks from many docs at once. How it works: (1) Keep a pool of documents with read heads, (2) randomly sample docs to fill each sequence, (3) advance read heads, (4) refresh the pool in batches. Why it matters: Without it, batches can be lopsided and noisy. 🍞 Anchor: Every training step tastes like a balanced mix, not just a mouthful of only pineapple.
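The pool-of-read-heads idea can be sketched as a small class. Names, chunk size, and the refresh policy are illustrative assumptions; the report's buffer operates on tokenized documents at training scale.

```python
import random

class RSDB:
    """Random Sequential Document Buffer sketch: hold whole documents with
    per-document read heads, and fill each training sequence by sampling
    small chunks from many documents at once."""
    def __init__(self, docs, chunk=4, seed=0):
        self.pool = [(list(d), 0) for d in docs]  # (tokens, read head)
        self.chunk = chunk
        self.rng = random.Random(seed)

    def fill_sequence(self, seq_len):
        seq = []
        while len(seq) < seq_len and self.pool:
            i = self.rng.randrange(len(self.pool))   # sample a document
            tokens, head = self.pool[i]
            take = tokens[head:head + self.chunk]    # read from its head
            seq.extend(take)
            head += len(take)
            if head >= len(tokens):
                self.pool.pop(i)                     # document exhausted
            else:
                self.pool[i] = (tokens, head)
        return seq[:seq_len]
```

Because every sequence draws from many documents, no single long document can dominate a batch, which is what drives the Batch Heterogeneity drop described above.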
Step 5. Optimization and stability extras
- What happens: Hidden layers use Muon with width-aware LR adjustment; embeddings/output use AdamW. Depth-scaled sandwich norm stabilizes deep stacks. A small z-loss was added mid-run for logit control. Initial dense layers before MoE help early representations. Intra-document masking prevents cross-document attention in pretraining of Trinity Large.
- Why: These choices reduce loss spikes, logit drifts, and routing collapses.
- Example: Trinity Large’s loss curve is smooth, with zero spikes.
Step 6. Context extension
- What happens: Extend only global layers to long sequences (e.g., 256K) while keeping local window sizes fixed for efficiency. Train at or beyond the target context to improve results. Evaluate with MK-NIAH at target length.
- Why: Only growing global layers speeds up recovery, and training longer than target context improves performance.
- Example: Trinity Large scores ~0.994 at 256K MK-NIAH and even ~0.976 at 512K without training at 512K.
Step 7. Post-training (light SFT + short RL)
- What happens: Supervised fine-tuning on curated instructions (especially agentic coding traces), using cut cross-entropy to save memory; then a brief RL stage with verifiable tasks where possible.
- Why: Adds instruction following and multi-step coding behavior without huge extra compute.
- Example: The model practices “edit-run-test” coding loops, not just final-file outputs.
Secret sauce summary:
- SMEBU: smooth, momentum-based, bounded balancing prevents MoE collapses.
- Interleaved local/global attention: long-context efficiency without losing global links.
- Gated attention + QK-Norm + GQA: stability and memory savings.
- RSDB: fair, steady batches that reduce variance and spikes.
- Tokenizer with number-aware splitting: better compression and arithmetic.
04 Experiments & Results
What they measured and why:
- Capability: Coding (MBPP+), math (Minerva MATH500), knowledge (MMLU, MMLU-Pro, TriviaQA, ARC), commonsense (HellaSwag, WinoGrande), and reasoning (BBH, GPQA Diamond). These span the everyday skills users want: write code, solve math, recall facts, reason logically.
- Stability/efficiency: Smooth loss (no spikes), expert load balance (MaxVio staying low), and long-context retrieval (MK-NIAH). Inference throughput on FP8 vLLM.
Who they compared to:
- Prominent open-weight base models, especially GLM 4.5 Base, to see how a highly sparse MoE with fewer active parameters stacks up.
Scoreboard (Trinity Large Base):
- MBPP+: 88.62% (like acing almost 9 out of 10 small coding tasks).
- Minerva MATH500: 65.20% (solid middle-school-to-early-contest math problem solving for a base model).
- HellaSwag (5-shot): 90.11 (strong commonsense completion, like an A grade where many get B’s).
- WinoGrande (5-shot): 80.82 (resolving pronoun references reliably).
- MMLU (5-shot): 82.58 and MMLU-Pro (5-shot): 66.02 (broad knowledge with tougher follow-up showing resilience).
- TriviaQA (5-shot): 83.30 (good fact recall under prompting).
- ARC Challenge (0-shot): 65.44 (reasoning over science facts without examples).
- BBH (few-shot): 65.70 (challenging reasoning tasks, respectable for a base model).
- GPQA Diamond (5-shot): 43.94 (graduate-level questions are hard; the model shows emerging skill but room to grow).
Instruct-tuned (Trinity Large Preview):
- MMLU: 87.21; MMLU-Pro: 75.25; GPQA Diamond: 63.32; SimpleQA: 23.92; AIME25: 24.36. These jumps show the value of even light post-training.
Long-context highlights:
- Trinity Large trained at a 256K sequence length achieved an MK-NIAH score of 0.994 (near-perfect retrieval) and even 0.976 at 512K without training at that length; it reached 0.42 at 1M, hinting it can be pushed further.
Throughput:
- With FP8 and vLLM, Trinity Large’s extreme sparsity and interleaved attention deliver strong tokens-per-second, competitive for practical deployment scenarios.
Surprises and notable findings:
- Size helps context extension: larger models recover faster and generalize longer. Training at or beyond the target window measurably improves target performance.
- RSDB sharply cut Batch Heterogeneity and stabilized gradients without dropping tokens; matching its stability with a naive pipeline required much larger batch sizes.
- SMEBU stopped the late-stage routing divergence seen with sign-only updates. Once all stabilizers (SMEBU, sequence-wise aux loss, z-loss, extra dense layers, intra-doc masking) were enabled, training ran smoothly to the end with zero loss spikes.
Context for the numbers:
- An MBPP+ of 88.62% is like earning an A in a practical coding class. HellaSwag 90.11% means the model usually picks the sensible continuation over tricky distractors. MMLU 82.58% (base) vs 87.21% (instruct) shows instruction tuning’s clear payoff.
Bottom line: Compared to dense peers with more active parameters, Trinity Large Base competes strongly while activating only about 13B parameters per token thanks to its extreme sparsity. Its long-context results are a standout among open models.
05 Discussion & Limitations
Limitations:
- Post-training was intentionally light due to time constraints, so the instruct model is a preview rather than fully polished; further SFT/RL would likely raise performance.
- Multilingual tokenization wasn’t trained on the final, larger multilingual corpus, so some non-English languages may be under-compressed or underperforming compared to specialized tokenizers.
- MoE still demands careful engineering; although SMEBU helps a lot, extreme sparsity can be brittle if hyperparameters drift.
- Long-context inference, while efficient here, still benefits from strong hardware and memory planning, especially at 256K–512K windows.
- Synthetic data quality (over 8T tokens) is critical; if some synthetic subsets are noisy, certain skills may plateau.
Required resources:
- Significant compute: Trinity Large was trained on 2048 B300 GPUs; smaller models used 512 H200s. Efficient kernels (e.g., Liger), modern PyTorch, and a robust training stack (TorchTitan mods, FSDP, expert parallelism) are needed.
- Data curation infra: Ray + vLLM over Kubernetes for large-scale synthetic data generation and filtering.
- Serving stack: vLLM with FP8 for best throughput; memory-aware KV cache planning.
When not to use:
- Tiny devices without GPU acceleration or apps with very short contexts that don’t need MoE-scale capacity may prefer optimized small dense models.
- Tasks requiring specialized multilingual or domain tokenizers not well covered by the 200k BPE may see better results with targeted tokenizers.
- If you cannot maintain routing/load-balance metrics or tune MoE hyperparameters, a dense model might be simpler operationally.
Open questions:
- Can we push sparsity further (even fewer active experts) without losing accuracy by improving routing and balancing beyond SMEBU?
- What are the best long-context curricula and datasets to reach robust 1M+ context reliably?
- How does Muon interact with other normalization and gating designs at even larger scales and batch sizes?
- What are optimal expert granularity and K-per-token settings for different throughput/quality trade-offs?
- How to systematically evaluate and filter trillion-scale synthetic data to maximize signal per token?
06 Conclusion & Future Work
In three sentences: Trinity introduces an open family of sparse Mixture-of-Experts models that keep only a few experts active per token, making them fast and efficient while remaining very capable. A new balancing method, SMEBU, stabilizes routing with smooth, bounded, momentum-guided updates, yielding clean loss curves and avoiding collapsed experts. Interleaved local/global attention, gated attention, and practical engineering (RSDB, Muon, depth-scaled sandwich norm) together deliver strong long-context and benchmark performance at high sparsity.
Main achievement: Showing that a 400B-parameter, highly sparse MoE can train stably end-to-end—with zero loss spikes—by combining SMEBU with a carefully balanced attention stack and a variance-reducing data pipeline.
Future directions:
- Push context reliably to 1M+ with targeted data and possibly light global-layer tweaks.
- Expand multilingual coverage by retraining the tokenizer on the finalized corpus and refining non-English data.
- Deepen post-training (instruction tuning and RL) and agentic coding traces to raise reasoning and tool-use skills.
- Explore increased sparsity, different expert granularities, and adaptive K-per-token for better throughput/quality trade-offs.
Why remember this: Trinity is a blueprint for training and serving very large, open models that are both fast and stable. It shows how gentle, smart balancing (SMEBU) and zoom-in/zoom-out attention can make extreme sparsity practical, opening the door to scalable, affordable long-context AI in real products.
Practical Applications
- Analyze long legal contracts or compliance documents quickly by scanning locally and linking globally.
- Understand and refactor large codebases by following cross-file references within very long contexts.
- Create research assistants that read full papers, appendices, and supplementary data in one pass.
- Power customer-support agents that remember long histories and policies without slowing down.
- Do financial or scientific report generation from massive, mixed-format inputs (PDFs, tables, text).
- Run enterprise-private chatbots with open weights for data control, auditing, and jurisdictional compliance.
- Teach math and programming with step-by-step reasoning that scales across long examples and traces.
- Perform retrieval-augmented generation where retrieved chunks are very large but still processed efficiently.
- Automate multi-step software tasks using agentic coding traces (edit–run–test loops).
- Summarize or compare entire books, logs, or repositories within a single, stable context window.