Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Key Summary
- Step 3.5 Flash is a huge but efficient AI that keeps 196 billion total parameters but only wakes up about 11 billion per token, so it thinks smart and fast.
- It mixes three speedy Sliding-Window Attention layers with one Full Attention layer, adds tiny gates per head, and predicts multiple next tokens to cut wait time.
- A new RL method called MIS-PO filters out off-track samples instead of weighting them, making learning stable for long, multi-step reasoning.
- Special care was taken to keep training rock-solid: improved Muon optimizer use, expert balancing, and detection of hidden activation blow-ups.
- The team trained on 17.6T pretraining tokens plus a 750B mid-training extension to 128k context, then refined with SFT, expert RL, and self-distillation.
- Across coding, math, and agent tasks, it scores near frontier models while using far fewer active parameters (e.g., 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6).
- Head-wise gated attention beats fixed sink tokens without slowing the model, and more SWA query heads (96) erase most of the quality loss from hybrid attention.
- A load-balancing trick reduces MoE stragglers in distributed setups, boosting speed when many GPUs cooperate.
- The model shines in agentic benchmarks (like Terminal-Bench 2.0 at 51.0%) and long-context browsing with context management (69.0%), matching much larger systems.
- Main limitation: it sometimes writes longer chains of thought than rivals to reach the same quality, so compressing its thinking is a next step.
Why This Research Matters
Real work needs agents that read long contexts, use tools, and respond quickly without needing a supercomputer. Step 3.5 Flash proves you can keep a giant knowledge base but only activate a small, fast slice per token, so it fits normal servers. The hybrid attention rhythm plus MTP slashes wait time for multi-turn tasks like coding, browsing, and terminal operations. MIS-PO keeps RL stable on long chains of thought, so the model can genuinely improve at planning and execution. This lets companies deploy powerful assistants that pass tests, fix bugs, and write reports with fewer resources. For users, that means faster help, better reliability, and less cost.
Detailed Explanation
01 Background & Problem Definition
Imagine a school science fair where some teams are super fast but only do simple projects, while others build amazing robots but take forever to finish. For a long time, open-source AI felt like that: either quick but not very deep, or deep but too slow and costly to use as a real assistant. As AI shifted from just answering questions to acting like an agent (planning steps, using tools, browsing, coding, and fixing errors), two pains got loud: deep reasoning still lagged behind the best closed models, and the system slowed down badly on long, multi-turn tasks. Before this work, many models used one-size-fits-all layers for every token. That’s simple, but wasteful: you pay the full price for every part of the brain, even if only a few parts are truly needed. Attention also grew expensive as context got longer (because standard attention checks every word against every other word), and reinforcement learning (RL) on long, branching reasoning paths often shook itself apart—tiny differences early on snowballed into noisy, unstable updates. People tried a bunch of fixes. Dense models got bigger to think better, but the bill came due at inference time: memory blew up, latency hurt, and deployment at scale got tricky. Some switched to linear attention to tame long contexts, but that clashed with fast speculative decoding and didn’t clearly win on agent tasks. Others added “sink tokens” to absorb useless attention, but those were dumb placeholders that couldn’t adapt to each input. In RL, importance sampling with ratios (like PPO for LLMs) seemed okay on short tasks, but in very long, off-policy reasoning it became twitchy—tiny token probability changes multiplied into huge gradient swings. 
Meanwhile, mixture-of-experts (MoE) promised speed by only activating a few experts per token—but routing imbalances caused GPU “stragglers,” and dangerous training pathologies lurked: experts would silently collapse or some experts’ activations would explode without the loss even noticing. The missing piece was a whole-system design: match the model architecture to real agent workloads, keep the brain big but the active part small, make attention long but cheap, predict multiple tokens to reduce waiting, and make RL stable for long reasoning paths even inside an MoE. All that needed careful training infrastructure, load balancing, and early-warning signals for hidden failures. Enter Step 3.5 Flash. It’s built to be both sharp and fast: a 196B-parameter MoE backbone, but with only 11B “awake” per token; an attention layout that mostly looks nearby (sliding window) but refreshes global memory regularly (full attention); head-wise gates that adaptively damp useless attention; and multi-token prediction (MTP-3) so you can draft several tokens and confirm them quickly. Under the hood, it monitors training like a hawk, stabilizes MoE routing, and clips dangerous activations before they blow up. For RL, it replaces jittery ratio-weighted updates with MIS-PO, which simply filters away off-distribution samples at both token and trajectory levels, so learning stays calm even on very long thoughts. Why should anyone care? Because many real jobs are long, fiddly, and interactive: programming with tests and PRs, browsing multiple sources then writing a report, or operating a terminal to fix issues. Latency (how long you wait) equals money. Reliability equals trust. This research shows you can get frontier-like results with a small active brain, so it fits on regular servers, runs fast, and still handles big contexts—making sophisticated agents practical for companies, researchers, and even advanced personal workflows.
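The "jittery ratio-weighted updates" problem can be made concrete with a tiny numeric sketch. If each token's train/inference probability ratio is only slightly off from 1, the product over a long trajectory still swings wildly; the jitter value and helper function below are illustrative, not from the paper.

```python
import math
import random

random.seed(0)

def trajectory_weight(num_tokens, jitter=0.05):
    """Product of per-token importance ratios p_train / p_infer.

    Each per-token log-ratio is drawn uniformly from [-jitter, jitter],
    mimicking tiny train/inference probability mismatches.
    """
    log_w = sum(random.uniform(-jitter, jitter) for _ in range(num_tokens))
    return math.exp(log_w)

for length in (10, 100, 1000):
    weights = [trajectory_weight(length) for _ in range(300)]
    spread = max(weights) / min(weights)
    print(f"{length:>5} tokens: max/min weight spread ≈ {spread:.1f}x")
```

Per-token noise of only ±5% compounds: by 1000 tokens the heaviest sampled trajectory outweighs the lightest by orders of magnitude, which is exactly the gradient-variance problem that dropping (rather than reweighting) samples sidesteps.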
02 Core Idea
The aha! moment in one sentence: Keep a giant library of knowledge (many experts) but only open the few shelves you need right now, use a fast local-to-global attention rhythm, guess several next words at once, and train with an RL filter that ignores off-track samples so learning stays smooth. Three analogies:
- Library with a speedy librarian: The library is huge (many experts), but the librarian fetches just a handful of the right books for each question. Most of the time you re-check your last notes (sliding window), but every fourth step you glance at the big map (full attention). You even draft a few sentences ahead (MTP), then quickly check they make sense. If a study path veers off, you discard it (MIS-PO) and keep only the trustworthy trails.
- Cooking with stations: A pro kitchen has many stations (experts). For each dish, only a few stations cook. The chef mainly watches the last few steps (SWA) but periodically checks the whole kitchen (full attention). They prep several steps in advance (MTP). For training new chefs (RL), you throw out practice runs that don’t match the plan (MIS-PO) instead of trying to fix them with awkward weighting.
- Hike with checkpoints: Most steps you look a few meters ahead (SWA), then sometimes you check the panorama (full). You plan a few steps in advance (MTP). If your breadcrumb trail disagrees too much with the safe route (MIS-PO), you restart from a solid checkpoint.
Before vs After:
- Before: Dense layers everywhere, high long-context cost, speculative decoding friction, sink tokens that can’t adapt, and ratio-weighted RL that shakes on long chains.
- After: Sparse MoE lights up only what’s needed, the S3F1 attention layout is context-savvy and fast, head-wise gates adaptively absorb noise, MTP reduces latency, and MIS-PO filters out off-policy mess to steady learning.
Why it works (intuition):
- Sparse MoE raises total capacity but lowers per-token compute—like owning a city library but checking out only two books.
- Hybrid attention (3 SWA + 1 full) cuts the quadratic long-context work while keeping periodic global links.
- Head-wise gates act like smart volume knobs per attention head—they turn down heads when nearby text is unhelpful.
- MTP-3 drafts multiple tokens; verification is cheap on bandwidth-bound hardware, so you gain wall-clock speed.
- MIS-PO converts risky ratio-scaling into a clean keep-or-drop rule at token and trajectory levels, which slashes gradient noise over long horizons.
Building blocks (with the Sandwich pattern):
🍞 Hook: You know how a big school has many teachers, but you only meet with a few who can help with your exact question?
🥬 The Concept: Sparse Mixture-of-Experts (MoE) lets the model keep many specialized “experts” but activates only a small set per token.
- How it works: (1) Tokens ask a router which experts to visit; (2) Top-k experts process the token; (3) Outputs are combined; (4) Extra losses encourage balanced usage so no GPU gets stuck as a straggler.
- Why it matters: Without MoE, the whole network runs every time—expensive and slow. MoE keeps capacity high but compute low.
🍞 Anchor: When solving a geometry problem, only math experts wake up; writing experts nap—so answers come faster.
🍞 Hook: Imagine reading where you mostly look at the last page you read, plus sometimes peek back further to remember the plot.
🥬 The Concept: Sliding-Window Attention (SWA) focuses on a recent window, and every few layers Full Attention reconnects long-range info.
- How it works: (1) SWA attends over the last W=512 tokens; (2) Every fourth layer, Full Attention scans all; (3) SWA uses more query heads (96) to keep quality; (4) GQA-8 aligns memory for speed.
- Why it matters: All-full attention is pricey for long contexts. SWA keeps decoding fast while still refreshing global links.
🍞 Anchor: In coding chats, it mostly needs the last few steps, but occasionally revisits earlier files to stay consistent.
🍞 Hook: When you write, you sometimes think a few words ahead.
🥬 The Concept: Multi-Token Prediction (MTP-3) drafts several future tokens and then verifies them quickly.
- How it works: (1) Add tiny MTP heads predicting t+2, t+3, t+4; (2) Train MTP-1 first, clone to others; (3) Use position-dependent losses; (4) Verify drafts with fast checks.
- Why it matters: Predicting one token at a time is slow. Drafting multiples reduces latency without hurting accuracy.
🍞 Anchor: It’s like typing three words you’re confident about and autocorrect confirming them at once.
🍞 Hook: Volume knobs let you turn down noisy music.
🥬 The Concept: Head-wise Gated Attention gives each attention head a tiny, data-dependent gate to damp unhelpful focus.
- How it works: (1) Compute a scalar gate per head from the input; (2) Multiply head outputs by the gate; (3) Adapt per token; (4) Negligible overhead.
- Why it matters: Fixed sink tokens can’t adapt, but gates are input-aware—less noise, better quality.
🍞 Anchor: If no useful info is nearby, the gate turns that head down so the model doesn’t get distracted.
🍞 Hook: Learning to ride a bike means trying, getting feedback, and improving.
🥬 The Concept: Reinforcement Learning (RL) rewards good behavior and nudges the model toward better reasoning and tool use.
- How it works: (1) Generate; (2) Score with verifiers or preference models; (3) Update policy and value; (4) Repeat.
- Why it matters: Pure imitation plateaus; RL sharpens decision-making over long tasks.
🍞 Anchor: The model practices coding tasks, gets points for passing tests, and learns smarter fixes.
🍞 Hook: If a compass points way off, you don’t keep walking—you reset.
🥬 The Concept: MIS-PO filters samples that drift too far from the training policy instead of reweighting them.
- How it works: (1) Compute token and trajectory ratios; (2) Keep if inside safe bounds; (3) Drop if off-distribution; (4) Treat kept data as on-policy for low-variance updates.
- Why it matters: Long chains make importance weights unstable. Filtering keeps gradients calm and learning steady.
🍞 Anchor: If a drafted solution diverges, MIS-PO says “skip this one” rather than forcing a shaky correction.
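To make the keep-or-drop idea concrete, here is a minimal sketch of a MIS-PO-style filter. The token and trajectory bounds are the illustrative values quoted in the Methodology section; the function name, the use of log-probabilities, and the geometric-mean trajectory statistic are our own assumptions, not the paper's exact formulation.

```python
import math

# Illustrative bounds (see the Methodology section); real values may differ.
TOKEN_BOUNDS = (0.5, 2.0)
TRAJ_BOUNDS = (0.996, 1.001)

def mispo_keep(train_logps, infer_logps):
    """Return True if a sampled trajectory is close enough to on-policy.

    A trajectory survives only if every per-token importance ratio AND an
    aggregate trajectory ratio (geometric mean -- an assumption here) stay
    inside tight bounds. Kept samples are then treated as on-policy, so no
    ratio-scaling ever enters the gradient.
    """
    ratios = [math.exp(a - b) for a, b in zip(train_logps, infer_logps)]
    if any(not (TOKEN_BOUNDS[0] <= r <= TOKEN_BOUNDS[1]) for r in ratios):
        return False  # a single off-track token disqualifies the sample
    traj_ratio = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return TRAJ_BOUNDS[0] <= traj_ratio <= TRAJ_BOUNDS[1]
```

Identical policies pass; a trajectory whose tokens each drift by just half a percent fails the trajectory bound, so systematic drift is caught even when no single token looks alarming.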
03 Methodology
At a high level: Input tokens → Embedding → Hybrid Attention blocks (3 SWA + 1 Full) with head-wise gates → MoE feed-forward (288 experts, top-8 active) → LM head + MTP heads → Output tokens. Then train in stages: pre-train (4k→32k), mid-train (32k→128k), post-train (SFT → expert RL → self-distill → scalable RL with MIS-PO).
Step A: Hybrid Attention for speed with reach.
- What happens: For three layers, the model uses Sliding-Window Attention (W=512) with 96 query heads; the fourth layer uses Full Attention with GQA-8. Tiny head-wise gates modulate each head’s output.
- Why it exists: Long contexts make full attention pricey. SWA keeps decoding cheap while the periodic full layer refreshes long links. Gates adaptively absorb noise when the local window has little signal.
- Example: You’re editing code: most of the time you only need the last function you touched; sometimes you scan the whole file for definitions.
- What breaks without it: All-full means slow prefill/decoding; all-window means you forget far-away definitions and drift.
Step B: Sparse MoE backbone with EP-group balancing.
- What happens: Each token is routed to k=8 experts out of 288 plus one shared expert. A loss encourages balanced loads across expert-parallel groups to avoid GPU stragglers.
- Why it exists: Use lots of total capacity but keep per-token compute and wall-clock latency low; keep distributed training and inference smooth.
- Example: During math, calculus experts work more; for code, parsing/build experts wake up; the router balances traffic so no one GPU becomes a bottleneck.
- What breaks without it: Either you compute too much (dense) or GPUs idle while one overloaded rank stalls.
Step C: Multi-Token Prediction (MTP-3) for latency cuts.
- What happens: Three small MTP heads draft tokens at offsets beyond the standard LM head. Train MTP-1 during main phases, then clone to MTP-2/3 and fine-tune briefly.
- Why it exists: On bandwidth-bound hardware, verifying multiple drafted tokens in parallel brings big wall-clock wins.
- Example: Autocomplete proposes three words; a quick checker confirms them faster than typing one-by-one.
- What breaks without it: You generate too slowly for interactive agents.
Step D: Pre-training curriculum with stability guardrails.
- What happens: 17.6T tokens at 4k→32k context, then mid-training 750B tokens to 128k. Use improved Muon, careful RoPE scaling, freeze router in mid-training, and monitor for expert collapse and activation blow-ups. Apply activation clipping inside experts where needed.
- Why it exists: MoE training can fail silently (dead experts, norm blow-ups). The observability stack and clipping keep training smooth.
- Example: The metrics server watches per-expert norms; if a few experts spike, activation clipping dials them back.
- What breaks without it: Loss may look fine while internal states explode, leading to sudden crashes later.
Step E: Post-training: SFT → Expert RL → Self-Distill → Scalable RL (MIS-PO).
- What happens: (1) Broad SFT on reasoning, code, STEM, tool use, long context; (2) Train domain experts with verifiable rewards; (3) Self-distill experts into a single student; (4) Run RL with MIS-PO across mixed tasks using RLVR (rule/model verifiers) and GenRM for preferences.
- Why it exists: Keep a single strong generalist without losing specialist skills; stabilize long-horizon learning.
- Example: The code expert solves repos; the math expert tackles contest problems; the best patterns are distilled; MIS-PO then polishes the generalist safely.
- What breaks without it: Either you juggle many models (costly), or one model forgets specialist skills; PPO-style updates jitter on long chains.
Step F: Rewards and safety shaping.
- What happens: Verifiable tasks use checkers (tests, logic rules); STEM uses model verifiers; preferences use a generative reward model (GenRM) with MetaRM penalizing spurious reasoning; agent tasks use rubric-based judges with asymmetric rewards.
- Why it exists: Clearer, more reliable signals beat vague thumbs-up/down. Penalizing wrong-but-confident chains keeps reasoning honest.
- Example: If a code fix passes unit tests, that’s a strong reward; if a report cites fake links, reward is zero.
- What breaks without it: The model overfits to length, hallucinates citations, or optimizes the wrong objective.
Step G: MIS-PO details (the secret sauce).
- What happens: Compute token-level and trajectory-level ratios between training and inference policies. Keep samples only if inside tight bounds (e.g., token [0.5,2], trajectory ~[0.996,1.001]); treat kept data as on-policy; use truncation-aware value bootstrapping; monitor routing confidence as a stability proxy.
- Why it exists: Long reasoning magnifies tiny probability shifts; filtering reduces variance far more than bounding does.
- Example: If a path drifts from the intended distribution, skip it; otherwise, learn from it strongly and calmly.
- What breaks without it: Ratio-weighted methods spike gradients, slow convergence, or destabilize MoE routing.
Data pipeline and agent infrastructure (brief):
- Data synthesis includes verified tool-use trajectories (finite state machines + execute-verify), code environments with unit tests and error recovery, and browsing/report tasks with graph-based multi-hop checks. Templates retain only the latest tool-use thread to save context. XML tool calls reduce parsing errors compared to strict JSON in smaller models.
- Outcome: High-density, verifiable supervision that matches real agent workflows.
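Steps A and B above can be tied together in a short structural sketch: the S3F1 layer schedule, a toy top-k router, and a head-wise gate. Function names, the shared-expert convention, and the gate parameterization are our own illustration, not the released implementation.

```python
import math

def attention_kind(layer_idx, period=4):
    """S3F1 layout: three Sliding-Window Attention layers, then one
    Full Attention layer, repeating every `period` layers."""
    return "full" if layer_idx % period == period - 1 else "swa"

def route_topk(scores, k=8):
    """Toy MoE router: pick the k highest-scoring experts (of 288 in the
    real model) plus one always-on shared expert, marked -1 here."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [-1] + ranked[:k]

def gate_head(head_output, gate_logit):
    """Head-wise gate: a scalar, input-dependent sigmoid damps one head's
    output; near-zero gates silence heads with nothing useful to attend to."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [h * g for h in head_output]

# One repeating block of the hybrid schedule:
print([attention_kind(i) for i in range(8)])
# → ['swa', 'swa', 'swa', 'full', 'swa', 'swa', 'swa', 'full']
```

The schedule helper shows why decoding stays cheap: only every fourth layer pays the full-attention cost, while the router keeps per-token compute at 8 routed experts regardless of total capacity.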
04 Experiments & Results
The test: Can we get frontier-like intelligence with only 11B active parameters, keep latency low, and stay stable on long, tool-using tasks? The team measured reasoning (AIME 2025, HMMT 2025, IMO-AnswerBench), coding (LiveCodeBench-v6, SWE-Bench variants), agents (Terminal-Bench 2.0, BrowseComp with context management, GAIA, τ-Bench), general knowledge (MMLU-Pro, GPQA-Diamond, Arena-Hard v2), and long-context (LongBench v2, FRAMES, RepoQA, MRCR).
The competition: Strong open and closed baselines like DeepSeek V3.2, GLM 4.7, MiMo V2 Flash, Kimi K2.5, plus frontier closed models like GPT-5.2 xHigh, Gemini 3.0 Pro, and Claude Opus 4.5.
Scoreboard with context:
- Reasoning: 97.3% on AIME 2025 (vanilla, 99.9% with PaCoRe), 94.0–98.4% on HMMT 2025 variants with PaCoRe, and 85.4% on IMO-AnswerBench (88.8% with PaCoRe). That’s like scoring A+ where many models get solid As.
- Coding: 86.4% on LiveCodeBench-v6 (88.9% with PaCoRe), competitive with top systems; strong generalization across SWE-Bench Verified and Multilingual.
- Agentic: 51.0% on Terminal-Bench 2.0; 69.0% on BrowseComp with context management; 88.2% on τ-Bench. These are tough multi-step, tool-heavy settings; Step 3.5 Flash lands near or at frontier levels despite using far fewer active parameters.
- General & long-context: 84.4% on MMLU-Pro; 83.5% on GPQA-Diamond; solid on LongBench v2 and FRAMES; RepoQA reaches 88.5%, indicating robust repository-level reasoning.
Surprising findings:
- Hybrid attention needs care: A plain 3:1 SWA:Full layout drops quality, but bumping SWA query heads to 96 and adding head-wise gates regains most of it with almost no latency cost. Alternating S1F1 can be slightly higher quality but costs ~60% more attention FLOPs, so S3F1+Head is the sweet spot.
- Head-wise gates beat sink tokens: On a 100B MoE, adaptive head gating outperforms fixed sink tokens across pretraining benchmarks. Tiny change, clear win.
- Hidden instabilities don’t show in loss: Some experts’ activations silently blow up near the end of the network. Monitoring max-to-median expert norms reveals the issue; activation clipping fixes it better than offline weight clipping.
- MIS-PO vs PPO: MIS-PO shows lower gradient noise, steadier entropy, and faster reward climb on long-horizon tasks. Dropping off-distribution samples is simpler and more stable than weighting them.
- Capacity density matters: Despite only 11B active parameters, Flash matches or beats much larger models on several tests, especially when combined with PaCoRe test-time scaling, proving the architecture’s efficiency frontier has shifted.
What the numbers mean in everyday terms: It’s like a team finishing a complex group project as fast as the top private school, while using fewer people per task and wasting less time switching contexts. They still check the big picture often enough to stay coordinated, and they practice in a way that avoids learning from confusing or off-track drills.
PaCoRe synergy: Because the base model is very fast, running many parallel reasoning paths and then combining them pays off. Across math, coding, and research, PaCoRe delivers notable deltas (e.g., +3–10% on some hard sets), turning speed headroom into quality gains without retraining.
Bottom line: Step 3.5 Flash redefines the efficiency frontier: frontier-like accuracy and tool-work reliability with only 11B active parameters, strong long-context skills, and stable RL—evidence that careful co-design can beat brute-force scale.
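The section does not spell out PaCoRe's mechanics, but the "speed headroom into quality" trade can be illustrated with a generic self-consistency sketch: sample many cheap reasoning paths and aggregate by majority vote. The probabilities, names, and vote count below are purely illustrative, not PaCoRe itself.

```python
import random
from collections import Counter

random.seed(1)

def sample_answer(p_correct=0.7):
    """Stand-in for one reasoning path: right answer with prob p_correct,
    otherwise one of two plausible wrong answers."""
    return "42" if random.random() < p_correct else random.choice(["41", "43"])

def majority_vote(n_paths):
    """Aggregate n independent paths; a fast base model makes large n cheap."""
    votes = Counter(sample_answer() for _ in range(n_paths))
    return votes.most_common(1)[0][0]

trials = 2000
single = sum(sample_answer() == "42" for _ in range(trials)) / trials
voted = sum(majority_vote(15) == "42" for _ in range(trials)) / trials
print(f"single path ≈ {single:.2f}, 15-way vote ≈ {voted:.2f}")
```

In this toy setup a roughly 70%-accurate single path becomes far more reliable under a 15-way vote, which is why a model that is cheap per token benefits disproportionately from parallel test-time scaling.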
05 Discussion & Limitations
Limitations:
- Token efficiency: To match rivals’ best outputs, Flash sometimes thinks longer (more tokens). Compressing the chain-of-thought without losing accuracy is an open goal.
- Distribution shifts: In very specialized domains or very long, multi-turn chats, it may repeat, mix languages, or slip on time/identity details.
- Tool template choices: XML improves robustness for smaller models, but enterprise stacks may require strict JSON; translations must be monitored.
Required resources:
- Training used a 4,096 H800 GPU cluster, hybrid parallelism, and careful communication scheduling. While inference is lean (11B active), reproducing training requires significant compute, strong telemetry, and MoE-aware engineering.
When NOT to use:
- Ultra-low-latency, ultra-low-cost edge devices without GPU memory for the KV cache and MoE structures.
- Tiny-batch streaming where full-attention refreshes are infeasible and you cannot exploit MTP/speculative verification.
- Settings requiring extremely short answers by design (e.g., fixed-form responses) where the longer chain-of-thought offers little benefit.
Open questions:
- Can we compress reasoning tokens via learned summarization or on-policy distillation while keeping accuracy?
- Can routing confidence be actively controlled to make off-policy RL even more robust without router replay?
- What’s the best universal hybrid attention schedule beyond 3:1? Can schedules adapt online to task structure?
- Can MIS-PO bounds be auto-tuned per domain to balance sample efficiency and stability?
- How far can verified tool-use pipelines scale across enterprise software stacks with non-uniform APIs, rate limits, and flaky environments?
06 Conclusion & Future Work
Three-sentence summary: Step 3.5 Flash is a big-but-efficient MoE model that lights up only 11B active parameters per token, runs a hybrid attention rhythm (3 SWA + 1 Full) with head-wise gates, and drafts multiple tokens at once to cut latency. For learning, it replaces jittery ratio-weighted RL with MIS-PO, a filter that keeps only near-on-policy samples, stabilizing long-chain reasoning—even in an MoE. The result is frontier-level accuracy on math, coding, and agent tasks while redefining efficiency for real-world deployment.
Main achievement: Proving that careful co-design—sparse experts, hybrid attention with smart gating, MTP-3, and MIS-PO—can deliver frontier intelligence with a fraction of the active parameters and practical latency for agentic workloads.
Future directions: Compress chains-of-thought to reduce tokens while keeping quality; refine on-policy distillation for better sample efficiency; auto-tune routing and MIS-PO thresholds; and extend verified tool-use pipelines to broader, messier real-world environments. As attention layouts and filtering RL evolve together, we expect even better stability, speed, and adaptability.
Why remember this: It shows that “small active brain, big library” is not just a neat idea—it works at scale, stays stable, and makes sophisticated agents fast enough for everyday use.
Practical Applications
- High-throughput coding assistants that open PRs, fix bugs, and pass unit tests in enterprise CI systems.
- Research copilots that browse multiple sources, check citations, and produce rubric-aligned reports.
- Terminal agents that automate server maintenance tasks with safer, verifiable feedback loops.
- Customer support copilots that search knowledge bases, follow tool protocols, and return concise, correct answers.
- Data engineering helpers that read large repositories (RepoQA), refactor pipelines, and explain diffs.
- STEM tutors that solve competition math and explain steps clearly while resisting spurious reasoning.
- Internal documentation synthesizers that digest multi-file repos and long docs into accurate summaries.
- Secure workflow bots that execute only verified tool calls (FSM-based) and avoid hallucinated actions.
- On-prem enterprise assistants that maintain low latency while handling 128k+ token contexts.
- Evaluation frameworks that use MIS-PO to stabilize RL on long, multi-turn agent training.