2Mamba2Furious: Linear in Complexity, Competitive in Accuracy
Key Summary
- The paper studies Mamba-2 (a fast, linear-time attention method) and pares it down to the pieces that truly boost accuracy.
- It then introduces 2Mamba, which squares the query–key match scores to make the model more expressive while keeping memory fixed with respect to sequence length.
- A learned decay mask (A-mask) built with a softplus function and a tiny input convolution (window size 2) are the biggest accuracy boosters.
- 2Mamba reaches almost the same accuracy as classic softmax attention but uses much less memory for long contexts (over ~1,000 tokens per head with 64-dim heads).
- Time discretization can help small models but caused instability in medium models under low-precision math, so the authors drop it for stability.
- On long-context retrieval (Needle-in-a-Haystack), 2Mamba slightly beats softmax and clearly outperforms Mamba-2, showing strong context use.
- An exponentiated variant, 2Mamba-E, effectively becomes softmax attention with a forget gate and a small convolution, and slightly outperforms standard softmax (but needs a KV cache).
- Custom Triton kernels and an efficient way to form only the unique second-order features make 2Mamba practical.
- Overall, the work narrows the accuracy gap between linear and softmax attention while preserving linear-time training and fixed-memory inference.
Why This Research Matters
Long documents, codebases, and conversations push memory limits. 2Mamba brings near-softmax accuracy without the growing memory costs, so we can handle very long contexts on smaller, cheaper hardware. This makes assistants more helpful on laptops, phones, or modest GPUs where KV caches would be too expensive. It also lowers serving costs in the cloud by keeping memory steady as prompts get longer. Better long-context retrieval means models can actually use the extra context efficiently, not just store it. Finally, connecting linear attention to forget-gated softmax designs opens a path to even smarter, more controllable long-context models.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you have to whisper a secret across a really long line of friends. If everyone can talk to everyone, it gets crowded and slow. But if you pass the message smartly, person to person, you can go fast without shouting.
🥬 The Concept: Before this paper, most language models used softmax attention, which lets every word talk to every other word at once. That’s very accurate but the cost grows fast as the sentence gets longer (like a classroom getting noisy when everyone talks at the same time). To go faster, people built linear attention methods where information moves more like a line—steady, predictable, and much cheaper as things get long. The trouble? Linear attention was usually less accurate.
- What it is: A story about speed vs. smarts. Softmax is very smart but expensive; linear attention is fast but often less sharp.
- How it works (history):
- Softmax attention becomes the default because it’s highly accurate.
- People invent linear attention to save time and memory for long texts.
- New ideas like Mamba and Mamba-2 add clever pieces (like a learned decay mask and parallel tricks) to make linear attention more expressive.
- Still, a gap in accuracy remains compared to softmax.
- Why it matters: If we can keep the speed and memory benefits of linear methods while matching softmax accuracy, we can handle very long documents, fit bigger contexts on cheaper hardware, and run models on devices with tight memory.
🍞 Anchor: Think of reading a huge book on a tablet with a tiny battery. Softmax drains the battery fast; linear attention sips power slowly but may miss details. This paper shows how to sip power and still catch the important plot twists.
New concept 1 — Linear Attention 🍞 Hook: You know how you can skim a book by glancing at one page at a time instead of rereading the whole book every time? 🥬 The Concept: Linear attention lets the model “skim” efficiently by summarizing what it has seen into a running state.
- What it is: A faster way to focus that grows gently with length.
- How it works: It turns the matching step (queries vs. keys) into something that can be decomposed and updated step-by-step, like updating a running tally.
- Why it matters: Without it, long inputs get too slow or too memory-hungry. 🍞 Anchor: When answering questions about a 5,000-word article, linear attention helps the model keep a compact running summary instead of storing every detail.
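To make the "running tally" concrete, here is a minimal NumPy sketch of causal linear attention as a streaming state update. The epsilon and the random inputs are illustrative choices, not the paper's; the point is that memory stays fixed no matter how long the sequence gets.

```python
import numpy as np

def linear_attention_stream(Q, K, V):
    """Causal linear attention as a running state: S_t = S_{t-1} + k_t v_t^T.
    Memory is O(d_k * d_v), independent of sequence length."""
    L, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))   # running key-value summary
    z = np.zeros(d_k)          # running key sum, used for normalization
    out = np.zeros((L, d_v))
    for t in range(L):
        S += np.outer(K[t], V[t])
        z += K[t]
        out[t] = (Q[t] @ S) / (Q[t] @ z + 1e-6)
    return out

rng = np.random.default_rng(0)
Q = rng.random((5, 4)); K = rng.random((5, 4)); V = rng.random((5, 3))
y = linear_attention_stream(Q, K, V)
```

The streamed result matches the materialized quadratic form exactly; the stream just never builds the full score matrix.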
New concept 2 — Softmax Attention 🍞 Hook: Imagine a teacher asking, “Which words are most important right now?” and the class votes. 🥬 The Concept: Softmax attention gives each word a probability-like score so the model focuses on the most relevant parts.
- What it is: A very accurate, all-to-all information routing method.
- How it works: Compare the query to all keys, exponentiate the scores, normalize them so they sum to 1, and mix the values.
- Why it matters: Without it, the model can’t easily pick the truly important words. 🍞 Anchor: Asked “What’s the capital of France?”, softmax attention zooms in on “capital” and “France” to answer “Paris.”
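For contrast, a minimal NumPy sketch of causal softmax attention, with the standard max-subtraction trick for numerical stability. The sqrt-of-head-dim scaling is the usual convention and the random inputs are placeholders.

```python
import numpy as np

def causal_softmax_attention(Q, K, V):
    """Causal softmax attention: compare each query to all keys,
    exponentiate, normalize to sum to 1, then mix the values."""
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # query-key match scores
    scores[np.triu_indices(L, k=1)] = -np.inf     # causal mask: no future peeking
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)             # probability-like weights
    return w @ V

rng = np.random.default_rng(1)
Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 3))
out = causal_softmax_attention(Q, K, V)
```

Note that the full L-by-L score matrix is materialized here, which is exactly the cost linear attention avoids.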
New concept 3 — Complexity (Quadratic vs. Linear) 🍞 Hook: If every kid in class high-fives every other kid, that’s lots of high-fives! If each kid only high-fives their neighbor, it’s way fewer. 🥬 The Concept: Softmax attention’s work grows like the square of the sequence length; linear attention grows roughly linearly.
- What it is: A measure of how cost rises as inputs get longer.
- How it works: All-to-all (softmax) vs. streaming-style (linear).
- Why it matters: Long contexts become impossible if costs explode. 🍞 Anchor: For 10,000 tokens, quadratic can be painfully slow while linear keeps moving.
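A quick back-of-envelope comparison of the work involved at that length:

```python
# Rough operation counts for attention over L tokens: softmax compares
# every token with every earlier token (~L^2 / 2 pairs), while linear
# attention does a constant amount of state-update work per token (~L).
L = 10_000
quadratic_pairs = L * (L - 1) // 2   # all-to-all comparisons
linear_steps = L                     # one state update per token
ratio = quadratic_pairs / linear_steps
```

At 10,000 tokens the quadratic scheme does roughly 5,000 times more pairwise work per query pass.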
New concept 4 — KV Cache 🍞 Hook: Think of a scrapbook where you keep every page you’ve seen so you can look back any time. 🥬 The Concept: A KV cache stores keys and values for each past token at inference.
- What it is: Memory that grows with the sequence.
- How it works: Each new token looks back over stored past info.
- Why it matters: Powerful but memory grows linearly with length. 🍞 Anchor: A chatbot remembering a long conversation keeps a KV cache; the longer you chat, the more memory it needs.
What was missing: We had fast methods (linear) and accurate methods (softmax), and even improved linear ones like Mamba-2. But we needed a way to make linear attention nearly as expressive as softmax without giving up its memory benefits for long contexts.
Real stakes: This affects anyone who wants long context—researchers scanning hundreds of pages, coders navigating huge files, students studying textbooks, or phones and small GPUs running assistants without running out of memory.
02 Core Idea
New concept 5 — Mamba-2 🍞 Hook: Picture a smart runner who’s fast because they pace themselves and remembers what matters from each mile. 🥬 The Concept: Mamba-2 is a linear-time attention-like model that adds a learned decay (forget) mask and other tweaks to be more expressive.
- What it is: A strong linear attention variant with a learned way to fade older information.
- How it works: It builds queries, keys, values with a tiny convolution, applies a learned A-mask that decays over distance, and aggregates efficiently.
- Why it matters: It narrows the accuracy gap with softmax while staying efficient. 🍞 Anchor: For a long story, Mamba-2 decides how much to remember from earlier chapters using a learned “forget” schedule.
The “Aha!” in one sentence: If we square the query–key match and pair it with a better learned decay mask and a small input convolution, we can make linear attention almost as accurate as softmax while keeping memory fixed with respect to sequence length.
Explain it three ways:
- Vision: Regular linear attention is like a camera with a simple lens; squaring the match is like adding a sharper lens that reveals more detail without making the camera heavier.
- Music: The basic melody (linear attention) sounds flat; adding second-order harmonics (squared match) enriches the song without needing a giant orchestra (big memory).
- Cooking: You’re making soup (attention). The squared match is the umami that makes flavors pop; the A-mask is a timer that decides when older flavors should fade.
Before vs. After:
- Before: Linear attention is efficient but often less accurate. Mamba-2 helps, but there’s still a gap.
- After: 2Mamba (squared inner product + softplus A-mask + tiny conv) reaches near-softmax accuracy while keeping memory usage flat as the sequence grows.
Why it works (intuition):
- Softmax can be seen as adding up higher and higher powers of the match between queries and keys. Linear attention is like the first power (too simple). Squaring the match adds the second power—much more expressive but still manageable.
- Squaring also makes all scores non-negative, letting us use stable “softmax-like” normalization without storing a whole KV cache.
- The A-mask learns how quickly to “forget” distant tokens, focusing attention where it’s most useful.
- A tiny input convolution (window size 2) gives a bit of local context to Q, K, V, which surprisingly helps a lot.
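The "powers of the match" intuition can be checked numerically: the exponential that softmax applies contains every power of the score, and keeping just the squared term already closes most of the gap left by the first-order (linear-attention-like) approximation.

```python
import math

# exp(x) = 1 + x + x^2/2! + x^3/3! + ... : softmax's exponentiated match
# contains every power of the query-key score. Linear attention keeps
# roughly the first-order term; squaring adds the second-order term.
def taylor_exp(x, order):
    return sum(x ** n / math.factorial(n) for n in range(order + 1))

x = 0.5                     # a typical match score (illustrative)
first = taylor_exp(x, 1)    # first-order: linear-attention-like
second = taylor_exp(x, 2)   # adds the squared term
exact = math.exp(x)
```

Here `first` is 1.5 and `second` is 1.625 against an exact value of about 1.649: the second-order term cuts the approximation error by most of its size.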
Building blocks (the key pieces): New concept 6 — A-mask (Decay Mask) 🍞 Hook: Think of yellow sticky notes that fade over time so only fresher notes stay bright. 🥬 The Concept: A learned mask that smoothly lowers the influence of far-away tokens.
- What it is: A per-step learned “forget” gate.
- How it works: Predict a score per position, pass it through a softplus and negate it so it stays negative, accumulate across time, and turn differences between positions into exponential decays; the softplus keeps training stable.
- Why it matters: Without it, the model doesn’t know what to forget and may spread attention too thinly. 🍞 Anchor: In a 5,000-word article, the A-mask helps the model favor the most relevant parts while still remembering key earlier bits if needed.
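A minimal sketch of how such a decay mask could be built. The exact parameterization here (softplus, then negation, then a cumulative sum whose pairwise differences are exponentiated) is an illustrative guess consistent with the description above, not the paper's code.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def decay_mask(raw_scores):
    """Build a learned decay (A-) mask from per-position raw scores.
    Negated softplus keeps each step's log-decay negative; a running sum
    turns pairwise differences into exponential decays. Sketch only."""
    a = -softplus(raw_scores)                 # negative per-step log-decay
    c = np.cumsum(a)                          # running sum over time
    mask = np.exp(c[:, None] - c[None, :])    # decay from position j to i
    return np.tril(mask)                      # causal: zero out the future

raw = np.array([0.5, -1.0, 2.0, 0.0])         # placeholder predictions
M = decay_mask(raw)
```

Each entry is at most 1 and shrinks as the query-key distance grows, which is exactly the "fading sticky note" behavior.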
New concept 7 — Higher-Order Hidden States (Second Order via Squaring) 🍞 Hook: Remembering a story is easier when you remember pairs of ideas that go together, not just single facts. 🥬 The Concept: Using second-order features (like pairs of dimensions) by squaring the Q–K match makes the model’s memory richer.
- What it is: A compact way to approximate parts of softmax’s expressiveness.
- How it works: Form unique second-order combinations efficiently (no duplicates) to keep memory modest.
- Why it matters: Without this, linear attention often lacks the nuance softmax captures. 🍞 Anchor: When reading, you don’t just recall “dog” and “park” separately—you recall “dog at the park,” which is a second-order memory.
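One standard way to realize this idea, sketched below: the squared match (q.k)^2 equals an inner product of second-order feature vectors, and only the unique index pairs i <= j need to be kept. The sqrt(2) weighting on off-diagonal pairs is the usual polynomial-kernel trick, assumed here for illustration rather than taken from the paper's kernels.

```python
import numpy as np

def second_order_features(x):
    """Map a d-dim vector to its unique second-order products x_i * x_j
    for i <= j. Off-diagonal pairs get sqrt(2) so inner products are
    preserved: phi(q) . phi(k) == (q . k) ** 2."""
    d = len(x)
    i, j = np.triu_indices(d)
    feats = x[i] * x[j]
    feats[i != j] *= np.sqrt(2.0)   # each unordered pair counted once, weighted
    return feats                    # d * (d + 1) / 2 features, not d * d

q = np.array([1.0, 2.0, -1.0])
k = np.array([0.5, 1.0, 3.0])
```

For d = 3 this keeps 6 features instead of 9; for d = 64, 2,080 instead of 4,096, which is where the memory saving comes from.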
New concept 8 — 2Mamba and 2Mamba-E 🍞 Hook: Imagine two power-ups: one gives you almost all the strength without extra weight; the other gives you a bit more strength but needs a backpack. 🥬 The Concept: 2Mamba (squared scores) keeps fixed memory and near-softmax accuracy; 2Mamba-E (exponentiated scores) slightly beats softmax but needs a KV cache.
- What it is: Two variants that trade memory for accuracy differently.
- How it works: Squared vs. exponentiated match, both with A-mask and small conv.
- Why it matters: You can choose based on hardware limits and accuracy needs. 🍞 Anchor: On a small GPU, use 2Mamba; on a big server, 2Mamba-E can squeeze a bit more accuracy.
03 Methodology
At a high level: Tokens in → tiny convolution to make Q, K, V → build a learned decay (A-mask) → compute squared query–key matches → apply decay and causal masks → normalize stably → mix with values → project to output.
Step-by-step recipe (2Mamba):
- Input and tiny convolution
- What happens: Take the hidden states and pass them through a causal 1D convolution with window size 2 to produce Q, K, V (per head). No activation is needed.
- Why it exists: A little local context (just one neighbor) makes Q, K, V smarter. Ablations showed this gives a strong accuracy lift with tiny cost.
- Example: For tokens [t1, t2, t3], Q at t3 can peek at t2, helping it sense local patterns like “is t3 continuing a phrase from t2?”
- Build the A-mask (learned decay)
- What happens: A small linear layer predicts a number per position; a softplus followed by a sign flip makes it negative. Take a running sum over time: the difference between any two positions then turns into an exponential decay. Apply the usual causal mask so positions can’t see the future.
- Why it exists: It learns how fast to forget, helping focus attention on the most relevant past.
- Example: If A at position 10 says “forget slowly,” then tokens from positions 1–9 still have influence at 10; if it says “forget quickly,” old tokens fade fast.
- Compute squared query–key match
- What happens: Compute Q·K^T per head and square it. This makes matches non-negative and adds second-order expressiveness.
- Why it exists: Squaring approximates richer interactions (second-order) without jumping all the way to softmax’s full complexity.
- Example: If Q matches K with score 0.7 (or -0.7), squaring gives 0.49; crucially, every score becomes non-negative and structured in a way we can normalize stably.
- Apply masks and normalize
- What happens: Multiply the squared scores by the A-mask and the causal mask. Then normalize across each query’s row (softmax-like, computed online in a stable way without keeping a full KV cache).
- Why it exists: The A-mask prunes distant noise, the causal mask enforces direction, and normalization keeps numbers stable and interpretable.
- Example: If three past tokens have raw scores [2.0, 1.0, 0.5] after decay, normalization turns them into weights that sum to 1, like [0.57, 0.29, 0.14].
- Mix with values and project
- What happens: Use the normalized weights to combine the values V and then linearly project back to the model dimension.
- Why it exists: This is the final “answer” from attention, ready for the rest of the transformer block.
- Example: If the second token gets the most weight for the current query, its value contributes most to the output.
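The recipe above, sketched end to end in its materialized quadratic form. The real method streams this with fixed memory via custom kernels; this O(L^2) NumPy version with random placeholder inputs (and an illustrative epsilon) exists only to make the steps concrete.

```python
import numpy as np

def two_mamba_attention(Q, K, V, raw_decay):
    """Materialized sketch of the 2Mamba recipe: squared query-key match,
    learned decay mask, causal mask, row normalization, value mixing.
    Illustration only; the real method never builds the L x L matrices."""
    scores = (Q @ K.T) ** 2                           # squared match (non-negative)
    a = -np.log1p(np.exp(raw_decay))                  # softplus, negated: log-decay
    c = np.cumsum(a)
    decay = np.tril(np.exp(c[:, None] - c[None, :]))  # A-mask fused with causal mask
    scores = scores * decay                           # prune distant, masked noise
    weights = scores / (scores.sum(axis=1, keepdims=True) + 1e-6)
    return weights @ V                                # mix values; project afterwards

rng = np.random.default_rng(2)
L, d = 8, 4
Q = rng.standard_normal((L, d)); K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d)); raw = rng.standard_normal(L)
out = two_mamba_attention(Q, K, V, raw)
```

The tiny input convolution that produces Q, K, V and the output projection are omitted here; they sit immediately before and after this function in the full block.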
The secret sauce:
- Squared scores = second-order features. You get a big expressiveness jump with a controlled memory footprint.
- A softplus A-mask = a smooth, learned forget gate that beat the original design in ablations.
- Tiny convolution = surprisingly large accuracy gain for very little cost.
- Efficient second-order construction = compute only the unique pairwise products (no duplicates), saving memory and time.
- Stable online normalization = softmax-like behavior without a growing KV cache.
What breaks without each step:
- Remove the convolution: noticeable accuracy drop; Q, K, V become too “local-blind.”
- Remove the A-mask: the model can’t decide what to forget; attention spreads or fixates poorly.
- Don’t square the match: you lose the second-order boost and positive scores; normalization becomes trickier.
- Skip normalization: weights get unstable; training can blow up or underperform.
Implementation notes from experiments:
- Time discretization: helped a bit on small models but caused instability on medium models with low-precision math (TF32). They dropped it in 2Mamba for stability.
- Triton kernels: custom kernels implement the squared variant and efficient second-order features, crucial for speed and stability.
- Memory crossover: With 64-dim heads, 2Mamba uses less memory than a KV cache once the context is longer than roughly 1,000 tokens per head, while keeping accuracy near softmax.
New concept 9 — Normalization (Output vs. Softmax-like) 🍞 Hook: Balancing a seesaw is easier when everyone knows how much they weigh. 🥬 The Concept: You can normalize either at the end (output norm) or directly on attention scores (softmax-like). Squaring enables softmax-like normalization without going full softmax.
- What it is: Different places to “balance” the numbers.
- How it works: Squared scores are non-negative, making online normalization stable.
- Why it matters: Stability and accuracy both improve with the right normalization. 🍞 Anchor: It’s like normalizing the votes before mixing them, so no single token yells too loudly.
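A sketch of why non-negative scores make online normalization easy: the weighted-value numerator and the scalar denominator can be accumulated in a single pass, so the row of weights is never stored. The epsilon is an illustrative choice, and the example scores are the ones from the methodology section above.

```python
import numpy as np

def streamed_output(scores, values):
    """Normalize non-negative attention scores online: accumulate the
    numerator and denominator in one pass over past tokens. Assumes
    scores >= 0, which squaring guarantees; no max-subtraction needed."""
    num = np.zeros_like(values[0], dtype=float)
    den = 0.0
    for s, v in zip(scores, values):
        num += s * v   # weighted-value accumulator
        den += s       # running normalizer
    return num / (den + 1e-6)

scores = np.array([2.0, 1.0, 0.5])   # decayed squared matches for one query
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_stream = streamed_output(scores, values)
```

This gives the same answer as normalizing the whole row first, which is what lets the method behave softmax-like without a growing cache.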
New concept 10 — Needle-in-a-Haystack (NIAH) Test 🍞 Hook: Hide a secret word somewhere in a long story and see if the model can find it later. 🥬 The Concept: A benchmark that checks if the model can recall a tiny but crucial piece of information inside long contexts.
- What it is: A retrieval stress test.
- How it works: Insert a “password” in a long prompt and ask the model to repeat it at the end.
- Why it matters: Shows whether long context is truly useful, not just present. 🍞 Anchor: 2Mamba slightly beats softmax on this test and clearly outperforms Mamba-2.
04 Experiments & Results
The test and why: They trained Llama-2–style models where they swapped the attention block and measured test loss (lower is better) on next-token prediction. They also ran long-context tests (NIAH) to see if the model really used big contexts well.
Who they compared against:
- Vanilla linear attention (fast but less accurate)
- Mamba-2 (a strong linear baseline)
- Softmax attention (gold-standard accuracy, but memory-hungry)
- Their variants: Mamba-2S (simplified Mamba-2), 2Mamba (squared), and 2Mamba-E (exponentiated)
Ablations (what matters most):
- Biggest wins: softplus A-mask and input convolution (window size 2)
- Minor wins: discretization (but unstable on bigger models), small activations in places
- Little or negative: Z-gate and certain residual tweaks
Scoreboard with context:
- Mamba-2 vs. linear: Mamba-2 is much closer to softmax than vanilla linear; think of moving from a C grade to a solid B+/A- in many setups.
- Mamba-2S (simplified): With just the softplus A-mask + tiny conv (+ sometimes discretization), it matches or slightly improves over full Mamba-2 in their tests—like keeping the A- while cutting extra knobs.
- 2Mamba (squared): Reaches near-softmax test loss across sequence lengths (2k, 4k, 8k). That’s like going from an A- to an A without paying the full memory bill.
- 2Mamba-E (exponentiated): Slightly better than softmax; an A+—but needs a KV cache.
Long-context retrieval (NIAH):
- Setup: Train up to 8192 tokens and run the “needle” benchmark.
- Result: 2Mamba slightly outperforms softmax and beats Mamba-2. That means it not only learns to predict the next token well, it also truly uses the long context to find hidden facts.
Surprising findings:
- The tiny convolution (window 2) is a star: very small change, surprisingly big benefit.
- The softplus A-mask variant outperformed the original mask approach.
- Time discretization offered small gains but caused instability in medium models under TF32; removing it made training smooth while keeping accuracy high.
- Exponentiating the match (2Mamba-E) ties back to known transformer ideas (like Forgetting Transformer), and even edges out softmax.
Memory crossover point:
- With 64-dim heads, 2Mamba uses less memory than softmax’s KV cache once context exceeds roughly 1,000 tokens per head, while maintaining near-softmax accuracy. That’s a big win for long documents, logs, and codebases.
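A back-of-envelope check of that crossover, under an assumed accounting: the fixed state holds d(d+1)/2 unique second-order key features times d value dims, while a KV cache stores d key plus d value entries per token. This is a plausible bookkeeping that reproduces the ~1,000-token figure, not necessarily the paper's exact numbers.

```python
# Per-head memory, counted in stored values (dtype ignored).
d = 64                                      # head dimension
second_order_state = d * (d + 1) // 2 * d   # fixed, length-independent state
kv_cache_per_token = 2 * d                  # grows with every token
crossover = second_order_state / kv_cache_per_token
# Beyond ~1,040 tokens per head, the fixed state is the smaller footprint.
```

The crossover scales with d^2, which is also why the limitations section warns about very large head dimensions.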
05 Discussion & Limitations
Limitations (be specific):
- Head dimension sensitivity: The squared second-order state scales with head size; very large head dims can make the fixed state big. Keeping head dims modest (e.g., 64) preserves the memory advantage over KV caches at long contexts.
- Numerical stability: Time discretization improved small models a bit but led to divergence in medium models with TF32. Stable training often requires careful kernel precision; they dropped discretization in 2Mamba for stability.
- Custom kernels needed: Their efficient implementation relies on Triton kernels and careful online normalization. Plain PyTorch ops may be slower or less stable.
- Not universally above softmax: 2Mamba is “near” softmax. If you need every last bit of accuracy and have memory to spare, softmax (or 2Mamba-E) can still be preferable.
Required resources:
- GPU with support for efficient Triton kernels and at least TF32/FP32 precision.
- Enough VRAM to hold the per-head second-order state (manageable at 64 head dim, scales with head size and number of heads).
- Standard LLM training stack (AdamW, dataset streaming) as used in the paper.
When not to use it:
- Extremely small contexts (few hundred tokens) on large-memory hardware—plain softmax may be simpler and just as fast.
- If your deployment already depends on a KV cache infrastructure and you prefer its simplicity.
- If you need maximum peak accuracy and can afford the memory, 2Mamba-E or softmax might edge out 2Mamba.
Open questions:
- Adaptive state size: Can we learn the best second-order feature subset dynamically to shrink memory further?
- Mixing with other linear tricks: How much can DeltaNet-style updates or gating further boost 2Mamba?
- Precision-robust training: Can we design kernels that stay stable under even lower precision (e.g., FP8) without losing accuracy?
- Beyond language: How does 2Mamba perform in audio, vision, or multimodal tasks needing long context?
06 Conclusion & Future Work
3-sentence summary:
- The authors dissect Mamba-2, keep only the parts that really help (softplus A-mask and a tiny input convolution), and then add a squared query–key match to form 2Mamba.
- 2Mamba reaches near-softmax accuracy while keeping memory fixed with respect to sequence length, and it performs strongly on long-context retrieval.
- An exponentiated variant, 2Mamba-E, slightly outperforms softmax but needs a KV cache, linking this line of work to forget-gated transformers.
Main achievement:
- Showing that a carefully designed, second-order linear attention (with a learned decay) can rival softmax accuracy while preserving linear-time training and fixed-memory inference for long contexts.
Future directions:
- Learnable or adaptive second-order feature sets to reduce memory further.
- Combine with DeltaNet-like updates or new gating for extra accuracy.
- Explore broader domains (code, speech, multimodal) and even longer contexts.
- Improve kernel stability and performance under low-precision math.
Why remember this:
- It narrows the long-standing gap between fast linear attention and accurate softmax attention, offering a practical path to long-context LLMs on modest hardware without giving up much (or any) accuracy.
Practical Applications
- Build chat assistants that remember longer conversations without ballooning memory use.
- Summarize multi-thousand-token meeting transcripts on a single consumer GPU.
- Answer questions over long technical docs or PDFs without chunking into many small pieces.
- Navigate large code repositories, tracking definitions and references across files efficiently.
- Process long logs for anomaly detection or root-cause analysis with fixed memory per stream.
- Deploy on edge devices (laptops, Jetsons) where KV caches would exceed VRAM.
- Serve long-context LLMs in the cloud with lower and more predictable memory per request.
- Run retrieval-augmented generation with longer retrieved passages while keeping serving costs down.
- Do streaming transcription plus summarization over long audio transcripts without memory spikes.
- Fine-tune domain models (legal/medical) to read long case histories while staying within hardware limits.