Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
Key Summary
- Sparse autoencoders (SAEs) are popular for explaining what large language models are doing, but this paper shows they often don't learn real, meaningful features.
- On a toy dataset where the true features are known, SAEs rebuilt the data well (71% explained variance) but correctly found only 9% of the actual features.
- The authors created three simple "randomized" baselines (Frozen Decoder, Soft-Frozen Decoder, Frozen Encoder) and showed they perform about as well as fully trained SAEs on key tests.
- Across interpretability, sparse probing, and causal editing, random baselines score roughly 0.87-0.90, 0.69-0.70, and 0.57-0.78, matching or nearly matching trained SAEs.
- This means high reconstruction scores and nice-looking "features" do not prove that SAEs discovered the model's true building blocks.
- A "lazy training" effect seems to let SAEs do well with only tiny nudges to almost-random directions, not by learning real features.
- The paper provides easy sanity checks that future SAE methods should beat to claim real progress.
- Bottom line: today's SAEs can look good on standard metrics without actually revealing how the model thinks.
Why This Research Matters
We use interpretability tools to judge whether AI systems are safe, fair, and reliable, so we must know when those tools actually work. This paper shows that today's sparse autoencoders can look great on common tests while failing to find a model's true inner features. If teams rely on such signals for audits or safety edits, they might trust the wrong explanations. The provided random baselines are easy reality checks any lab can run before believing SAE results. By raising the standard of evidence, this work helps direct effort toward methods and metrics that reflect real mechanisms. In short, better sanity checks today mean more trustworthy AI decisions tomorrow.
Detailed Explanation
01 Background & Problem Definition
You know how when you clean your room, you put toys in one bin, books on a shelf, and clothes in a drawer so it's easier to find things later? People want to do the same with a big AI's thoughts: sort them into neat, human-understandable bins called features. That's the dream behind sparse autoencoders (SAEs).
🍞 Top Bread (Hook): Imagine a giant library where every book is mixed together into one messy pile. You wish you had labels like "mystery," "science," and "history" so you could find what you need fast. 🥬 The Concept: Sparse Autoencoders (SAEs) are tools that try to relabel a model's messy internal signals into a short list of clean, mostly-zero "feature knobs."
- What it is: An SAE turns a big, tangled activation into a few active features that should each mean one thing.
- How it works: (1) Read the modelâs activation; (2) Pick a few features to turn on; (3) Use those features to rebuild the activation; (4) Learn by making the rebuild closer to the original while keeping few features active.
- Why it matters: Without this, we just have a blur of numbers and can't tell what the model noticed, like trying to understand a library with no labels. 🍞 Bottom Bread (Anchor): If a model reads "Paris is in France," one ideal feature might light up for "city name," another for "country," and another for "location fact."
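The four steps above can be sketched in a few lines. This is a minimal illustrative sketch in NumPy (not the paper's implementation), using a TopK-style sparsity rule; all names and sizes are made up for the example, and the learning step is only described in a comment:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, k=4):
    """One forward pass of a TopK-style sparse autoencoder (illustrative only).

    x     : (d,) model activation to explain
    W_enc : (m, d) encoder weights for m latent features
    W_dec : (m, d) decoder weights, one "feature direction" per row
    k     : how many features may stay active (the sparsity budget)
    """
    pre = W_enc @ x + b_enc                 # (1) read the activation, score features
    cutoff = np.sort(pre)[-k]               # (2) keep only the k strongest scores
    z = np.where(pre >= cutoff, np.maximum(pre, 0.0), 0.0)
    x_hat = z @ W_dec                       # (3) rebuild from the active directions
    # (4) training (not shown) would adjust weights to shrink ||x - x_hat||^2
    return z, x_hat

rng = np.random.default_rng(0)
d, m = 16, 64
W_enc = rng.normal(size=(m, d)) / np.sqrt(d)
W_dec = rng.normal(size=(m, d)) / np.sqrt(d)
x = rng.normal(size=d)
z, x_hat = topk_sae_forward(x, W_enc, np.zeros(m), W_dec, k=4)
print((z > 0).sum(), x_hat.shape)           # few active features; same shape as x
```

Note how the reconstruction is a weighted sum of just a few decoder rows; those rows are the "feature knobs" the rest of the article scrutinizes.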
The world before: Researchers knew big models worked but didn't know what their inner parts meant. They hoped SAEs would split the inside into clear, named pieces, solving the problem of polysemantic neurons (one neuron mixing many ideas).
🍞 Top Bread (Hook): You know how a color can be a mix of red, green, and blue? Models can mix thousands of ideas in the same small space. 🥬 The Concept: The Superposition Hypothesis says models store more features than dimensions by stacking them as directions in the same space.
- What it is: Many features share the same space like arrows pointing in different directions.
- How it works: (1) Each feature is a direction; (2) Inputs turn some features up; (3) The sum gives the full signal; (4) Many more features exist than room to store them separately.
- Why it matters: If this is true, SAEs should be perfect tools to pull those directions apart. 🍞 Bottom Bread (Anchor): Like many radio stations broadcasting together, and you try to tune to just the one you want.
The problem: We don't have labels for the "true" features inside real models. So how do we know if an SAE really found them? People relied on proxy scores like "reconstruction fidelity."
🍞 Top Bread (Hook): Imagine copying a drawing. If your copy looks similar, that doesn't mean you understood how the original artist drew it. 🥬 The Concept: Reconstruction Fidelity measures how closely the SAE's rebuild matches the original activation.
- What it is: A score for how good the reconstruction looks.
- How it works: (1) SAE compresses; (2) SAE reconstructs; (3) Compare to the original; (4) Closer is better.
- Why it matters: It's easy to believe "great match = learned the right features," but that can be false. 🍞 Bottom Bread (Anchor): You can redraw a photo perfectly without knowing who or what is in it.
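Reconstruction fidelity is usually reported as explained variance. A small sketch of the computation in NumPy (the function name is ours, and the "reconstruction" here is just the original plus noise, to show the score in isolation):

```python
import numpy as np

def explained_variance(X, X_hat):
    """Fraction of the activations' variance captured by the reconstruction.

    1.0 = perfect copy; 0.0 = no better than predicting the mean.
    X, X_hat : (n_samples, d) originals and reconstructions.
    """
    residual = ((X - X_hat) ** 2).sum()
    total = ((X - X.mean(axis=0)) ** 2).sum()
    return 1.0 - residual / total

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))
noisy = X + 0.5 * rng.normal(size=X.shape)   # an imperfect "reconstruction"
ev = explained_variance(X, noisy)
print(round(ev, 2))
```

The key point of this section is that a high value of this number says nothing about whether the pieces used to rebuild X were the true ones.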
What people tried and where it fell short: Many SAE versions (ReLU, JumpReLU, BatchTopK) achieved pretty reconstructions and nice-looking "features," but later tests showed weak transfer, sensitivity to settings, and sometimes wrong features. Some "features" even appeared in totally random networks, suggesting we might be fooling ourselves.
The gap: We lacked hard sanity checks. If SAEs really discover features, they should beat simple random baselines.
Real stakes: If we use SAEs to audit safety, diagnose misalignment, or edit facts in models, we must be sure their "features" are real. Otherwise, we could trust a pretty map that doesn't match the territory.
🍞 Top Bread (Hook): In many real systems, a few events happen a lot, and many events happen rarely, like most songs getting few plays while a few are hits. 🥬 The Concept: Long-Tailed Activations mean a few features fire often while most fire rarely.
- What it is: A skewed pattern where some features are frequent and many are rare.
- How it works: (1) Data has common patterns (frequent features); (2) Also many rare patterns; (3) Training tends to fit frequent ones first; (4) Rare ones get ignored.
- Why it matters: If SAEs chase what happens most, they might miss the true but rare features. 🍞 Bottom Bread (Anchor): If you only practice the most common test questions, you may fail rare but important ones.
02 Core Idea
The "Aha!" in one sentence: If SAEs truly learn meaningful features, they must clearly beat simple random baselines; they don't.
Three analogies for the same idea:
- Magnets vs. chance: If your metal detector is good, it should find more coins than random digging; here, random digging finds almost as many coins as the detector.
- Doctor vs. guessing: If a doctor's tests work, they should diagnose better than guessing; here, guessing with limits does nearly as well.
- Map vs. doodle: A real city map should be more helpful than a doodle; here, the doodle helps almost as much.
Before vs. after:
- Before: High reconstruction and pretty "feature names" were taken as signs that SAEs discovered the model's true inner parts.
- After: Even SAEs with frozen random parts can score similarly, so those scores don't prove true discovery.
Why it works (intuition, no math):
- SAEs are trained to rebuild signals sparsely, not to match the model's actual hidden directions. With huge dictionaries (tens of thousands of latent features), random directions already cover many patterns surprisingly well. A little tuning can make them reconstruct nicely without aligning to true features. This is like rearranging puzzle pieces you already had, rather than carving pieces that match the original picture.
Building blocks explained with the Sandwich pattern:
🍞 Top Bread (Hook): You know how sometimes you barely change your homework but still get a good grade because the teacher's rubric fits what you already wrote? 🥬 The Concept: Lazy Training Regime means the model improves scores with only tiny tweaks from its random starting point.
- What it is: Training that mostly keeps weights near their initial directions.
- How it works: (1) Start with random features; (2) Make small adjustments to which ones turn on and by how much; (3) Donât move directions much; (4) Still reduce the loss a lot.
- Why it matters: Good scores might come from tiny nudges to random features, not from discovering true ones. 🍞 Bottom Bread (Anchor): A near-random set of "feature arrows" barely rotates, yet the SAE reconstructs well.
🍞 Top Bread (Hook): Imagine drawing a picture using a fixed box of crayons you can't change. 🥬 The Concept: Frozen Decoder baseline fixes the feature directions to random vectors and never lets them move.
- What it is: The âmeaningâ of each feature is stuck near random.
- How it works: (1) Start with random directions; (2) Only learn when to turn them on/off; (3) Reconstruct with that fixed set.
- Why it matters: If this nearly matches trained SAEs, then learned directions weren't crucial. 🍞 Bottom Bread (Anchor): Despite random crayons, the picture still looks decent.
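To build intuition for why a frozen random dictionary is a serious baseline, here is a small demo of our own construction (not the paper's training setup: instead of a learned encoder, it picks the random directions best aligned with each input and fits their magnitudes by least squares). Even purely random directions reconstruct generic data far better than chance once the dictionary is large:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, k = 32, 512, 8

# The frozen "box of crayons": random unit-norm directions, never updated.
D = rng.normal(size=(m, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)

def frozen_decoder_reconstruct(x, D, k):
    """Turn on the k frozen directions best aligned with x (a stand-in for a
    learned encoder) and fit their magnitudes by least squares."""
    idx = np.argsort(-np.abs(D @ x))[:k]
    coef, *_ = np.linalg.lstsq(D[idx].T, x, rcond=None)
    return coef @ D[idx]

X = rng.normal(size=(200, d))
resid = sum(np.sum((x - frozen_decoder_reconstruct(x, D, k)) ** 2) for x in X)
ev = 1.0 - resid / np.sum((X - X.mean(axis=0)) ** 2)
print(round(ev, 2))  # well above the ~0.25 a fixed random 8-dim subspace would give
```

Because selection adapts to each input while the directions stay random, a big enough dictionary "covers" most inputs. This is the coverage effect the article keeps returning to.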
🍞 Top Bread (Hook): Now imagine you can wiggle each crayon a tiny bit but must keep it very close to its original color. 🥬 The Concept: Soft-Frozen Decoder keeps directions close to their random starts (high cosine similarity).
- What it is: Directions can move a little, but not much.
- How it works: (1) Initialize randomly; (2) Nudge within a tight cone; (3) Train activations and small direction tweaks; (4) Reconstruct well.
- Why it matters: If this almost equals a fully trained SAE, we're seeing lazy training in action. 🍞 Bottom Bread (Anchor): Even with tiny color shifts, the drawing comes out strong.
🍞 Top Bread (Hook): Picture a keyboard where which keys light up is decided at the start and never changes. 🥬 The Concept: Frozen Encoder fixes which inputs activate which features at random; only thresholds and decoder are learned.
- What it is: Activation patterns are predetermined.
- How it works: (1) Random projections choose when features fire; (2) Train thresholds and decoder to reconstruct; (3) Keep activations fixed in direction.
- Why it matters: If this works well, then learned activation patterns weren't essential. 🍞 Bottom Bread (Anchor): Keys set at random still play many familiar tunes.
Put together: If all three random-style baselines score close to full SAEs on key tests, then current SAEs don't prove they've found real features, only that they can reconstruct and produce appealing artifacts under generous conditions.
03 Methodology
At a high level: Inputs (either synthetic activations or real model activations) → Build/train SAEs and randomized baselines → Evaluate on reconstruction, interpretability, probing, and causal editing → Compare to see if learned SAEs beat random baselines.
There are two big recipes: a toy world where we know the true answer, and a real-world test in LLMs where we do not.
Part A: Toy world with known ground truth
- Data we make on purpose
- What happens: We create synthetic activations in a 100-dimensional space by mixing a hidden dictionary of 3,200 true feature directions. Each sample turns on about 20 true features with sizes drawn from a log-normal, under two regimes: constant-frequency and long-tailed frequency.
- Why this step exists: It gives us a clear "answer key" to check whether SAEs actually recover the true features.
- Example: Think of 3,200 colored arrows in 100-D space; each data point is a sparse sum of ~20 arrows.
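The generator described above can be sketched as follows. The dimensionality, dictionary size, and active-feature count follow the paper; the Zipf-style frequency law and the log-normal parameters are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_true, k_active = 100, 3200, 20

# Ground-truth dictionary: 3,200 random unit-norm "arrows" in 100-D space.
true_feats = rng.normal(size=(n_true, d))
true_feats /= np.linalg.norm(true_feats, axis=1, keepdims=True)

# Long-tailed firing frequencies: a few features are common, most are rare.
freq = 1.0 / np.arange(1, n_true + 1)   # assumed Zipf-like tail; the paper's law may differ
freq /= freq.sum()

def sample_activation():
    """One synthetic activation: a sparse sum of ~20 true features with
    log-normal magnitudes, mirroring the toy setup in spirit."""
    idx = rng.choice(n_true, size=k_active, replace=False, p=freq)
    mags = rng.lognormal(mean=0.0, sigma=1.0, size=k_active)
    return mags @ true_feats[idx], idx

x, active = sample_activation()
print(x.shape, len(set(active)))
```

Because every sample records which true features were active, recovery can be scored exactly, which is impossible with real LLM activations.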
- Train SAEs that people like to use
- What happens: We train BatchTopK and JumpReLU SAEs with a dictionary size that matches the true one (3,200) and target sparsity ~20.
- Why this step exists: Uses state-of-the-art choices so the test is fair and realistic.
- Example: It's like giving two star students the same test to see if they really understand the material.
- Measure two things: How close is the copy? Did we find the real features?
- What happens: • Reconstruction Fidelity (Explained Variance): How well the SAE's reconstruction matches the original activation on average. • Feature Recovery via Cosine Similarity: For each true feature, find the nearest learned feature (cosine similarity); count as recovered if the similarity is high.
- Why this step exists: Good reconstruction doesn't guarantee you found the true pieces; we need both scores.
- Example: A traced drawing can look great (reconstruction) but doesnât prove you learned how to draw the parts (recovery).
🍞 Top Bread (Hook): Comparing two arrows by the angle between them is like checking if two people are pointing the same way. 🥬 The Concept: Cosine Similarity for Feature Recovery measures how aligned learned features are with ground-truth features.
- What it is: A number from -1 to 1; closer to 1 means more aligned.
- How it works: (1) Normalize both arrows; (2) Take their dot product; (3) Pick the best match for each true feature; (4) Count high matches as recovered.
- Why it matters: It directly tests if the SAE found the real directions. 🍞 Bottom Bread (Anchor): If the true arrow points northeast and your learned one does too, the score is high.
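A possible implementation of the recovery check described in the four steps above (the 0.9 cutoff is an assumption; the paper may use a different threshold):

```python
import numpy as np

def recovery_rate(true_feats, learned_feats, threshold=0.9):
    """Fraction of ground-truth features that have a well-aligned learned
    feature. threshold is an assumed cutoff for counting a feature 'recovered'."""
    T = true_feats / np.linalg.norm(true_feats, axis=1, keepdims=True)
    L = learned_feats / np.linalg.norm(learned_feats, axis=1, keepdims=True)
    sims = np.abs(T @ L.T)        # (n_true, n_learned) cosine similarities
    best = sims.max(axis=1)       # nearest learned feature for each true one
    return float((best >= threshold).mean())

rng = np.random.default_rng(4)
true_feats = rng.normal(size=(50, 100))
# A "learned" dictionary that found 10 of the 50 true directions exactly:
learned = np.vstack([true_feats[:10], rng.normal(size=(40, 100))])
rate = recovery_rate(true_feats, learned)
print(rate)  # 0.2: only the 10 copied directions count as recovered
```

This metric can be low even when reconstruction is high, which is exactly the dissociation the toy experiments expose.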
Part B: Real LLM activations where ground truth is unknown
- Collect activations
- What happens: Use residual stream activations from Gemma-2-2B (layers 12 and 19) and Llama-3-8B (layer 16); expansion factor ~32; large dictionaries with tens of thousands of latents.
- Why this step exists: Test methods on realistic settings used by practitioners.
- Example: Like checking a tool on real homework, not just worksheets.
- Train three randomized baselines plus normal SAEs
- What happens: Build BatchTopK, JumpReLU, and classic ReLU SAEs. For each, also make Frozen Decoder, Soft-Frozen Decoder (keep directions near their random starts), and Frozen Encoder versions. Use identical data, training budget, and initializations.
- Why this step exists: This is the key "null test": if learning real features matters, trained SAEs should beat random baselines clearly.
- Example: If a good chef can't outperform a microwave meal, something's wrong with our test of cooking skill.
- Evaluate four dimensions
- Reconstruction Fidelity (Explained Variance): Do reconstructed activations look close to originals?
- Latent Interpretability via AutoInterp: Are feature names predictable from top-activating sequences?
- Sparse Probing: Can a tiny linear probe on a few high-activation latents predict concepts (e.g., sentiment, topics)?
- Causal Editing (RAVEL): Can we flip a target attribute (e.g., a city's country) while keeping others steady?
- Why this step exists: These are standard, widely trusted signals of "good SAEs."
- Example: Like testing a bike on smooth roads (reconstruction), asking a friend to describe it (interpretability), doing cone drills (probing), and trying a tricky turn (causal editing).
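Sparse probing, the third evaluation above, can be sketched in its simplest top-1-latent form. The planted-signal setup below is ours, just to show the mechanic; note that with tens of thousands of random latents, some latent will correlate with a concept purely by chance, which is the worry the baselines make concrete:

```python
import numpy as np

def top1_latent_probe(Z, y):
    """Sparse probing, simplest form: find the one latent whose mean activation
    differs most between classes, then classify by a midpoint threshold.
    Z : (n_samples, n_latents) SAE activations; y : (n,) binary labels."""
    gap = np.abs(Z[y == 1].mean(axis=0) - Z[y == 0].mean(axis=0))
    j = int(np.argmax(gap))                         # best single latent
    thresh = (Z[y == 1, j].mean() + Z[y == 0, j].mean()) / 2
    sign = 1 if Z[y == 1, j].mean() > thresh else -1
    preds = (sign * (Z[:, j] - thresh) > 0).astype(int)
    return j, (preds == y).mean()

rng = np.random.default_rng(5)
n, m = 400, 1000
y = rng.integers(0, 2, size=n)
Z = rng.normal(size=(n, m))
Z[:, 123] += 2.0 * y                                # plant one concept-linked latent
j, acc = top1_latent_probe(Z, y)
print(j, round(acc, 2))
```

Try removing the planted signal: the probe will still latch onto whichever random latent happens to correlate best with y, often scoring above chance, which is why high probing accuracy alone is weak evidence.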
The secret sauce: Tight, simple baselines that lock parts of the SAE near random. If these do almost as well as trained SAEs, then the usual metrics aren't strong evidence of true feature discovery. A special twist is the Soft-Frozen Decoder, which probes the "lazy training" idea: even tiny direction changes can yield strong scores with huge dictionaries, suggesting good performance doesn't require learning the real features.
04 Experiments & Results
The test: two kinds of experiments answer one big question: do SAEs discover true features or just score well?
- Toy world: We know the real features, so we can directly check recovery.
- Real LLM world: We don't know the real features, so we use standard proxy metrics and compare to random baselines.
The competition: Fully trained SAEs (BatchTopK, JumpReLU, ReLU) versus three randomized baselines (Frozen Decoder, Soft-Frozen Decoder, Frozen Encoder).
Scoreboard with context:
- Toy world (ground truth known)
- Reconstruction (Explained Variance): About 0.67-0.71. That sounds like a solid B+ on the "copy the picture" test.
- True feature recovery: Only about 7-9% of features recovered in the realistic long-tailed setting. That's like finding 7 to 9 correct puzzle pieces out of 100, even though your overall drawing looks good.
- Pattern: SAEs mostly grab the most common features and miss the long tail, showing reconstruction ≠ true discovery.
- Real LLM world (Gemma-2-2B and Llama-3-8B)
- Reconstruction Fidelity: • Fully trained JumpReLU ≈ 0.85 EV (e.g., Gemma-2-2B L12, L0=160). • Soft-Frozen Decoder ≈ 0.79 EV, very close, like scoring 85 vs. 79 in class and still near the top. • Even Frozen Decoder/Encoder reach ~0.58-0.60 in some settings, surprisingly strong for random directions.
- Latent Interpretability (AutoInterp): • Fully trained often around 0.90. • Soft-Frozen Decoder around 0.88, nearly tied. • Even Frozen Encoder shows a notable chunk with scores above 0.7. • Translation: Many "features" get readable names even when the directions or activation patterns started random.
- Sparse Probing (top-1 latent): • Fully trained around 0.72. • Frozen Decoder about 0.69-0.70; Frozen Encoder ~0.65. That's like an A- versus a B+/B, very close. • Likely reason: With tens of thousands of random latents, some will correlate with each concept just by chance.
- Causal Editing (RAVEL): • Fully trained around 0.72. • Frozen Decoder ~0.57-0.62; Frozen Encoder ~0.63; Soft-Frozen often matches or beats, up to ~0.78. • Message: Targeted edits can work well even without learned feature alignment, likely due to broad coverage by many semi-random directions.
Surprising findings:
- "Pretty features" can appear without real learning. AutoInterp shows readable concepts even for frozen baselines.
- Soft-Frozen often matches fully trained models. Tiny near-random tweaks plus huge dictionaries are enough to shine on standard metrics.
- Reconstruction is not proof of truth. High explained variance coexists with very poor ground-truth recovery in toy tests.
05 Discussion & Limitations
Limitations
- Synthetic independence: The toy data makes features independent. Real models have correlations. But if SAEs already fail here, adding complexity isn't likely to fix the core mismatch between reconstruction and true feature recovery.
- Focused scope: The study tested leading SAE variants, not transcoders or crosscoders. Matching random baselines for those methods is future work.
Required resources
- Trained SAEs and large random baselines need GPU time, big activation datasets (hundreds of millions of tokens), and careful hyperparameter control. The authors promise to open-source code and configs to help replication.
When not to use current SAEs
- High-stakes safety judgments where "is this really the model's circuit?" matters.
- Claims about discovering exact features across the activation space, especially rare ones.
- Situations where interpretability labels must reflect ground-truth mechanisms, not just descriptive clusters.
Open questions
- Better objectives: What training goals reward matching true features rather than just reconstructing?
- Stronger tests: Which benchmarks can clearly separate real discovery from random coverage (e.g., adversarial tests, out-of-distribution checks, causal ground-truth tasks)?
- Rare feature recovery: How can we ensure long-tail features are captured and validated?
- Beyond lazy training: Can we design architectures or curricula that force moving away from near-random directions?
- Cross-method comparisons: Do transcoders, crosscoders, or other techniques beat these baselines decisively?
06 Conclusion & Future Work
Three-sentence summary
- This paper asks whether sparse autoencoders truly learn a model's real features or just score well on easy checks.
- In toy data with known answers, SAEs rebuild nicely but recover only a small fraction of the real features; in real LLM data, simple random baselines match or nearly match trained SAEs on interpretability, sparse probing, and causal editing.
- Therefore, today's standard SAE metrics are not reliable proof of meaningful feature discovery.
Main achievement
- The authors provide clear, easy-to-run sanity checks (frozen and soft-frozen baselines) that reveal how much of SAE performance can come from near-random components plus minimal tuning, not from finding true features.
Future directions
- Invent training targets that explicitly reward alignment to true mechanisms (not just reconstruction), design tests that require discovering rare/causal features, and develop methods that robustly beat the random baselines across many layers, models, and tasks.
Why remember this
- Because it changes the burden of proof: if an interpretability method claims to reveal a model's inner features, it should first beat these simple random baselines. Until then, good-looking scores and neat "feature names" might be convincing illusions rather than windows into how the model really thinks.
Practical Applications
- Add frozen and soft-frozen baselines to every SAE training run as a required sanity check.
- Gate deployments: require new SAE methods to beat the random baselines by a clear margin on multiple layers and models.
- Use the toy-data protocol with known ground truth to validate new objectives or architectures before moving to LLMs.
- Adopt stricter evaluation: report reconstruction, interpretability, sparse probing, and causal editing next to random baselines.
- Investigate lazy training: monitor cosine drift of decoder directions over time to detect near-random solutions.
- Prioritize rare-feature tests: design evaluations that reward capturing long-tail activations, not just frequent ones.
- Use baseline-beating as a research triage tool: only scale up methods that decisively outperform frozen baselines.
- In safety audits, treat SAE-derived claims as tentative unless they surpass baselines and pass causal ground-truth checks.
- Tune expansion and sparsity with baseline comparisons to avoid illusory gains from dictionary size alone.
- Cross-validate with alternative methods (e.g., transcoders) and require consistency beyond random baselines.
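The cosine-drift check suggested in the list above can be implemented in a few lines. A sketch under our own naming (W_init and W_now stand for hypothetical decoder snapshots at initialization and after training):

```python
import numpy as np

def decoder_drift(W_init, W_now):
    """Cosine similarity between each decoder direction and its initialization.
    Values near 1.0 across the dictionary suggest a lazy-training solution:
    the 'learned' directions barely moved from their random starts."""
    a = W_init / np.linalg.norm(W_init, axis=1, keepdims=True)
    b = W_now / np.linalg.norm(W_now, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

rng = np.random.default_rng(6)
W0 = rng.normal(size=(100, 64))
W_lazy = W0 + 0.05 * rng.normal(size=W0.shape)    # tiny nudges only
W_moved = rng.normal(size=W0.shape)               # genuinely new directions

lazy_cos = decoder_drift(W0, W_lazy).mean()
moved_cos = decoder_drift(W0, W_moved).mean()
print(round(lazy_cos, 2), round(moved_cos, 2))    # near 1.0 vs. near 0.0
```

Logging this one number over training is a cheap way to tell whether an SAE's good scores come with real movement in feature directions or without it.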