Transformers converge to invariant algorithmic cores
Key Summary
- Different transformers may have very different weights, but they often hide the same tiny "engine" inside that actually does the task.
- This paper calls that tiny engine an algorithmic core: a small subspace of the model that is both necessary and sufficient for good performance.
- Across three cases (Markov chains, modular addition, and GPT-2 subject-verb agreement), the cores are low-dimensional and share the same behavior across independently trained models.
- For Markov chains, each model's 3D core encodes the same transition dynamics (same eigenvalue spectrum), even when their larger weight spaces are nearly orthogonal.
- For modular addition, the core forms right at grokking, revealing a rotational (clock-like) mechanism; with continued weight decay, the core inflates by adding redundant copies.
- For GPT-2 Small/Medium/Large, a single 1D axis in late layers controls singular/plural agreement; flipping this axis flips grammar choices in free text.
- Cores are causally validated: keeping only the core preserves performance (sufficiency), removing it breaks performance (necessity), and flipping it can steer outputs.
- This suggests a new way to do interpretability: focus on simple, shared invariants (the cores) rather than messy, run-specific details.
- It also hints at practical control: if a behavior is governed by a small core, we can steer it more safely and predictably.
- Mechanistic interpretability may be most effective in a window after grokking and before later training spreads computation redundantly across many dimensions.
Why This Research Matters
If many different models hide the same tiny engine for a task, then explanations built on that engine can generalize across versions, seeds, and scales. That makes interpretability sturdier and more useful for safety and debugging. Low-dimensional cores are also practical: small targets are easier to monitor, test, and steer at inference time. By identifying when a behavior depends on a 1D or 3D core, we can design precise interventions (like flipping an axis) rather than blunt hacks (like blocking words). Cores can signal model health: if the expected invariant core disappears, something is wrong. Finally, cores may reveal a model's internal world models, giving us windows into what it has actually learned about the structure of its data.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine three friends each build a different-looking robot to add numbers on a clock. On the outside, the robots look totally different, but inside, each has a tiny spinning gear that actually does the adding. The big parts vary, but the tiny gear is the same idea.
The Concept: Functional equivalence. What it is: many different internal "wirings" can produce the same outward behavior. How it works: training pushes a model to behave correctly, not to use any one special internal setup. Why it matters: if we explain one model's wiring, will that explanation work for a different training run? Maybe not, unless we find what's shared.
Anchor: Two separately trained calculators both give 7 when asked "3+4," even if their chips are arranged totally differently.
Hook: You know how a big orchestra can play a symphony, but the melody might be carried by just a few instruments? If you muted everything else, you could still recognize the song.
The Concept: Transformers. What it is: a type of AI that predicts the next token in a sequence using attention to focus on important parts. How it works: it turns words into vectors, attends across positions to mix information, passes through tiny neural nets, and repeats across layers. Why it matters: they do lots of tasks well, but their internals are big and tangled, hard to understand.
Anchor: When you ask a transformer "The capital of France is...," it focuses on "France" and picks "Paris."
Hook: Think of LEGO blocks snapping together to build shapes. Math has its own "blocks" for building shapes of data.
The Concept: Linear algebra (vectors and matrices). What it is: the math of directions (vectors) and transformations (matrices). How it works: a matrix stretches, rotates, or squishes vectors. Why it matters: inside transformers, activations are vectors and layers act like learned transformations.
Anchor: A 2D vector like (3, 4) can be stretched by a matrix to become a new vector like (6, 8).
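The anchor in code, as a quick numpy check:

```python
import numpy as np

# The anchor: a matrix that stretches every vector by a factor of 2.
M = np.array([[2.0, 0.0],
              [0.0, 2.0]])
v = np.array([3.0, 4.0])
print(M @ v)  # [6. 8.]
```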
Hook: Packing a suitcase: rolling clothes saves space but keeps what you need.
The Concept: Dimensionality reduction. What it is: squeezing data into fewer directions while keeping what matters. How it works: we find a small subspace where the important action happens. Why it matters: models are high-dimensional; if the key computation lives in a tiny subspace, we can study and control just that part.
Anchor: If a 100D model really uses only 3D for a task, we can focus on those 3 dimensions and still do the job.
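The anchor's claim is easy to demonstrate on synthetic data: if 100D data secretly lives in a 3D subspace, an SVD finds it, and three directions capture essentially all of the variance.

```python
import numpy as np

# 100-dimensional data that secretly lives in a 3D subspace (synthetic).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))           # the "real" 3D signal
mixing = rng.normal(size=(3, 100))           # embed it into 100 dims
X = latent @ mixing + 1e-3 * rng.normal(size=(500, 100))

# Singular values of the centered data: the top 3 carry ~all variance.
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
frac = (s[:3] ** 2).sum() / (s ** 2).sum()
print(frac)  # very close to 1
```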
Hook: Think of a spinning top that keeps the same spin speed but changes direction: it's doing a rotation.
The Concept: Eigenvalues and eigenvectors. What it is: special directions (eigenvectors) that a transformation scales or rotates by a factor (eigenvalue). How it works: we analyze a matrix by its eigenvalues to understand its dynamics (growth, decay, oscillation). Why it matters: the paper identifies task dynamics (like cycles) by looking at eigenvalues of fitted operators in the core.
Anchor: A pure rotation in 2D keeps length the same; its eigenvalues have magnitude 1 (on the unit circle), signaling rotation.
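A quick numpy check of the anchor (illustrative, not from the paper):

```python
import numpy as np

# A 90-degree rotation preserves lengths, so both of its eigenvalues
# lie on the unit circle: they are the complex pair +/-i, with |lambda| = 1.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
eigvals = np.linalg.eigvals(R)
print(np.abs(eigvals))  # both magnitudes are 1.0
```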
Hook: Reading a story where each sentence depends mostly on the last one, like a treasure hunt where the next clue depends on your current location.
The Concept: Markov chains. What it is: systems where the next state depends only on the current state's probabilities. How it works: a transition matrix tells you the chance of jumping from one token to another. Why it matters: a simple testbed to see if different transformers learn the same internal "core" dynamics.
Anchor: From "sunny" today, there might be a 75% chance tomorrow is "sunny" again and 25% chance "cloudy."
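The anchor's weather chain can be written down and analyzed directly. A small numpy sketch (the cloudy row is made up for illustration) shows the eigenvalue structure the paper later recovers from model cores:

```python
import numpy as np

# The anchor's weather chain as a transition matrix; the cloudy row is
# made up for illustration. Rows sum to 1.
P = np.array([[0.75, 0.25],   # sunny  -> (sunny, cloudy)
              [0.50, 0.50]])  # cloudy -> (sunny, cloudy)

# Every stochastic matrix has eigenvalue 1; the remaining eigenvalues
# set how fast the chain forgets its current state (mixing speed).
eigvals = np.linalg.eigvals(P)
print(sorted(eigvals.real, reverse=True))  # eigenvalues 1.0 and 0.25
```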
Hook: You know how 14:00 on a 12-hour clock is 2:00? The numbers wrap around.
The Concept: Modular addition. What it is: adding with wrap-around at a number p. How it works: compute (a + b) mod p by adding, then subtracting multiples of p until the result lands in {0, ..., p-1}. Why it matters: transformers "grok" this task and learn a cyclic (rotating) mechanism in a tiny core.
Anchor: 14 + 5 ≡ 7 (mod 12), since 14+5=19 and 19-12=7.
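The wrap-around rule in code, checked against the anchor:

```python
# Clock arithmetic: add, then wrap around the modulus p.
def add_mod(a, b, p):
    return (a + b) % p

print(add_mod(14, 5, 12))  # 7, since 14 + 5 = 19 and 19 - 12 = 7
```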
Hook: Imagine a single sliding dimmer that controls a room's vibe from "singular" to "plural."
The Concept: Subject-verb agreement. What it is: choosing verbs that match the subject's number (singular/plural). How it works: a hidden number signal steers whether the model picks "is/was" or "are/were." Why it matters: finding the single axis that flips this choice is strong evidence of a tiny, steerable core.
Anchor: Projecting the hidden state onto one direction predicts whether "are" will beat "is."
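As a toy illustration, one way to read off a number preference is a margin over verb logits. The exact readout is an assumption here, not the paper's formula:

```python
# A hypothetical agreement readout (the paper's exact margin definition
# is not reproduced here): plural verb logits minus singular verb logits.
def agreement_margin(logits):
    return (logits["are"] + logits["were"]) - (logits["is"] + logits["was"])

# Positive margin -> the model leans plural; negative -> singular.
print(agreement_margin({"are": 3.0, "were": 2.0, "is": 4.0, "was": 1.0}))
```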
Hook: If two cameras film the same play from different angles, the story is the same even though the pixels differ.
The Concept: Algorithmic cores. What it is: small subspaces inside a model that are both necessary and sufficient for a task and are shared across different training runs. How it works: extract the most active-and-relevant directions, validate by ablations, and study their dynamics. Why it matters: they are invariants, stable across runs, so explanations based on cores can generalize.
Anchor: Three transformers with different weights compress to the same 3D core that reproduces the same Markov-chain spectrum.
Hook: Picture a class that suddenly "gets it" after struggling: test scores jump all at once.
The Concept: Grokking. What it is: when a model goes from memorizing the training set to truly generalizing, often suddenly. How it works: after loss is near zero, regularization and redundancy push the model toward a simple, general mechanism. Why it matters: in modular addition, the algorithmic core snaps into a clean rotational operator exactly at grokking.
Anchor: Test accuracy stays low for a long time, then spikes to perfect when the cyclic core forms.
Hook: Think of a gentle tug that keeps pulling all weights smaller unless they're needed.
The Concept: Weight decay. What it is: a training rule that nudges parameters toward zero. How it works: it adds a penalty on large weights, encouraging simpler models. Why it matters: after grokking, it can spread (redistribute) the solution across many redundant modes, inflating the core.
Anchor: Keep weight decay on after solving modular addition and you see more and more rotational modes appear in the core.
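A bare-bones illustration of the decay dynamics (the learning rate and decay values are made up): a weight that receives no task gradient shrinks geometrically under weight decay.

```python
# Weight decay in isolation: each update subtracts lr * wd * w, so a
# weight with no task gradient shrinks geometrically toward zero.
# (lr and wd values here are made up for illustration.)
lr, wd = 0.1, 0.5
w_unused = 1.0
for _ in range(100):
    w_unused -= lr * wd * w_unused
print(w_unused)  # (1 - 0.05)**100, roughly 0.006
```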
Hook: If two choirs sing the same tune, we can compare their harmonies without matching every singer's position.
The Concept: Canonical Correlation Analysis (CCA). What it is: a way to compare two representational spaces by finding directions that correlate most. How it works: it pairs up directions to reveal shared signals even if geometries differ. Why it matters: cores from different seeds show near-perfect CCA, proving shared content despite geometric misalignment.
Anchor: Even when core subspaces are almost orthogonal, their CCA correlations are about 0.98–0.99: nearly identical signals.
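A minimal numpy CCA (illustrative, not the paper's implementation): the canonical correlations are the singular values of the product of orthonormal bases for the two centered data matrices. A shared latent signal yields a top correlation near 1 even though the two spaces use different coordinates.

```python
import numpy as np

def cca_correlations(X, Y):
    """Canonical correlations between two data matrices (rows = samples)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases for each column space, via thin SVD...
    Ux = np.linalg.svd(X, full_matrices=False)[0]
    Uy = np.linalg.svd(Y, full_matrices=False)[0]
    # ...then the correlations are the singular values of their product.
    return np.linalg.svd(Ux.T @ Uy, compute_uv=False)

# Two 2D "representations" that share one latent signal but use
# completely different coordinates (synthetic example).
rng = np.random.default_rng(0)
z = rng.normal(size=500)
X = np.outer(z, [1.0, -2.0]) + 0.1 * rng.normal(size=(500, 2))
Y = np.outer(z, [3.0, 0.5]) + 0.1 * rng.normal(size=(500, 2))
print(cca_correlations(X, Y)[0])  # near 1: same signal, different geometry
```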
Putting it together: Before this paper, many interpretability stories focused on specific circuits that might not repeat across trainings. The big gap was: how do we find what stays the same? This work shows that transformers often pack their task logic into tiny, invariant algorithmic cores (simple, shared engines), giving us a sturdier foundation for understanding and control.
02 Core Idea
Hook: You know how different musicians can play the same song on different instruments, but the melody is the same? If you write down that melody, you can recognize the song anywhere.
The Concept: The "Aha!" in one sentence: Despite different weights and wiring, transformers converge to the same small, causal subspaces (algorithmic cores) that carry the task's real computation.
How it works (recipe intuition):
- Find where activations are most active and also most relevant to the output.
- Take just that small subspace (the core) and test it: keep only the core (should still work), or remove it (should break the task).
- Fit simple dynamics inside the core (like a tiny linear operator) and read its eigenvalues to see what algorithm it runs (e.g., rotation for modular addition).
- Repeat across independently trained models; align and compare cores to check if the same structure reappears.
Why it matters: If explanations live in these cores, they generalize across random seeds and even scales, giving us simpler, sturdier handles for understanding and steering models.
Anchor: Three 64D one-layer transformers shrink to the same 3D core that recreates the Markov transition spectrum; three 2-layer modular-addition transformers snap to a rotational core at grokking; GPT-2 Small/Medium/Large each show a 1D axis that flips singular/plural everywhere.
Multiple analogies (three ways):
- Music: Different bands, same melody (the core). The arrangement (full weights) can change wildly; the recognizable tune (algorithm) persists.
- Maps: Two maps can use different colors and fonts (weights) but still describe the same city layout (core structure). You navigate by the invariants.
- Engines: Cars look different, but the crankshaft (core) converts up-down motion to rotation the same way. Swap the paint; the engine idea stays.
Before vs. After:
- Before: Interpretability often traced detailed circuits in one model and hoped they generalized.
- After: We can extract cores that are causally necessary and sufficient, repeat across runs, and admit compact dynamical stories (like recovered spectra or rotations). The same idea works from toy tasks to GPT-2 grammar.
Why it works (intuition, no equations):
- Training selects for behavior, not specific wiring, so many weight patterns land on the same function (degeneracy). Among these, there is a minimal internal story: the smallest subspace you must keep to get the behavior. That subspace is observable by looking for directions both driven by inputs and used to make outputs (active + relevant). When you zoom into that subspace, the computation is often simple (like a rotation) and reveals the task's structure (like a Markov chain's spectrum).
Building blocks:
- Active directions: parts of the hidden state that actually move when inputs vary a lot.
- Relevant directions: parts the model truly uses to set logits for the task at hand.
- Core extraction: find directions that are both active and relevant; keep just those.
- Causal tests: keep-only (sufficiency) and remove-only (necessity) ablations.
- Operator fitting: in the tiny core, fit a simple linear step and look at eigenvalues for dynamics (e.g., rotations on the unit circle mean cycles for modular arithmetic).
- Cross-model alignment: even when cores sit at different geometric angles, CCA shows they carry the same signal, proving invariance.
Tiny math peeks with examples:
- Modular addition wraps around: (a + b) mod p. Example: 18 + 7 ≡ 1 (mod 24), since 18+7=25 and 25-24=1.
- Rotation signals cycles: if an eigenvalue has |λ| = 1, it's on the unit circle. Example: λ = e^(iθ) has |λ| = 1, so it represents pure rotation.
- Flipping along a core axis reflects activations: x' = x - 2(u·x)u. Example: let u = (1, 0) and x = (3, 4); then u·x = 3, so x' = (3, 4) - 6·(1, 0) = (-3, 4), which flips the first coordinate.
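The flip formula can be checked numerically; a small sketch, not the paper's code:

```python
import numpy as np

def flip_along(x, u):
    """Reflect x across the hyperplane orthogonal to u: x' = x - 2(u.x)u."""
    u = u / np.linalg.norm(u)
    return x - 2 * np.dot(u, x) * u

x = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])
print(flip_along(x, u))  # [-3.  4.]: the component along u is negated
```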
Bottom line: The melody (the invariant core) is the real computation. Find it, test it, and you can explain and even steer the model across different trainings and sizes.
03 Methodology
At a high level: Input → collect hidden activations and output sensitivities → extract a small active-and-relevant subspace (the core) → verify with causal ablations → fit simple dynamics inside the core → compare and align cores across runs.
Step-by-step recipe (ACE: Algorithmic Core Extraction):
1. Choose a task and a layer. Example: subject-verb agreement in GPT-2, using a late layer where agreement decisions happen.
- Why this step: The core lives where the task's decision gets finalized; the wrong layer means you'll see noise or partial information.
- Example data: 1,200 prompts balanced between singular/plural.
2. Gather activations (activity). Collect the mean-centered hidden states h at the chosen layer over many inputs.
- Why: Active directions are the ones that actually change with inputs; dead directions can't be the core.
- Example: For GPT-2 Small, collect the final-token hidden vector for each prompt, then subtract the mean.
3. Measure output relevance (sensitivity). Define a task-relevant function m(h) (e.g., the plural vs. singular logit margin), and compute or estimate its Jacobian with respect to h (how changing h nudges outputs).
- Why: A direction that doesn't affect the task output can't be part of the causal core.
- Example: For agreement, take m(h) = (logit_are + logit_were) - (logit_is + logit_was). Example value: if logits are (are=3.0, were=2.0, is=4.0, was=1.0), then m = (3.0 + 2.0) - (4.0 + 1.0) = 0.
4. Find joint active-and-relevant directions. Compute the top interaction directions between activity and relevance (conceptually akin to balanced reduction). Keep the top-r directions as the core.
- Why: This targets where input variation flows into output decisions.
- Example: In GPT-2, a huge spectral gap often leaves just a single axis.
5. Build the projector onto the core. Let V be an orthonormal basis for the core; then P = V V^T projects any activation onto the core.
- Why: We need P to do core-only (keep) and core-removed (drop) ablations.
- Example: If V = (1, 0)^T, then P projects a 2D vector onto the x-axis: P(3, 4) = (3, 0).
6. Causal validation via ablations.
- Core-only (sufficiency): Replace h by Ph. If performance stays the same, the core alone is enough.
- Core-removed (necessity): Replace h by (I - P)h. If performance drops to chance, the core was required.
- Why: This is the gold-standard test that the core really causes the behavior.
- Example: In Markov models, core-only accuracy stays near optimal; core-removed accuracy falls to chance.
7. Fit simple dynamics inside the core. Project hidden states into core coordinates z = V^T h. Fit a linear operator A to predict the next step (for sequences) or to model "shift-by-1" across classes (for modular addition centroids).
- Why: The eigenvalues of A reveal the algorithm (e.g., rotations for cyclic tasks, Markov mixing rates for chains).
- Example: Unit-circle eigenvalues (|λ| = 1) indicate rotation; e.g., λ = e^(2πi/p) has |λ| = 1.
8. Compare across runs and scales. Align cores across independently trained models (e.g., via CCA). Check whether cores carry the same signal (high correlation) even if their geometric angles differ (near-orthogonal subspaces).
- Why: This tests invariance, which is the whole point.
- Example: Three Markov transformers: projector overlap ≈ 0.02–0.04 (geometrically different), CCA correlations ≈ 0.98–0.99 (statistically the same).
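The recipe above can be sketched end to end on synthetic data. This is a minimal illustration of the active-and-relevant idea and the two ablations, with made-up activations and a hard-coded readout; it is not the paper's ACE implementation.

```python
import numpy as np

# Synthetic sketch of the recipe (made-up data, not the paper's ACE):
# hidden states vary along every axis, but a fixed linear readout w only
# uses axis 0, so that axis is the 1D core.
rng = np.random.default_rng(1)
n, d = 2000, 16
H = rng.normal(size=(n, d)) * np.linspace(3.0, 0.5, d)  # activations
w = np.zeros(d)
w[0] = 1.0                      # sensitivity (Jacobian) of the task output
y = H @ w                       # the task output

# Steps 2-4: active-and-relevant direction = activity covariance times w.
Hc = H - H.mean(axis=0)
C = Hc.T @ Hc / n
core = C @ w
core /= np.linalg.norm(core)

# Step 5: projector onto the 1D core.
P = np.outer(core, core)

# Step 6: causal checks. Core-only keeps the behavior; core-removed
# leaves almost no task signal.
keep_corr = np.corrcoef((H @ P) @ w, y)[0, 1]
drop_ratio = np.std((H @ (np.eye(d) - P)) @ w) / np.std(y)
print(round(keep_corr, 3), round(drop_ratio, 3))
```

With this toy setup, the core-only correlation with the true output is near 1 and the core-removed output retains almost none of the output's variance, mirroring the sufficiency and necessity tests.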
Each experimental setting, detailed:
- Markov chains (one-layer transformers):
- What happens: Extract a 3D core; run core-only/core-removed ablations; fit linear dynamics inside the core; compare the operator's eigenvalues to the true Markov transition spectrum.
- Why this step: If the core operator's eigenvalues match the chain's nontrivial eigenvalues, the core encodes the chain's dynamics.
- Example: The fitted core operator's eigenvalues match the chain's nontrivial spectrum to within about 1%.
- Modular addition (two-layer transformers):
- What happens: Track cores over training; notice that the test-accuracy spike (grokking) aligns with a sharp core condensation to low rank and the operator's eigenvalues snapping onto the unit circle.
- Why this step: Shows the cyclic (rotational) mechanism emerges automatically at grokking.
- Example formula: Modular wrap (a + b) mod p. Example: 41 + 17 ≡ 5 (mod 53), since 41+17=58 and 58-53=5.
- Secret sauce: After grokking, with weight decay on, the core inflates by spreading signal across many rotational modes (redundancy). Turning weight decay off later keeps it compact.
- GPT-2 agreement (Small/Medium/Large):
- What happens: Layer sweep to find where agreement is controlled; extract a core; discover a 1D axis that is sufficient (keep-only AUC near 1), necessary (remove-only near chance), and steerable (flipping the axis inverts number preference across free generation).
- Why this step: Proves a universal, steerable core across models with 117M–774M parameters.
- Example flip math: Reflect the hidden state across the axis u with h' = h - 2(u·h)u. Example: if u·h = 2 and u is a unit vector, then subtract 4u; a positive plural tilt becomes a strong singular tilt.
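As a toy illustration of the rotational mechanism described above (synthetic centroids on a circle, not the paper's learned ones): fitting a linear map that sends each class centroid to the next recovers a rotation whose eigenvalues sit on the unit circle.

```python
import numpy as np

# Synthetic class centroids on a circle: "add 1 (mod p)" maps each
# centroid to the next, i.e., a rotation by 2*pi/p.
p = 12
angles = 2 * np.pi * np.arange(p) / p
Z = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # centroid of class k
Z_next = np.roll(Z, -1, axis=0)                         # centroid of k+1 mod p

# Least-squares fit of a linear "shift-by-1" operator: Z @ A ~ Z_next.
A = np.linalg.lstsq(Z, Z_next, rcond=None)[0]
eigvals = np.linalg.eigvals(A)
print(np.abs(eigvals))  # both magnitudes 1: a pure rotation by 2*pi/12
```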
The secret sauce (what makes it clever):
- It's causal (tested by ablations), compact (tiny subspaces), dynamical (operators + eigenvalues tell a simple story), and universal (recurs across seeds and scales). Instead of guessing circuits, we discover the invariant "melody" automatically.
04 Experiments & Results
The tests and why:
- Markov chains: Do different one-layer transformers learn the same internal dynamics even if their weights are unrelated? We measure test accuracy, compare weight similarity (should be low), then extract cores and compare their spectra to ground truth.
- Modular addition: Can we recover the unknown algorithm automatically, and can we watch it form over time (grokking)? We measure test accuracy, core dimension, and the operator's eigenvalues over epochs.
- GPT-2 agreement: Is there a small, steerable axis controlling singular/plural across model scales? We measure AUC for agreement under core-only, core-removed, and core-flipped interventions, and see if the axis predicts verb margins.
Competition (what we compare to):
- Chance and Bayes-optimal baselines in Markov chains.
- Pre- vs. post-grokking checkpoints in modular addition.
- Base vs. ablated/steered performance in GPT-2.
- Full-model alignment controls vs. core-based alignment.
Scoreboard with context:
- Markov chains:
- All three runs: test accuracy ≈ 0.75 (near Bayes-optimal for the stochastic chain). Weight cosine similarity across runs ≈ zero (as different as random directions).
- Cores: 3D and causally validated. Core-only accuracy ≈ Bayes-optimal; core-removed ≈ chance.
- Geometry: projector overlap ≈ 0.02–0.04; principal angles near 90° (nearly orthogonal!).
- Statistics: CCA correlations ≈ 0.98–0.99 (almost perfect shared signal). That's like saying the melodies match even though the sheet music uses different staves.
- Dynamics: Fitted eigenvalues match the nontrivial Markov spectrum within about 1%. That's like reproducing the beats and tempo of the original song exactly.
- Modular addition:
- Grokking: test accuracy spikes around epoch 800, exactly when the core snaps into a low dimension and the operator's eigenvalues jump onto the unit circle.
- Interpreting unit-circle eigenvalues: |λ| = 1 means rotation, the perfect mechanism for wrap-around addition. Example eigenvalue: λ = e^(2πik/p) has |λ| = 1.
- After grokking, continued weight decay inflates core size (≈15 → ≈60). Sufficient dimensions stay about the same (≈15–20), but necessary dimensions grow, showing redundancy expansion.
- Operator saturation: mode counts around the circle rise toward the maximum number of harmonic bins; disabling weight decay after grokking prevents this proliferation and keeps the core compact.
- Surprise: Weight decay, meant to simplify, can spread the solution over many modes, making it bigger but redundantly robust. It's like learning the same song in 27 harmonies at once.
- GPT-2 agreement (Small/Medium/Large):
- Layer sweep: little causal effect in early layers, strong effect in late layers at a conserved depth across scales.
- Core is 1D with a huge spectral gap. Core-only AUC ≈ 0.975–0.997 (near perfect), core-removed ≈ 0.217–0.244 (below chance), core-flipped ≈ 0.021–0.038 (near-perfect inversion). That's like turning an A student into a perfect opposite-day student by flipping one switch.
- Predictive axis: projection onto the core linearly predicts the singular–plural margin with strong correlations (r ≈ 0.76–0.82 within models; cross-model Pearson r ≈ 0.924–0.968). In one dimension, Pearson r equals CCA.
- Open-ended generation: flipping the axis adaptively at each step flips agreement in long text (not just on "is/are/was/were"), suggesting a global number variable is being controlled.
Surprising findings:
- Cores from independent Markov transformers are geometrically near-orthogonal yet statistically almost identical: functionally the same melody in different keys.
- Grokking aligns with the sudden formation of a clean rotational operator inside the core.
- Keeping weight decay on after grokking inflates the core by adding many redundant rotational modes; turning it off freezes a compact, interpretable core.
- A single axis in GPT-2 controls agreement across scales and can be flipped to invert grammar in free text; effects propagate beyond the original target verbs.
Tiny math check-ins with examples:
- Chance vs. Bayes-optimal (concept): Bayes-optimal picks the most likely next token given the current one. For a transition row like (0.75, 0.25), picking the 0.75 token is optimal. Example: If the current token is A, and the next is B with 25% chance, predicting A again wins 75% of the time.
- Unit-circle eigenvalue magnitude: |e^(iθ)| = 1. Example: θ = π/2 gives λ = i, and |λ| = 1.
- Axis reflection for steering: h' = h - 2(u·h)u. Example: If u·h = 2 and u is a unit vector, then subtract 4u; a plural-leaning hidden state becomes equally singular-leaning.
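The chance vs. Bayes-optimal check-in can be verified numerically. A small sketch, where the second row of the transition matrix is made up for illustration:

```python
import numpy as np

# Bayes-optimal next-token prediction for a Markov chain: always guess
# the argmax of the current row. (The second row here is made up.)
P = np.array([[0.75, 0.25],
              [0.40, 0.60]])

# Long-run state frequencies, then average the per-state best-guess odds.
stationary = np.linalg.matrix_power(P, 200)[0]
bayes_acc = (stationary * P.max(axis=1)).sum()
print(bayes_acc)  # ~0.692: the ceiling any predictor can reach here
```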
Overall: Across three growing levels of complexity, cores are small, causal, dynamically meaningful, and shared. They give us a scoreboard that measures not just accuracy, but whether we've found the true engine.
05 Discussion & Limitations
Limitations (be specific):
- Task scope: Markov chains, modular addition, and agreement are controlled or local tasks. We don't know if multi-hop reasoning or complex planning also collapse to tiny cores.
- Linear operator fits: Inside cores, simple linear dynamics worked well for these tasks. Harder tasks may need nonlinear operators; identifying those reliably is open.
- Core extraction choices: ACE uses activity and relevance as defined here. Different choices (e.g., different targets, Jacobian estimates, or datasets) could change the extracted core.
- Post-training dynamics: Weight decay schedules shape core size. Real-world training recipes vary, so interpretability windows may be narrower or shift.
- Generalization across architectures: We showed invariance across seeds and GPT-2 scales; broader architectures (e.g., mixture-of-experts) remain to be tested.
Required resources:
- Access to hidden activations and (approximate) Jacobians or gradient-like sensitivities.
- Sufficient compute to gather activation datasets and run SVD/CCA; modest compared to training, but nontrivial for large models.
- Causal intervention hooks (to implement core-only, core-removed, and core-flipped runs) in the inference stack.
When NOT to use:
- If the task definition is vague and you can't write a target function (no clear way to define relevance), ACE may return a blurry or misleading subspace.
- If the model behavior depends on many entangled tasks at once without a clean decomposition, a single core may not exist or may not be low-dimensional.
- If training hasn't converged or is wildly unstable, extracted cores might reflect transient heuristics rather than a stable algorithm.
Open questions:
- Do complex reasoning tasks (e.g., tool use, multi-hop QA) also funnel through small, invariant cores? If yes, what do those operatorsā spectra look like?
- Can we build stronger theory connecting ACE to control concepts like observability and minimal realizations for nonlinear systems?
- How do cores interact across layers? Are there hierarchical cores that compose into bigger algorithms?
- Can we automate task decompositionādiscovering multiple cores for multiple subskillsāwithout hand-crafted prompts?
- Can cores predict safe and effective model merging? If two models share an aligned core but differ elsewhere, is that a green light to merge?
Takeaway: Cores give us a compact, causal, and universal target for interpretability and control, but scaling to frontier behaviors, building rigorous theory for nonlinear settings, and automating task discovery are the big next steps.
06 Conclusion & Future Work
3-sentence summary: Even when transformers have very different weights, they often hide the same tiny, low-dimensional engine (the algorithmic core) that truly carries the task. These cores are causally necessary and sufficient, share the same dynamics across independent trainings, and can be read by fitting simple operators (e.g., rotations for modular addition, spectra for Markov chains). In GPT-2 models, a single axis in late layers controls subject-verb agreement and can be flipped to invert grammar in free text.
Main achievement: A general, automated method (ACE) to extract, validate, and interpret invariant algorithmic coresāshowing that transformers converge to compact shared computations across seeds and scales.
Future directions:
- Test whether complex reasoning and planning also compress into small cores, and what their spectra reveal.
- Connect ACE more tightly to control theory for nonlinear systems and to information-theoretic invariants.
- Develop multi-core discovery for multi-skill tasks and use cores as robust coordinates for model merging and safe steering.
Why remember this: It reframes interpretability from chasing shifting circuits to uncovering stable invariants: the computational melody that repeats no matter who plays it. If we can routinely find and steer these tiny engines, we gain clearer understanding, better diagnostics, and practical control over large models.
Practical Applications
- Causal debugging: Remove or preserve a core to confirm whether it truly drives a failure mode or capability.
- Safe steering: Flip or nudge a small core axis to correct undesirable outputs (e.g., invert a bias or fix agreement).
- Training diagnostics: Track core dimension over time to detect grokking and avoid post-grokking redundancy inflation.
- Model merging: Align and compare cores to assess merge-compatibility before interpolating weights.
- Monitoring: Watch core activations as compact indicators of task engagement during deployment.
- Compression: Keep only core-relevant directions for task-specific adapters or efficient inference.
- Curriculum design: Trigger core formation earlier by adjusting weight decay schedules around grokking.
- Feature discovery: Use core spectra (eigenvalues) to identify latent cyclic or Markovian structure in tasks.
- Robustness checks: Validate that a task still depends on the same core after fine-tuning or domain shift.
- Explainable demos: Show users the single slider (core axis) that controls a visible behavior like singular/plural.