Revisiting the Platonic Representation Hypothesis: An Aristotelian View
Key Summary
- People thought big AI models were all learning the same overall picture of the world, but those measurements were secretly biased by model size and depth.
- The authors show two confounders: wider models (more features) and deeper models (more layers to compare) both inflate similarity scores even when nothing is truly similar.
- They fix this with a simple, metric-agnostic, permutation-based null-calibration that sets a fair zero point and provides valid p-values.
- After calibration, global spectral measures (like CKA) mostly lose their "we all match globally" trend across models and modalities.
- However, local neighborhood similarity (who is near whom) remains strong and scales with model capacity across images, text, and even video.
- Local distances themselves do not consistently agree; it's the neighbor relationships (rank order) that align, not the exact spacing.
- The paper proposes the Aristotelian Representation Hypothesis: neural networks converge to shared local neighborhood relationships.
- The calibration framework offers statistical guarantees, controls false positives, and works for any bounded similarity metric.
- This helps researchers make fairer model comparisons, pick better transfer partners, and design more trustworthy brain-model alignment studies.
Why This Research Matters
Fair comparisons change conclusions. Many labs and companies choose models for transfer or fusion by reading similarity plots; if those plots are biased by model width and depth, they may pick the wrong partners. Calibrated scores prevent overclaiming that all large models share the same global map, focusing attention on the robust signal that truly persists: local neighbor agreement. This helps design better multimodal systems (e.g., image-text or video-text search) that lean on shared neighborhoods. It also improves neuroscience comparisons, avoiding false "brain-like" claims from inflated global measures. Overall, it upgrades our measurement toolkit so future scaling studies can be trusted.
Detailed Explanation
01 Background & Problem Definition
You know how you and your friend might both sort your toy cars by color, but your boxes are different shapes? Even if your boxes look different from the outside, the way you placed red cars next to other red cars might be similar on the inside.
🍞 Top Bread (Hook): Imagine two classmates each making a map of the same playground. Their drawings look different overall, but the swings are still drawn near the slide in both. That "near-ness" is a clue the maps share structure. 🥬 Filling (The Actual Concept): Representational similarity metrics are tools to measure how alike two AI models' internal maps (representations) are.
- What it is: They turn model activations into numbers that summarize "how similarly two models organize the same data."
- How it works: 1) Feed the same items to both models. 2) Collect their feature vectors. 3) Compare them with a chosen metric. 4) Output a score (higher means more alike).
- Why it matters: Without them, we'd be guessing whether two models see the world similarly; with them, we can make scientific claims and design better transfers. 🍞 Bottom Bread (Anchor): If two vision models both group cat pictures near dog pictures but far from car pictures, a similarity metric should give a high score.
🍞 Top Bread (Hook): You know how a satellite photo shows a whole city at once, while a street photo shows just your block? 🥬 Filling: Global spectral measures look at the whole shape of a representation space at once.
- What it is: Metrics (like CKA) that summarize global, big-picture alignment of two spaces using matrix spectra or correlations.
- How it works: 1) Build big matrices of interactions within and across models. 2) Compute energies or correlations. 3) Normalize to get a score from about 0 to 1.
- Why it matters: They tell us if the two "whole cities" have similar layouts, not just one block. 🍞 Bottom Bread (Anchor): CKA might say two language models built on different data still share a similar overall layout of sentence features.
🍞 Top Bread (Hook): When you meet friends at school, you notice who stands next to whom more than the exact centimeters between them. 🥬 Filling: Local neighborhood similarity measures who is near whom in each model's space.
- What it is: Metrics (like mutual k-Nearest Neighbors) that ask whether each point's closest neighbors match across two models.
- How it works: 1) For each item, find its top-k neighbors in model A. 2) Do the same in model B. 3) Count overlaps. 4) Average across all items.
- Why it matters: Even if the big maps look different, shared neighbor lists show models agree about local relationships. 🍞 Bottom Bread (Anchor): If both models say a certain photo is most similar to the same 10 other photos, that's local neighborhood agreement.
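The neighbor-overlap recipe above fits in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the function name `mutual_knn_score` and the exact normalization (overlap divided by k) are our assumptions.

```python
import numpy as np

def mutual_knn_score(X, Y, k=10):
    """Average overlap of each item's k nearest neighbors in X and in Y.

    X, Y: (n, d_x) and (n, d_y) arrays whose rows are paired items.
    Returns a value in [0, 1]; 1 means identical neighbor lists.
    """
    def knn_indices(Z):
        # Pairwise Euclidean distances; mask the diagonal to exclude self.
        d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        return np.argsort(d, axis=1)[:, :k]

    nx, ny = knn_indices(X), knn_indices(Y)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nx, ny)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 32))
B = A @ rng.normal(size=(32, 64))  # a randomly projected copy of A: neighbors mostly survive
score_related = mutual_knn_score(A, B)
score_random = mutual_knn_score(A, rng.normal(size=(200, 64)))  # unrelated space
```

Even though B lives in a different, wider space than A, its neighbor lists largely agree with A's, while an unrelated random space overlaps only at chance level (about k/n).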
🍞 Top Bread (Hook): Before measuring with a ruler, you check it starts at zero, right? 🥬 Filling: The Platonic Representation Hypothesis (PRH) claimed that as models get bigger, their internal maps converge to a common global structure of reality.
- What it is: A big-picture claim that different models and modalities become globally similar when scaled up.
- How it works: 1) Compare many models. 2) Measure similarity. 3) See scores rise with size. 4) Conclude convergence.
- Why it matters: If true, it says there's one shared global blueprint models discover. 🍞 Bottom Bread (Anchor): Studies showed CKA rising with model scale across image and text models, interpreted as "we're all drawing the same city map."
But there's a twist. The authors found two invisible tricksters that make scores look higher even when there's no real similarity.
🍞 Top Bread (Hook): Think of blowing up a photo: noise gets bigger too, and you might see patterns that aren't real. 🥬 Filling: The width confounder means higher feature dimensions create a positive baseline in many global metrics, inflating similarity.
- What it is: A fake similarity boost that grows when models use more features than the sample size can reliably support.
- How it works: 1) In high dimensions, random cross-structure doesn't average to zero. 2) Spectral metrics pick up this random energy. 3) Wider models look more aligned by accident.
- Why it matters: Comparing a narrow model to a wide one becomes unfair; wider can win without being truly closer. 🍞 Bottom Bread (Anchor): Two unrelated models, each with thousands of features but only about a thousand samples, can score surprisingly "similar" just from dimensional noise.
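The width confounder is easy to reproduce: compute linear CKA between two completely independent random "representations" and watch the chance baseline climb with dimension. A sketch with standard linear CKA and illustrative sizes, not the paper's exact experiment.

```python
import numpy as np

def linear_cka(X, Y):
    # Standard linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), after centering.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
n = 100  # fixed sample size
baseline = {
    d: np.mean([linear_cka(rng.normal(size=(n, d)), rng.normal(size=(n, d)))
                for _ in range(5)])
    for d in (10, 100, 1000)
}
# Wider "models" look more similar by pure chance: baseline rises with d/n.
```

Nothing here is truly similar, yet the score grows with width, which is exactly why raw global scores cannot be compared across models of different dimension.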
🍞 Top Bread (Hook): If you roll 100 dice and only report the highest number, it'll sound like you're a lucky genius. 🥬 Filling: The depth confounder means taking the maximum over many layer pairs inflates scores as models get deeper.
- What it is: A search-over-layers effect; more comparisons raise the chance of a high score by luck.
- How it works: 1) Compute all layer-to-layer scores. 2) Pick the max. 3) More layers = more chances for a big random number. 4) The max grows even with no true match.
- Why it matters: Deep models look more aligned simply because you looked in more places. 🍞 Bottom Bread (Anchor): If you compare 10 layers vs. 100 layers and report the best match, the 100-layer model will often "win" by chance.
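The dice analogy can be simulated directly. Here every layer-pair "similarity" is pure noise, yet reporting the best match makes the deeper comparison look better. A toy illustration, not the paper's simulation setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_layer_match(n_layers):
    # Pretend each layer-pair score is pure noise in [0, 1]; report the max.
    scores = rng.uniform(size=(n_layers, n_layers))
    return scores.max()

# Average "best match" over 100 trials for shallow vs. deep model pairs.
shallow = np.mean([best_layer_match(10) for _ in range(100)])    # 100 pairs searched
deep = np.mean([best_layer_match(100) for _ in range(100)])      # 10,000 pairs searched
# deep > shallow even though every individual score has the same distribution.
```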
These confounders were often unaccounted for, so the field risked over-claiming global convergence. The paper's fix is a fair, easy baseline: shuffle the data to learn what "chance" looks like, then measure above that.
02 Core Idea
🍞 Top Bread (Hook): You know how in science class we first learn what "no effect" looks like before deciding if something really works? Like testing plants with plain water to know the baseline. 🥬 Filling: The "Aha!" moment: Use permutation-based null-calibration to set a fair zero for any similarity metric, then re-check the convergence story; it turns out global convergence fades, but local neighborhood agreement remains.
Multiple analogies (same idea, 3 ways):
- Thermometer analogy: If the thermometer is stuck 5 degrees too high, every room seems warmer. Calibration subtracts that 5, and you finally know which rooms are truly warmer.
- Game show analogy: If you flip many coins and only show the best streak, it looks like magic. Calibration asks, "How good would the best streak be by luck?" and only counts beyond that.
- Maps vs. blocks: Even if two city maps look different overall, both still put the bakery near the school. Calibration shows global maps don't truly match, but block-level neighbors do.
Before vs. After:
- Before: Scores went up with model size, so people concluded models share a global blueprint across modalities.
- After: Once you remove width and depth biases, the global trend largely disappears. But local neighbor relationships keep showing strong agreement, even across images, text, and video.
Why it works (intuition, not equations):
- In high dimensions, randomness has structure. Spectral metrics feel that structure and report a non-zero baseline.
- Searching over many layers naturally produces a high best score by chance.
- Permuting breaks the true pairing between X and Y while keeping everything else the same. Whatever score you get then is your "chance level." Now you can fairly ask, "Is my real score bigger than chance, and by how much?"
Building Blocks (each with a simple purpose):
- 🍞 Top Bread (Hook): Ever race your friend but start one step behind or ahead? Not fair! You need a fair start line. 🥬 Filling: Null Calibration
- What it is: A procedure that turns raw similarity into a calibrated score with a principled zero and a p-value.
- How it works: 1) Compute the real score. 2) Shuffle rows of one model K times. 3) Recompute K null scores. 4) Find a threshold (e.g., 95th percentile). 5) Set calibrated score to 0 if you're below threshold, otherwise rescale to max=1.
- Why it matters: It removes size-related illusions and makes scores comparable across models and metrics. 🍞 Bottom Bread (Anchor): If your uncalibrated CKA is 0.35 but chance is 0.30, the calibrated score counts only the 0.05 above chance.
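The five steps above fit in one small function. A minimal sketch assuming a bounded metric with maximum 1; the function name, the toy metric, and the +1 smoothing in the p-value are our choices, not the paper's code.

```python
import numpy as np

def calibrate(metric, X, Y, K=200, alpha=0.05, seed=0):
    """Permutation null-calibration for a bounded similarity metric (max = 1)."""
    rng = np.random.default_rng(seed)
    s_obs = metric(X, Y)                                    # 1) the real score
    null = np.array([metric(X, Y[rng.permutation(len(Y))])  # 2-3) K null scores
                     for _ in range(K)])
    tau = np.quantile(null, 1 - alpha)                      # 4) chance threshold
    p = (1 + np.sum(null >= s_obs)) / (K + 1)               # valid permutation p-value
    s_cal = max((s_obs - tau) / (1 - tau), 0.0)             # 5) rescale above chance
    return s_cal, p

def toy_metric(X, Y):
    # Absolute correlation of the first feature of each model; bounded by 1.
    return abs(np.corrcoef(X[:, 0], Y[:, 0])[0, 1])

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
Y_related = X + 0.1 * rng.normal(size=(300, 5))  # strongly paired with X
s_cal, p = calibrate(toy_metric, X, Y_related)
# A real shared signal survives calibration: s_cal stays high and p is small.
```

Note that shuffling preserves everything about Y (dimension, scale, distribution) except its pairing with X, which is what makes the null fair.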
- 🍞 Top Bread (Hook): If you pick the tallest person from a huge crowd, they'll be tall even if the average person isn't. 🥬 Filling: Aggregation-aware Calibration
- What it is: Calibrating not just each layer-pair score, but the whole selection procedure (like the max over layers).
- How it works: 1) For each permutation, recompute all layer scores. 2) Apply the same aggregator (e.g., max). 3) Build a null for the aggregate. 4) Compare the observed aggregate to this null.
- Why it matters: Prevents deep models from looking better just because you searched more pairs. 🍞 Bottom Bread (Anchor): If max-over-layers is 0.50 but typical permuted max is 0.48, you only credit the 0.02 beyond chance.
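A sketch of the same idea over layers: each permutation is applied once and the whole max-over-layers selection is recomputed, so the null reflects the search itself. Names and the toy metric are illustrative assumptions, not the authors' code.

```python
import numpy as np

def toy_metric(X, Y):
    # Absolute correlation of the first feature of each layer; bounded by 1.
    return abs(np.corrcoef(X[:, 0], Y[:, 0])[0, 1])

def calibrated_max(layers_A, layers_B, K=100, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(layers_A[0])
    # The aggregator (max over all layer pairs) is recomputed per permutation.
    agg = lambda perm: max(toy_metric(Xa, Xb[perm])
                           for Xa in layers_A for Xb in layers_B)
    obs = agg(np.arange(n))                              # observed aggregate
    null = np.array([agg(rng.permutation(n)) for _ in range(K)])
    tau = np.quantile(null, 1 - alpha)                   # null of the aggregate itself
    return max((obs - tau) / (1 - tau), 0.0)

rng = np.random.default_rng(2)
layers_A = [rng.normal(size=(200, 4)) for _ in range(3)]
unrelated = [rng.normal(size=(200, 4)) for _ in range(3)]
# Same model, but one layer genuinely tracks a layer of A:
related = unrelated[:2] + [layers_A[1] + 0.1 * rng.normal(size=(200, 4))]

cal_unrelated = calibrated_max(layers_A, unrelated)  # typically ~0: search alone earns no credit
cal_related = calibrated_max(layers_A, related)      # the one true match stands out
```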
- 🍞 Top Bread (Hook): When choosing a study buddy, you care who sits next to whom (neighbor groups), not the exact centimeters apart. 🥬 Filling: Aristotelian Representation Hypothesis
- What it is: Models converge to shared local neighborhood relationships, not a single global blueprint.
- How it works: 1) Compare neighbor lists across models. 2) See significant overlap that grows with capacity. 3) Note: exact distances need not match.
- Why it matters: Tells us where the real cross-model agreement lives; useful for transfer, retrieval, and multimodal alignment. 🍞 Bottom Bread (Anchor): Two different encoders both agree that "this caption" is closest to "these few images," even if their whole embedding spaces are arranged differently.
03 Methodology
At a high level: Same inputs → compute raw similarity → build a chance baseline by shuffling → get a calibrated score (and p-value) → for multi-layer models, calibrate the selection step too.
Step-by-step recipe:
- Gather paired data
- What happens: Pick n matching items (e.g., the same 1,024 image-text pairs). Get representations X from model A and Y from model B; rows correspond to the same items.
- Why this exists: We need one-to-one pairing to ask whether two models organize the same items similarly.
- Example: Row i is the embedding of the i-th image in X and of its caption in Y.
- Choose a similarity metric
- What happens: Decide whether to focus on global structure (e.g., CKA) or local neighbors (e.g., mutual k-NN).
- Why this exists: Different questions need different rulers. Global measures summarize the whole space; local measures ask who's next to whom.
- Example: Use CKA-RBF for global spectral similarity and mKNN with k=10 for local neighborhood agreement.
- Compute the raw score
- What happens: Compute the raw score s_obs = s(X, Y).
- Why this exists: It's your first glance at similarity, but beware: it may be biased by width (dimension) and depth (number of layer comparisons).
- Example: You might get CKA = 0.42 or mKNN = 0.15.
- Build the empirical null by permutation (scalar case)
- What happens: Shuffle the pairing between X and Y by randomly permuting the rows of Y, K times (e.g., K=200). For each shuffled Y, recompute s(X, perm(Y)).
- Why this exists: Shuffling breaks true relationships while keeping all other properties (dimensions, preprocessing) identical; this reveals what "by chance" looks like for your exact setup.
- Example: Your null scores for CKA might cluster around 0.30, showing a non-zero chance baseline.
- Get the threshold and p-value
- What happens: Sort the K null scores together with the observed score; pick the (1−α)-quantile as the threshold τ_α (e.g., the 95th percentile for α=0.05). Compute a permutation p-value by counting how many null scores are ≥ s_obs.
- Why this exists: The threshold sets a principled zero point (typical chance level). The p-value tells how surprising s_obs is under chance.
- Example: If 4 out of 200 null scores exceed s_obs, the permutation p-value is (4+1)/(200+1) ≈ 0.025.
- Form the calibrated score (scalar case)
- What happens: For bounded metrics (often max=1), rescale: s_cal = max((s_obs − τ_α)/(1 − τ_α), 0). If s_obs ≤ τ_α, set s_cal = 0.
- Why this exists: It removes the chance floor and preserves the top of the scale (1). Now calibrated scores are comparable across widths and datasets.
- Example: If s_obs = 0.42 and τ_α = 0.30, then s_cal = 0.12/0.70 ≈ 0.171.
- Aggregation-aware calibration (depth confounder)
- What happens: If you plan to report the max over many layer pairs, repeat Steps 4-6 but apply the same shuffle to all layers and recompute the aggregate (e.g., max) each time. Compare the observed aggregate to the null distribution of the aggregate.
- Why this exists: It neutralizes the "more layers → higher max by luck" effect.
- Example: If the observed max is 0.58 but typical permuted max is 0.55, only the extra 0.03 counts beyond chance.
- Statistical guarantees and reporting
- What happens: Because permutation p-values are super-uniform under the null, declaring "s_cal > 0 at level α" controls false positives. When testing many pairs, apply standard multiple-testing corrections (e.g., FDR).
- Why this exists: Ensures honest, reproducible claims.
- Example: After calibration, you say, "Local alignment is significantly above chance for 90% of model pairs (FDR < 0.05)."
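Putting the recipe together on synthetic data shows why it matters: two unrelated, wide "models" earn a large raw CKA (Step 3) that collapses once the permutation null sets the zero (Steps 4-6). A sketch under illustrative sizes with standard linear CKA, not the authors' code.

```python
import numpy as np

def linear_cka(X, Y):
    # Standard linear CKA on centered data; bounded by 1.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
n, K, alpha = 256, 200, 0.05
X = rng.normal(size=(n, 512))   # "model A": wide, random
Y = rng.normal(size=(n, 512))   # "model B": independent of A

s_obs = linear_cka(X, Y)                                        # Step 3: raw score (inflated!)
null = np.array([linear_cka(X, Y[rng.permutation(n)])           # Step 4: empirical null
                 for _ in range(K)])
tau = np.quantile(null, 1 - alpha)                              # Step 5: chance threshold
p = (1 + np.sum(null >= s_obs)) / (K + 1)                       # Step 5: permutation p-value
s_cal = max((s_obs - tau) / (1 - tau), 0.0)                     # Step 6: calibrated score
# s_obs is far from 0 purely from the width confounder, yet s_cal typically drops to ~0.
```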
The secret sauce (why this method is clever):
- It's metric-agnostic: works for CKA, RSA, mKNN, and more; no custom math derivations needed.
- It corrects two different biases at once: the width baseline (high-dimensional chance energy) and the depth baseline (search-over-layers inflation).
- It produces effect sizes on a shared, interpretable scale (zero = typical chance; one = perfect), plus valid p-values.
🍞 Top Bread (Hook): Think of two rulers: one correctly starts at zero; the other secretly starts at five. You can't compare measurements unless you fix the zero. 🥬 Filling: Behavior of spectral metrics under high dimension
- What it is: Spectral metrics (like CKA/CCA) show a non-zero chance floor that grows with feature dimension relative to sample size.
- How it works: 1) High-dimensional randomness has stable, non-trivial energy. 2) Spectral summaries capture that energy. 3) As d/n grows, the baseline rises.
- Why it matters: Without calibration, wider models look globally more similar even when they're not. 🍞 Bottom Bread (Anchor): In synthetic tests, uncalibrated CKA rose with d/n; calibrated CKA snapped back to zero under no true relation.
🍞 Top Bread (Hook): If you ask 200 classmates the same yes/no question, at least one is likely to say "yes" by chance. 🥬 Filling: Depth confounder under aggregation
- What it is: Max-over-layers inflates with the number of pairs searched.
- How it works: 1) More comparisons → higher best score by luck. 2) Aggregation-aware calibration measures how big the best score would be under shuffles. 3) Only counts beyond that.
- Why it matters: Lets deep and shallow models be compared fairly. 🍞 Bottom Bread (Anchor): In simulated data with no true signal, raw max rose with layer count; the calibrated max stayed flat near zero.
04 Experiments & Results
The tests (what and why):
- Synthetic sanity checks: Show that raw global metrics inflate with width (d/n) and that max-over-layers inflates with depth; verify that calibration removes both without losing power to detect real signals.
- Real multimodal tests: Re-run the Platonic Representation Hypothesis setup for image-language (WIT) and extend to video-language, comparing families of encoders across scales.
- Why: To see what remains of "convergence" after we fix the baselines.
The competition (what it's compared against):
- Global spectral metrics: Linear and RBF CKA, plus CCA family and RSA.
- Local neighborhood metrics: Mutual k-NN (k=10 by default), cycle-kNN, and CKNNA.
- Baseline vs. calibrated: Always compare raw scores to their calibrated versions to judge how much apparent similarity was just chance.
Scoreboard with context:
- Width confounder (synthetic): Uncalibrated CKA climbed steadily as d/n grew, like getting higher and higher grades just because the class got easier. Calibrated CKA dropped to zero under the null across all d/n, meaning no fake A's.
- Depth confounder (synthetic): Raw max-over-layers increased with layer count L, even with no signal, like always finding a lucky outlier in a bigger group. Aggregation-aware calibration flattened this trend to near zero.
- Type-I control and power: At α=0.05, false positives stayed at or below 5% (good). When real shared signal was added, detection power rose fast (calibration didn't throw the baby out with the bathwater).
Revisiting Platonic convergence (image-language):
- Global: Uncalibrated CKA increased with model capacity, echoing prior claims. But after calibration, this rise largely vanished, like realizing everyone's "A" was curved by an unfair baseline.
- Local: Neighborhood metrics (mKNN, cycle-kNN, CKNNA) kept strong, significant alignment even after calibration. That's like saying, "Different maps may disagree globally, but they still agree on which landmarks cluster together."
- Local distances vs. relationships: With small RBF bandwidths (very local distances), calibrated alignment didn't hold. This shows exact spacings don't match; it's the neighbor lists (relationships, ranks) that align.
Extending to video-language:
- Same pattern: Calibrated global CKA showed no consistent scaling trend, but local neighborhood metrics showed clear, growing alignment with larger video encoders. Smaller video encoders bottlenecked the effect.
Surprising findings:
- The big surprise is that the famous global convergence trend almost disappears once you remove width and depth artifacts.
- Yet the local picture becomes crisper: who-is-near-whom converges robustly across modalities, suggesting a shared relational backbone even when the whole global geometry differs.
Takeaway numbers in plain words:
- Think of uncalibrated global scores as looking like "B+ to A-" across big models. Calibration often drops them closer to "C," showing much of the previous excellence was baseline inflation.
- Local metrics stay above chance and often rise with capacity, keeping their "B+" after calibration: evidence of real, robust agreement.
Bottom line: After fair calibration, the fieldās story shifts from a global-unity dream to a local-relations reality.
05 Discussion & Limitations
Limitations (honest assessment):
- Exchangeability assumption: Calibration assumes rows (paired items) are exchangeable. If your data are grouped (e.g., multiple captions per image, or videos from the same source), you must use restricted permutations (shuffle within groups) to keep validity.
- Computation: K permutations (e.g., 200-500) multiply compute time. It's parallelizable, but you still spend extra cycles.
- Choice of k and kernel bandwidth: Local metrics depend on k; too small or too large can change sensitivity. Similarly, RBF bandwidth choices affect what "local" means. Calibration helps, but hyperparameters still matter.
- Bounded vs. unbounded metrics: The simple rescaling formula assumes a known maximum (often 1). For unbounded metrics, you report the positive excess over the null threshold without rescaling.
- Not a silver bullet: Calibration fixes width/depth artifacts but doesn't erase all domain shifts or sampling biases in your dataset.
Required resources:
- Access to representations from both models on the same items.
- Enough samples n for permutation-based estimates of the null to stabilize (the paper's experiments use on the order of 1,000 paired items).
- Compute to run K permutations' worth of similarity calculations (K times the number of comparisons), especially in aggregation-aware settings.
When not to use (or be careful):
- Strong temporal or grouped dependence without proper restricted permutations: naïve shuffling breaks validity.
- Extremely tiny datasets (n very small), where null estimation is unstable and effect sizes become noisy.
- If your main goal is absolute distance matching (exact geometry), remember this paper's finding: local relationships can agree without distance matching.
Open questions:
- Mechanism: Why do models agree on neighbor structure across modalities? What training ingredients (data scale, objectives, augmentations) most drive neighborhood convergence?
- Granularity: How fine is "local"? How does alignment change as we move from k=5 to k=1000, or as we vary kernel bandwidths?
- Causality to transfer: When do shared neighborhoods best predict transfer success or cross-modal retrieval quality? Can we formalize this link?
- Beyond pairwise: How do triads or higher-order relational patterns align across models? Can we generalize from neighbors to richer topologies?
- Efficient calibration: Can we derive fast approximations or reuse cached nulls to cut compute while keeping guarantees?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that widely used similarity measurements between neural networks are biased by model width (too many features) and depth (too many layer comparisons). A simple, general null-calibration using permutations corrects these biases and reveals that global convergence largely fades, while local neighbor relationships remain robustly shared across modalities. Based on this, the authors propose the Aristotelian Representation Hypothesis: models converge on who-is-near-whom, not on an identical global blueprint.
Main achievement: Turning noisy, unfair similarity scores into calibrated, comparable effect sizes with statistical guarantees, and using that clarity to rewrite the convergence story from global to local.
Future directions: Map the exact training choices that strengthen neighborhood convergence; link calibrated local alignment to transfer learning gains; explore richer relational structures beyond nearest neighbors; and build faster calibration approximations for very large-scale studies.
Why remember this: It's a reminder that fair baselines can flip a field's headline; after calibration, the real shared structure isn't the whole map, it's the neighborhood. That sharper view helps us compare models honestly, design better multimodal systems, and ask deeper questions about what "understanding" looks like inside neural networks.
Practical Applications
- Model selection for transfer learning: Pick source models whose calibrated local neighborhood alignment with your target data is highest.
- Multimodal retrieval tuning: Optimize encoders to maximize calibrated neighborhood overlap (not raw global scores) for better cross-modal search.
- Layer matching for distillation: Use aggregation-aware calibrated maxima to find truly corresponding layers across teacher and student.
- Benchmark reporting: Publish both raw and calibrated similarities (with p-values) to provide honest, comparable results.
- Neuro-AI alignment: Prefer calibrated local metrics when comparing model representations to brain data to avoid width/depth artifacts.
- Curriculum design: Track how training stages change calibrated local vs. global alignment to guide objectives and augmentations.
- Early stopping and diagnostics: Monitor calibrated scores to detect meaningful representation changes beyond chance.
- Architecture decisions: Evaluate whether extra width or depth brings genuine (calibrated) gains or just inflates raw similarity.
- Hyperparameter search: Tune neighborhood size k or kernel bandwidth using calibrated validation curves rather than raw scores.