Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models
Key Summary
- The paper asks a simple question: what must a vision model's internal pictures (embeddings) look like if it can recognize new mixes of things it already knows?
- It defines three must-have behaviors (divisibility, transferability, and stability) and proves these force a specific geometry inside the model.
- That geometry is additive and straight: each concept adds its own piece (linearity), and different concepts don't interfere (orthogonality).
- This gives theoretical grounding to the Linear Representation Hypothesis: linear structure isn't just common; it's necessary for compositional generalization under standard training with linear readouts.
- They also prove a tight capacity bound: to represent k independent concepts, you need at least k embedding dimensions (not k times the number of values).
- Across CLIP, SigLIP, and DINO models, real embeddings partly match theory: per-concept pieces are low-rank and nearly orthogonal across concepts.
- How close a model is to this geometry (measured by a whitened linearity score) strongly predicts how well it generalizes to never-seen concept combinations.
- The work predicts where bigger future models will converge: more additive structure and cleaner, more orthogonal concept directions.
- This tells builders exactly what to target if they want reliable, mix-and-match visual understanding: make embeddings additive and orthogonal across concepts.
Why This Research Matters
When models meet rare but sensible combinations in the real world, we want them to behave calmly and correctly. This work tells us exactly what inner geometry enables that: additive parts that don't interfere, with at least one clean axis per concept. It gives builders measurable targets (linearity, orthogonality, factor rank) that predict generalization to unseen mixes, turning a vague hope into a concrete checklist. It also helps data and objective design: losses and curation that nurture these properties should deliver more reliable systems. Finally, it guides evaluation: if a model's geometry doesn't look additive and orthogonal, don't expect strong compositional behavior yet. In short, this is a roadmap for dependable, mix-and-match visual understanding.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine a box of LEGO bricks. You've built a house and a car before, but never a flying house. If you really understand LEGO, you can snap the same familiar parts together in a brand-new way and still make sense of it.
The Concept (Compositional Generalization): Many vision systems see tons of pictures online, but they only ever see a tiny sliver of all possible combinations of objects, colors, sizes, places, and relations. The goal is to still recognize new combinations of known parts, like spotting "a person on a cat" even if training only saw "a cat on a person." How it works (today, roughly):
1) A vision encoder turns an image into a vector (an embedding). 2) A simple readout (often linear) tries to tell which concepts are present (like which object, color, or size). 3) We hope this works even for unseen mixes of concepts. Why it matters: Without this, a model can ace common scenes but stumble on rare-yet-sensible ones, exactly when we most need it to be smart. Anchor: If you ask "Is there a person?" on a funny cartoon where a tiny person rides a cat, a truly compositional model will still answer "Yes," even if it never saw that combo in training.
Hook: You know how a good recipe book lets you combine different ingredients into lots of meals? If it only listed full meals, you'd be stuck.
The Problem: The real world is combinatorial: with k concepts (like shape, color, size, position, relation) each taking n values, there are n^k possible combos, way too many to cover in training. Models succeed surprisingly often, but still fail on weird mixes, which raises a core question: what must the inner representations look like to support reliable mix-and-match? How it works (the challenge):
1) Training sees a subset of combinations (the "training support"). 2) We test on all combinations, including unseen ones. 3) We use linear readouts (like CLIP's text probes) to keep the check simple. Why it matters: If we don't know what structure makes generalization possible, we're guessing when we build bigger models or collect more data. Anchor: Think of a music class that only practiced certain note combinations. Which musical "rules" must students learn so they can play any new song made of the same notes?
Hook: Imagine sorting your backpack by subjects: math stuff in one pocket, art in another. Mixing them up makes it hard to find things when you need them.
Failed Attempts: Just showing more data or memorizing many combos doesn't guarantee success. Benchmarks show even strong models struggle when test setups shuffle concept pairings they never saw together. Approaches that don't shape the inner geometry can't promise transfer. How it works (why memorization breaks):
1) Seeing many combos doesn't cover the combinatorial universe. 2) If parts aren't separated inside the embedding, the readout can't reliably re-use them in new mixes. 3) So behavior flips unpredictably on rare combos. Why it matters: We need structure that guarantees part-reuse, not just luck. Anchor: If you only memorize "dog-with-ball" and "cat-on-couch," you'll be confused by "cat-with-ball" unless your brain separately tracks "animal" and "object."
Hook: You know how graph paper keeps lines straight so you can build neat shapes? Representations need their own "graph paper" rules.
The Gap: What are the non-negotiable, model-agnostic rules the embedding must follow so a simple linear readout can read every part, transfer from a subset to the whole grid, and stay stable no matter which valid subset we trained on? How it works (framing it):
1) We define three desiderata: divisibility, transferability, stability (explained below). 2) We consider standard training (gradient descent with cross-entropy) and linear readouts. 3) We ask what geometry is forced if generalization truly succeeds. Why it matters: This turns "I hope it generalizes" into "It must look like this inside," giving builders a concrete target. Anchor: It's like discovering that to tile a bathroom floor without gaps, tile shapes must follow strict rules. Here we uncover the "tile rules" of good embeddings.
Hook: Picture using stickers to label every part of a messy picture: one color for shape, one for color, one for size. You want each sticker set to work no matter what else is in the scene.
Real Stakes: Reliable out-of-distribution behavior matters for search (finding rare queries), accessibility (describing unusual scenes well), safety (recognizing odd but risky situations), and fairness (not failing on underrepresented concept mixes). If we know the geometry we need, we can measure it and improve it. How it works (practically):
1) Check if embeddings add up per concept. 2) Check if concept directions are nearly orthogonal. 3) Check if this predicts success on unseen mixes. The paper does all three. Why it matters: Clear diagnostics shorten the path from research to dependable systems used by everyone. Anchor: A self-driving system should notice "pedestrian + scooter + unusual angle," not just the most common traffic scenes it saw before.
02 Core Idea
Hook: Imagine mixing colored lights. Red + green + blue make many new colors, but each color is still its own clean beam. If the beams interfere, your new colors get muddy.
The Aha! Moment (in one sentence): If you want a simple linear readout to correctly read every part in any mix, your image embedding must add up the parts linearly, and the part-directions must be orthogonal so they don't interfere. How it works:
1) Define three must-haves: divisibility (every combo has a home), transferability (train on a small but valid subset, test on all), stability (retrain on any valid subset, predictions don't wiggle). 2) Under standard training (GD + cross-entropy) with linear readouts, these force a max-margin geometry. 3) That geometry implies additivity (sum of per-concept vectors) and orthogonality (cross-concept difference directions at right angles). 4) Also, you need at least k dimensions for k concepts. Why it matters: This gives a necessary blueprint for representation geometry; no more guessing what structure to aim for. Anchor: Like writing a song with separate tracks (drums, bass, vocals). Add them to make the full song (additivity), and keep tracks separate so turning up drums doesn't distort vocals (orthogonality).
Hook: You know how LEGOs click together? Each brick keeps its shape when you build a castle or a spaceship.
Analogy 1 (LEGOs): Concepts are bricks. Additivity says the final build is the sum of the bricks; orthogonality says bricks don't deform each other's shape. How it works:
1) Each concept value is a vector piece. 2) A picture is the sum of the chosen pieces. 3) Orthogonality ensures pieces fit without warping. Why it matters: Then a single linear readout can find each brick in any build. Anchor: If the model saw "red square" and "blue circle," it can still spot "red circle" and "blue square" because "red," "blue," "square," and "circle" each keep their own direction.
Hook: Picture a choir. Sopranos, altos, tenors, and basses sing different lines that harmonize.
Analogy 2 (Choir): Each voice part is a concept. Additivity mixes voices into the final song; orthogonality keeps parts distinct so a microphone can pick each voice with a simple setting. How it works: 1) Voices add, 2) Parts don't cancel each other, 3) A linear knob can lift any part. Why it matters: If parts blurred, a mic couldn't isolate a single voice in a new arrangement. Anchor: Even if you rearrange who sings which verse (new combo), you can still identify the soprano line.
Hook: Think of graph paper axes: left-right vs. up-down. Moving along one axis doesn't change your place on the other.
Analogy 3 (Axes): Each concept is its own axis. Additivity moves along multiple axes at once; orthogonality means axes are at right angles, so moving on one axis doesn't shift you on the others. How it works: 1) Pack k axes in at least k dimensions. 2) Place each concept's values along its axis. 3) Any combo is a point with coordinates from each axis. Why it matters: A linear readout can recover each coordinate cleanly, even for unseen points. Anchor: Change color (move along the color axis) while size stays put; the size readout won't be fooled because size lives on a different axis.
Hook: You know how detectives rely on rules: separate evidence, don't let clues contaminate each other.
Before vs. After: Before, people often observed linear structure and near-orthogonality and guessed they were helpful. After, this paper shows: under common training and linear readouts, that structure isn't optional; it's required by the three desiderata. How it works:
1) GD + cross-entropy tends toward max-margin separators (SVM behavior). 2) Stability across different valid training subsets forces per-concept decisions to ignore irrelevant changes. 3) That invariance yields additive shifts per concept and orthogonal difference directions. Why it matters: We shift from "it seems to happen" to "it must happen" if you truly generalize compositionally. Anchor: If changing the hat color shouldn't affect the shape prediction, then the "color" vector must be perpendicular to the "shape" vector so the shape readout doesn't see it.
Hook: Imagine fitting many straight drinking straws into a box without bending them. Each straw needs its own direction.
Building Blocks:
- What it is: The recipe for reliable mix-and-match is (a) additive concept factors, (b) cross-concept orthogonality, (c) at least k dimensions for k concepts.
- How it works:
- Divisibility: ensure every concept combo is representable by the readout.
- Transferability: train on a valid slice, generalize to the whole grid.
- Stability: predictions don't change across valid retrainings.
- Under GD+CE with linear heads: these force additivity + orthogonality.
- Capacity: need d ≥ k dimensions (tight bound) to host k independent axes.
- Why it matters: Miss any piece, and linear readouts can fail on unseen mixes or flip answers between retrainings. Anchor: If you try to fit 5 independent straws (concepts) into a 3D box (d=3), some must bend or overlap; then you can't point to each straw cleanly.
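The additive, orthogonal recipe above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's construction: dimensions, the three binary concepts, and the `embed`/`read` helpers are all invented for the example. Each concept contributes its own orthogonal direction, and a linear readout then decodes every combination exactly, including ones it was never tuned on.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3  # embedding dimension and number of binary concepts (d >= k)

# One orthonormal direction per concept: rows of a random orthogonal matrix.
axes = np.linalg.qr(rng.normal(size=(d, d)))[0][:k]

def embed(values):
    """Additive embedding: concept i contributes +axes[i] or -axes[i]."""
    return sum((1 if v else -1) * axes[i] for i, v in enumerate(values))

def read(z, i):
    """Linear readout for concept i: just project onto its own axis.
    Orthogonality makes the other concepts' contributions cancel."""
    return int(axes[i] @ z > 0)

# Every combination of the 3 concepts is decoded exactly by the same
# fixed linear readouts -- no combination-specific training needed.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            z = embed((a, b, c))
            assert (read(z, 0), read(z, 1), read(z, 2)) == (a, b, c)
print("all 8 combinations decoded correctly")
```

Because the axes are orthonormal, each projection is exactly ±1 regardless of what the other concepts are doing; that cancellation is the whole point of the orthogonality requirement.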
03 Methodology
Hook: Think of a neat assembly line. Ingredients go in, stations add their part, and a final inspector checks everything quickly.
Overview: At a high level: Image → Encoder f → Embedding z → Linear readouts (one per concept) → Predictions for every concept. How it works:
1) Define a concept space (all combos of values for k concepts). 2) Decide which combos appear in training (a training support T). 3) Train linear readouts with gradient descent on cross-entropy. 4) Demand three behaviors: divisibility, transferability, stability. 5) Analyze what geometry of z makes all this possible and necessary. 6) Test models to see if geometry shows up and predicts generalization. Why it matters: This turns the fuzzy idea of "be good on unseen mixes" into a checkable recipe. Anchor: Like building a sandwich: bread (input), fillings (concept pieces), knife test (linear readout). If the fillings don't stack neatly, the knife test won't slice cleanly.
Step A: Concept space and training supports
- What happens: We model the world as k concepts, each with n values; the full grid has n^k combinations. Training only sees a subset T (the support), like 10% of all combos.
- Why this step exists: To reflect reality: data never covers all combos, yet we still want full-grid competence.
- Example: Concepts = {shape, color, size}. If n=3 each, full combos = 27; we might train on 14 and test on all 27.
- What breaks without it: If we pretend we see everything, we learn nothing about transfer.
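The concept grid and training support of Step A are easy to make concrete. A hypothetical sketch (concept names, values, and the 14-combo support size are illustrative, matching the example above):

```python
from itertools import product
import random

concepts = {
    "shape": ["square", "circle", "triangle"],
    "color": ["red", "green", "blue"],
    "size":  ["small", "medium", "large"],
}

# Full grid: n^k combinations (3^3 = 27 here).
grid = list(product(*concepts.values()))
assert len(grid) == 27

# Training support T: a fixed subset of the grid. We test on the whole
# grid, including the combinations that never appeared in T.
random.seed(0)
support = set(random.sample(grid, 14))
unseen = [combo for combo in grid if combo not in support]
print(f"{len(support)} training combos, {len(unseen)} unseen at test time")
```

The point of the split is that `unseen` contains only recombinations of values that each appeared somewhere in `support`, so success on it measures part-reuse, not novelty of the parts themselves.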
Step B: Three desiderata (must-have behaviors)
- What happens: We require 3 properties: • Divisibility: a linear readout can assign any valid value to each concept; every combo has a decision region. • Transferability: train on any valid T, then correctly classify all concepts on all combos. • Stability: retraining on any valid T doesn't change per-concept posteriors (ideally).
- Why this step exists: It nails down what we mean by "works compositionally" in practice.
- Example: Shape-choice shouldn't depend on which colors were seen in training; and predictions shouldn't flip if you retrain on a different valid subset.
- What breaks without it: Without stability, answers wobble; without transferability, you can't handle new mixes; without divisibility, some combos are unreachable.
Step C: Training rule (GD + cross-entropy with linear readouts)
- What happens: Optimize linear heads per concept using standard cross-entropy. In the binary case, this converges to max-margin separators (SVM behavior).
- Why this step exists: This is how CLIP-like and linear-probe setups train in practice; analyzing it gives relevant guarantees.
- Example: For shape, learn weights so "square" vs "circle" are separated with maximum margin in z-space.
- What breaks without it: With arbitrary nonlinear heads, memorization can fake compositionality; our necessity claims focus on the common linear-readout regime.
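The implicit-bias fact Step C leans on can be seen on a toy problem. This is a minimal sketch under invented assumptions (four hand-picked 2-D points, arbitrary learning rate and step count), not the paper's setup: by symmetry the max-margin separator of this data is the vertical axis, with direction (1, 0) and margin 2, and plain gradient descent on the logistic/cross-entropy loss drifts toward exactly that direction even from a mis-aligned start.

```python
import numpy as np

# Four linearly separable points; the max-margin separator is x = 0,
# i.e. weight direction (1, 0), and every point sits at distance 2.
X = np.array([[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w = np.array([0.5, 0.8])   # deliberately mis-aligned start
lr = 1.0
for _ in range(200_000):
    margins = y * (X @ w)
    # Gradient of the average logistic (cross-entropy) loss.
    grad = -(sigmoid(-margins) * y) @ X / len(X)
    w -= lr * grad

direction = w / np.linalg.norm(w)
norm_margin = (y * (X @ w)).min() / np.linalg.norm(w)
print(direction, norm_margin)  # direction -> (1, 0), normalized margin -> 2
```

The loss never reaches zero; instead the weight norm grows forever while the *direction* converges to the hard-margin SVM solution, which is the sense in which "GD + cross-entropy tends toward max-margin separators."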
Step D: The secret sauce (stability + max-margin → additivity + orthogonality)
- What happens: If changing irrelevant concepts should not shift a concept's decision (stability), and the classifier is max-margin, the easiest consistent solution is that flipping one concept moves z by adding a fixed vector for that concept. To keep other concepts unaffected, these concept-difference vectors must be orthogonal.
- Why this step exists: It's the heart of the proof: the desiderata and the training rule force the geometry.
- Example: Flipping "color" adds a "color vector." For shape to ignore this, shape's separator must be perpendicular to the color-difference direction.
- What breaks without it: If directions aren't orthogonal, a color change can push a point across a shape boundary, which breaks transferability and stability.
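The failure mode in Step D fits in a few lines of numpy. All numbers are invented for illustration: a fixed shape readout, one orthogonal color-flip vector and one skewed color-flip vector that leaks into the shape direction.

```python
import numpy as np

w_shape = np.array([1.0, 0.0])        # linear readout for shape
z_square_red = np.array([0.4, 1.0])   # classified "square": logit > 0

# Orthogonal color flip: moves only along the axis w_shape ignores.
color_orth = np.array([0.0, -2.0])
# Non-orthogonal color flip: has a component along w_shape.
color_skew = np.array([-1.0, -2.0])

logit_before = w_shape @ z_square_red
logit_orth = w_shape @ (z_square_red + color_orth)
logit_skew = w_shape @ (z_square_red + color_skew)

print(logit_before, logit_orth, logit_skew)
# orthogonal flip: shape logit unchanged (0.4 -> 0.4)
# skewed flip: shape logit crosses zero (0.4 -> -0.6), prediction flips
```

With the orthogonal flip, the shape decision is provably invariant to color for every embedding; with the skewed flip, whether the prediction survives depends on how far the point happens to sit from the boundary, which is exactly the instability the desiderata rule out.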
Step E: Capacity bound (d ≥ k)
- What happens: Prove you need at least k dimensions to house k independent concept axes (regardless of how many values each concept has).
- Why this step exists: It tells us which embeddings are big enough to even attempt perfect compositionality.
- Example: With 6 concepts, a 4D embedding can't cleanly separate them; some will squish together.
- What breaks without it: If d < k, even perfect training can't rescue you; some concepts must share directions and will interfere.
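The linear-algebra core of the bound can be checked mechanically. A toy numpy sketch (k = 5 candidate concept directions in d = 3 dimensions, values invented): orthogonal nonzero vectors are linearly independent, so rank caps how many can be mutually orthogonal.

```python
import numpy as np

d, k = 3, 5                          # 5 concept directions, only 3 dimensions
rng = np.random.default_rng(0)
V = rng.normal(size=(k, d))

# Orthogonal nonzero vectors are linearly independent, so at most d of
# them fit in R^d; the matrix rank certifies the cap.
print(np.linalg.matrix_rank(V))      # at most 3

# Gram matrix of the 5 normalized directions: with k > d, some pair is
# forced to overlap, so perfect cross-concept orthogonality is impossible.
U = V / np.linalg.norm(V, axis=1, keepdims=True)
overlap = np.abs(U @ U.T - np.eye(k)).max()
print(overlap > 0)                   # True: some pair of axes interferes
```

No training procedure can escape this: if d < k, at least two concepts must share a direction, and changing one will move embeddings along the other's readout.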
Step F: Measuring geometry in real models
- What happens: Given embeddings for a dataset whose concepts we know, we:
- Recover per-concept factors by averaging over examples sharing that value.
- Measure linearity with a whitened, projected R^2 score: how well do sum-of-factors reconstruct embeddings? (Projection/whitening avoids inflating scores by unrelated or dominant directions.)
- Measure orthogonality with cosine similarities of per-concept difference directions; lower cross-concept cosine means more orthogonal.
- Measure factor rank using PCA: how many principal components explain 95% of per-concept variation?
- Why this step exists: To test if real models show the necessary geometry and whether it predicts generalization.
- Example: On dSprites, check if changing size adds a near-constant vector and if that vector is perpendicular to the position-change vector.
- What breaks without it: Without fair, normalized measures, we can be fooled by scale or unrelated features.
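The three diagnostics of Step F can be sketched on synthetic additive embeddings. This is a simplified version (plain R^2 without the paper's whitening and projection; all dimensions, noise levels, and the two toy concepts are invented): recover per-value factors by group averaging, then score additivity, cross-concept cosine, and factor rank.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d = 16
values = [2, 3]                        # two concepts: binary and ternary
factors = [rng.normal(size=(n, d)) for n in values]

# Synthetic embeddings: sum of per-concept factor vectors plus noise.
combos = list(product(*(range(n) for n in values)))
Z = np.array([factors[0][a] + factors[1][b] + 0.1 * rng.normal(size=d)
              for a, b in combos])

mu = Z.mean(axis=0)
# Recover each concept's factors by averaging embeddings sharing a value.
est = [np.array([Z[[c[i] == v for c in combos]].mean(axis=0) - mu
                 for v in range(n)]) for i, n in enumerate(values)]

# Linearity: R^2 of the additive reconstruction mu + sum of factors.
recon = np.array([mu + est[0][a] + est[1][b] for a, b in combos])
r2 = 1 - ((Z - recon) ** 2).sum() / ((Z - mu) ** 2).sum()

# Orthogonality: cosine between cross-concept difference directions.
diff0 = est[0][1] - est[0][0]
diff1 = est[1][1] - est[1][0]
cos = diff0 @ diff1 / (np.linalg.norm(diff0) * np.linalg.norm(diff1))

# Factor rank: PCs needed for 95% of one concept's factor variance.
s = np.linalg.svd(est[1], compute_uv=False)
rank95 = int(np.searchsorted(np.cumsum(s**2) / (s**2).sum(), 0.95) + 1)
print(round(r2, 3), round(cos, 3), rank95)
```

On real encoder embeddings, the same recipe applies with the dataset's known factor labels in place of `combos`; the paper's whitened, projected score additionally normalizes away dominant unrelated directions before computing R^2.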
Step G: Link geometry to generalization
- What happens: Train linear probes on 10% of combos and test on held-out 10% unseen mixes. Correlate generalization accuracy with the linearity score and with orthogonality.
- Why this step exists: To validate that "more additive + more orthogonal" really means "better compositional transfer."
- Example: Models with higher whitened R^2 get higher unseen-combo accuracy across datasets.
- What breaks without it: We wouldn't know if the geometry is just pretty or actually useful.
The secret sauce in one line: Stability across training supports plus max-margin solutions leave "add pieces and keep them perpendicular" as the only robust choice, which is exactly the geometry a linear readout can reuse on new mixes.
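Step G's train-on-subset/test-on-unseen protocol, sketched on synthetic additive embeddings (a toy stand-in for real encoder outputs; the 50/50 split and least-squares probes are illustrative choices, not the paper's exact 10% splits):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
d, n, k = 32, 4, 3                     # embedding dim, values per concept, concepts
factors = [rng.normal(size=(n, d)) for _ in range(k)]

combos = list(product(range(n), repeat=k))            # 4^3 = 64 combinations
Z = np.array([sum(factors[i][c[i]] for i in range(k)) for c in combos])

idx = rng.permutation(len(combos))
train, test = idx[:32], idx[32:]                      # disjoint combo sets

acc = []
for i in range(k):                                    # one linear probe per concept
    labels = np.array([c[i] for c in combos])
    Y = np.eye(n)[labels[train]]                      # one-hot targets
    # Least-squares linear probe trained only on the support combos.
    W, *_ = np.linalg.lstsq(Z[train], Y, rcond=None)
    pred = (Z[test] @ W).argmax(axis=1)
    acc.append((pred == labels[test]).mean())
print([round(float(a), 2) for a in acc])              # accuracy on unseen combos
```

Because these toy embeddings are perfectly additive, the probes transfer to combinations they never saw; degrading the additivity (e.g., adding combo-specific interaction terms to `Z`) would drag the unseen-combo accuracy down, which is the correlation the paper measures on real models.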
04 Experiments & Results
Hook: Think of a science fair where you judge bridges by how much weight they hold and how well they keep their shape. Here, the "bridge" is the model's embedding geometry.
The Test: They measured three things and their connection to success on unseen combinations.
- What they measured:
- Linearity (whitened, projected R^2): How well does a sum of per-concept factors reconstruct embeddings? 1.0 would be perfect additivity.
- Orthogonality: How small are cosines between difference vectors of different concepts? Smaller means more perpendicular.
- Factor rank: How many PCs explain 95% of variance per concept? Lower rank means tighter, more efficient packing.
- Compositional accuracy: Train linear probes on 10% of combos, test on held-out unseen 10%, average per-concept accuracies.
- Why those: If the theory is right, more additive + more orthogonal should mean better mix-and-match generalization. Anchor: Like checking if separate choir parts stay distinct (orthogonality), if the full song is the sum of parts (linearity), and if that predicts how well the choir sings a brand-new arrangement (unseen combos).
The Competition: They tested a wide family of encoders.
- Vision-language: OpenAI CLIP, OpenCLIP, MetaCLIP, MetaCLIP2, SigLIP, SigLIP2.
- Vision-only: DINO v1–v3.
- Datasets: PUG-Animal (photorealistic, controllable), dSprites (shapes with factors like position, size, rotation), MPI3D (3D scenes with controlled factors), and ImageNet-AO (attribute–object).
- Baseline: Randomly initialized OpenCLIP ViT-L/14 with linear probes (sanity check that effects aren't due to dimension or chance). Anchor: It's like comparing many brands of bikes on the same obstacle courses and including a wobbly unassembled bike as a baseline.
The Scoreboard (with context):
- Linearity: Real models scored roughly 0.4–0.6 on the whitened, projected R^2 scale across datasets, well above random baselines (about 0.12–0.42) but not near 1.0. That's like getting a strong B on additivity when random gets an F.
- Orthogonality: Within a concept, directions were similar; across concepts, cosines were small (~0.09–0.12), much lower than in random encoders (~0.32). That's like saying voices in the same section blend (as they should), but different sections stay distinct.
- Factor rank: Many ordinal/continuous concepts had low effective rank (often 1–4 PCs for ≥95% variance). Discrete multi-attribute concepts showed higher rank. Across different model families, the variance curves per concept were strikingly similar, suggesting convergent geometry.
- Accuracy link: Higher linearity scores matched higher compositional accuracy on unseen mixes across datasets and models. The random baseline sat in the low-linearity/low-accuracy corner.
- Zero-shot echoes: Using text probes directly, the same relationships showed up on PUG-Animal and ImageNet-AO, backing the linear-probe findings. Anchor: Models whose music tracks add cleanly and don't step on each other's toes sing new songs more accurately.
Surprising Findings:
- Partial, not perfect: Today's best models aren't fully additive/orthogonal; there's headroom. Yet the amount they are is already predictive of how well they generalize compositionally.
- Cross-model similarity: Factor spectra by concept looked remarkably alike across architectures, hinting at shared "Platonic" structure.
- Dimensionality nuances: Empirically, Euclidean CE training approaches the d ≥ k bound most closely; BCE often needs closer to ~2k; spherical constraints add roughly an extra dimension. These are useful hints for designers choosing losses and geometries. Anchor: Different choirs learned similar voice-shapes independently, and the tighter their parts stayed, the better they tackled brand-new songs.
05 Discussion & Limitations
Hook: Building a reliable LEGO castle is easier when each brick is crisp; it's harder if some bricks are squishy.
Limitations:
- Exact stability is idealized: Real training won't make per-concept posteriors perfectly identical across all valid supports; results hold approximately.
- Linear-readout focus: The necessity results target the standard regime (linear probes, CLIP-like heads). Highly nonlinear heads can memorize; then the guarantees don't apply.
- Binary-to-multi extension: Proofs are cleanest in the binary case; multi-valued results are partly constructive/empirical rather than fully closed-form.
- Fixed encoder assumption in analysis: Theory analyzes changing readouts across supports with a fixed encoder; in practice, encoders are trained once. The gap is discussed as future work.
- Partial emergence: Current models show partial additivity/orthogonality; improving to the ideal is an open engineering challenge.
Resources required:
- Datasets with known factor structure (like dSprites, MPI3D, controllable synthetic datasets, and attribute–object pairs).
- Access to many pretrained checkpoints (CLIP families, SigLIP, DINO variants) and compute for probing/whitening/PCA.
- Linear probing infrastructure and robust evaluation splits over unseen combinations.
When not to use:
- If you rely on heavy nonlinear heads trained end-to-end for a specific closed world, these constraints won't diagnose your system well.
- If your downstream task isn't about reusing parts in new mixes (e.g., a single fixed mapping), compositional diagnostics may add little value.
- If your embedding dimension is clearly below the number of independent concepts (d < k), don't expect clean factor recovery or robust transfer.
Open questions:
- How close to perfect can large-scale training get on real, messy data? What data curation or objectives push embeddings toward the ideal geometry fastest?
- Can we design training-time regularizers that directly encourage per-concept additivity and cross-concept orthogonality without hurting overall performance?
- How do these conditions interplay with binding (who-has-which-attribute-where) and spatial relations beyond per-concept labels?
- Can we automatically discover the concept axes in fully wild data, then align them across models for cross-model transfer and interpretability? Anchor: We know what the perfect bricks look like; now we need manufacturing tricks to make more of them, faster, and at scale.
06 Conclusion & Future Work
Hook: Think of a mixing board where each slider controls a different instrument. To play any new song, you need clean, independent sliders that add up to the whole sound.
Three-sentence summary: This paper shows that to truly generalize to unseen combinations with linear readouts under standard training, vision embeddings must be additive sums of per-concept parts, and those parts' difference directions must be orthogonal. It proves these are necessary (and in the key binary case, also sufficient) and that you need at least k dimensions for k concepts. Experiments confirm modern models partly match this geometry, and the closer they get, the better they handle never-seen mixes. Main achievement: Turning "good compositional behavior" into a precise geometric target (linear factorization with cross-concept orthogonality), grounding the Linear Representation Hypothesis as a consequence, not just an observation. Future directions: Create training objectives and data strategies that directly encourage additive, orthogonal concept axes; extend from labels to bindings and relations; build diagnostics that automatically reveal and align concept subspaces across models. Why remember this: It's a blueprint for dependable mix-and-match intelligence: know the geometry you need, measure it, and train toward it. Anchor: If you want a camera that can describe any scene made of known parts, build its inner picture so each part is its own slider, and sliders don't bump into each other when you move them.
Practical Applications
- Design diagnostics: add a compositional health report (linearity score, cross-concept cosine, factor ranks) to model eval dashboards.
- Data curation: ensure supports contain counterfactual pairs (flip one concept at a time) to encourage clean concept axes.
- Objective tuning: prefer setups (e.g., CE in Euclidean space) that empirically approach the d ≥ k bound and cleaner geometries.
- Probe engineering: align text prompts and image factors; measure probe-span linearity after whitening to monitor zero-shot readiness.
- Model selection: pick checkpoints with higher linearity/orthogonality scores for tasks requiring strong out-of-distribution mixing.
- Capacity planning: ensure embedding dimension d ≥ number of target concepts k to avoid inevitable interference.
- Regularization research: explore penalties that encourage orthogonality between learned per-concept directions.
- Representation alignment: align factor subspaces across models to improve transfer and interpretability between encoders.
- Benchmarking: adopt train-on-subset/test-on-unseen-combos splits and report geometry metrics alongside accuracy.
- Interpretability: recover per-concept factors to visualize how size, color, or position shift embeddings, and use them for debugging.