
COMPOT: Calibration-Optimized Matrix Procrustes Orthogonalization for Transformers Compression

Intermediate
Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip et al. · 2/16/2026
arXiv

Key Summary

  • COMPOT is a training-free way to shrink Transformer models while keeping their smarts.
  • It learns a small set of clean, right-angled directions (an orthogonal dictionary) and mixes only a few of them to rebuild each weight column (sparse coding).
  • Thanks to orthogonality, both updates are closed-form: a quick Procrustes step for the dictionary and simple hard-thresholding for the codes—no slow, iterative loops.
  • A tiny calibration set is used to “whiten” the space so the compression protects what matters for real inputs.
  • COMPOT also spreads the compression budget across layers in one shot by pooling and trimming the smallest singular values globally (with safety guards).
  • Across language, vision-language, and speech models, COMPOT beats strong low-rank SVD and classic dictionary-learning baselines at the same memory budgets.
  • It plays nicely with post-training quantization (like GPTQ 4-bit) to reach extreme compression without falling apart.
  • The method is simple, fast, and deterministic, making it practical for billion-parameter models.
  • Limitations include reliance on good calibration data and assumptions about well-behaved whitening.
  • Bottom line: COMPOT shifts model compression beyond single-subspace SVD toward a union-of-orthogonal-subspaces that preserves accuracy better.

Why This Research Matters

Smaller, accurate models are easier to run on phones, laptops, and edge devices, bringing powerful AI to more people and places. Lower memory and compute cut costs and energy use, making deployments greener and more affordable. By staying training-free, COMPOT avoids long retraining cycles and complicated hyperparameter hunts, speeding up production timelines. Its one-shot allocation and closed-form updates make it predictable and robust—important for large teams and large models. Because it composes well with quantization, it fits into existing toolchains for extreme compression. The method also generalizes across text, vision-language, and speech, so one idea helps many applications. Overall, COMPOT makes “smart, small, and fast” a practical goal rather than a trade-off.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how you can pack a suitcase in different ways? If you fold cleverly, you can fit more clothes without squishing your favorite shirt.

🥬 Filling (The Actual Concept)

  • What it is: Model compression is folding a big AI model so it fits into smaller memory while still working well.
  • How it works: We choose what to keep sharp, what to fold gently, and what to leave out, guided by a small sample of real use (calibration).
  • Why it matters: Without careful folding, the model forgets important skills, like a squished shirt that looks wrinkly when you arrive.

🍞 Bottom Bread (Anchor) Imagine running a chatbot on a phone: compression lets it fit and respond fast without losing its ability to answer questions.

The World Before

  • Transformers grew huge and powerful, but also heavy. They need lots of memory and bandwidth, which makes them costly to deploy on edge devices or to serve at scale.
  • Popular post-training tricks tried to make them smaller without retraining: low-rank SVD, pruning, and quantization. These work, but often trade accuracy for size—especially when pushed hard.

🍞 Top Bread (Hook) Imagine trying to describe every picture with the same small set of crayons.

🥬 Filling (The Actual Concept: Low-Rank Factorization)

  • What it is: Low-rank factorization rewrites a big matrix as two skinny ones that share a single small subspace for all columns.
  • How it works: It finds the most important directions and keeps only those, discarding the rest.
  • Why it matters: It’s simple and fast, but it forces every column to live in the same tiny crayon box, which can be too rigid.

🍞 Bottom Bread (Anchor) In a Transformer, using one shared subspace can miss details when different neurons need different “color mixes.”

The Problem

  • SVD-based compression assumes one shared subspace per weight matrix. But Transformer columns (features) can belong to different local subspaces—some are sharp, others fuzzy.
  • Result: Even moderate compression can degrade accuracy because the single subspace can’t fit everyone comfortably.

Failed Attempts

  • Classic dictionary learning (like K-SVD with OMP) is more flexible: each column can pick a few atoms from a big dictionary. But it needs slow, iterative updates—not great at billion-parameter scale.
  • Advanced SVD methods tried smarter truncation with calibration and activation weighting. Helpful, but still stuck with the single-subspace limit.

🍞 Top Bread (Hook) You know how organizing your desk with neatly arranged trays makes finding things easier and faster?

🥬 Filling (The Actual Concept: Orthogonal Dictionary Learning)

  • What it is: Learn a set of right-angled directions (orthogonal atoms) and let each column pick only a few.
  • How it works: Orthogonality keeps the directions independent; picking only a few keeps storage small.
  • Why it matters: It’s flexible like a union-of-subspaces, but clean and fast because orthogonality simplifies math.

🍞 Bottom Bread (Anchor) Different columns choose different trays; everything is tidy and quick to access.

The Gap

  • We needed a method with the flexibility of sparse dictionaries, the speed of SVD, and the wisdom to protect what matters for the task.

Real Stakes

  • Smaller, faster models bring chatbots to phones, vision-language assistants to cameras, and speech recognition to embedded devices.
  • Less memory and compute means lower costs and energy use, broader access, and greener AI—without retraining.

🍞 Top Bread (Hook) Think of taste-testing a soup before serving.

🥬 Filling (The Actual Concept: Calibration)

  • What it is: Use a tiny sample of real inputs to measure what matters.
  • How it works: Record activations entering each linear layer and optimize to keep the outputs close after compression.
  • Why it matters: Without calibration, you might protect the wrong parts, like adding sugar when it needed salt.

🍞 Bottom Bread (Anchor) Using a few hundred sentences to guide compression helps the model answer questions about those kinds of sentences just as well.

02Core Idea

The “Aha!” Moment (One Sentence) Use an orthogonal dictionary in a whitened, calibration-aware space so both the dictionary and sparse codes have closed-form updates; then allocate compression globally in one shot by pooling normalized singular values—no iterative training needed.

Multiple Analogies (3 Different Ways)

  1. Toolbelt analogy: Instead of one shared tool for all jobs (SVD), give each task its own tiny tool set (sparse codes) chosen from a neatly arranged rack of non-overlapping tools (orthogonal dictionary).
  2. Closet analogy: SVD is one uniform shelf; COMPOT is several tidy, right-angled shelves where each shirt (column) picks a few spots; global allocation is deciding which rarely worn items to box up, across the whole closet, not per shelf.
  3. Choir analogy: SVD forces all singers into one small harmony; COMPOT lets each singer pick a few independent notes from a clean scale; dynamic allocation quiets the softest, least-heard notes across the choir to hit a noise budget.

Before vs After

  • Before: SVD compresses fast but can miss important variations; K-SVD compresses flexibly but is slow.
  • After: COMPOT keeps flexibility (union-of-subspaces) and gains speed (closed-form updates), while using calibration to protect behavior and a simple global rule to spread the budget sensibly.
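The single-subspace limitation is easy to see numerically. Below is a small NumPy sketch (not from the paper; the sizes and the data-generation scheme are invented for illustration) that builds a matrix whose columns live in different local subspaces, then compares a rank-s truncated SVD against per-column top-s coding in a shared orthonormal basis:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 64, 256, 8

# Columns drawn from different local subspaces (a union of subspaces),
# the regime where one shared SVD subspace struggles.
basis = np.linalg.qr(rng.standard_normal((m, m)))[0]
cols = []
for _ in range(n):
    idx = rng.choice(m, size=s, replace=False)  # each column picks its own directions
    cols.append(basis[:, idx] @ rng.standard_normal(s))
W = np.stack(cols, axis=1)

# Baseline: rank-s truncated SVD (one subspace shared by every column).
U, sv, Vt = np.linalg.svd(W, full_matrices=False)
W_svd = U[:, :s] @ np.diag(sv[:s]) @ Vt[:s, :]

# Per-column top-s coding in the full orthonormal basis U: each column
# keeps its own s largest coefficients (hard-thresholding).
C = U.T @ W
keep = np.argsort(-np.abs(C), axis=0)[:s]
mask = np.zeros(C.shape, dtype=bool)
mask[keep, np.arange(n)] = True
W_sparse = U @ (C * mask)

err = lambda A: np.linalg.norm(W - A) / np.linalg.norm(W)
print(f"rank-{s} SVD error:      {err(W_svd):.3f}")
print(f"{s}-sparse coding error: {err(W_sparse):.3f}")  # never larger than the SVD error
```

Letting each column choose its own few directions from a shared orthonormal basis can only discard less energy per column than forcing every column into the same s directions, which is the intuition behind the union-of-subspaces view.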

Why It Works (Intuition, No Equations)

  • Whitening makes the space “fair,” so differences we measure reflect true task importance, not scaling quirks.
  • Orthogonality turns the dictionary update into just “rotate to match” (Procrustes) and the coding into “pick the top s and zero the rest” (hard-thresholding). No chasing your tail with iterative pursuits.
  • Pooling normalized singular values across layers ranks what’s least important globally. Trimming the smallest ones first, with guards, reduces error where it matters least.
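The two closed-form moves can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: it assumes W has already been whitened, and the function name and the knobs k, s, and iters are ours:

```python
import numpy as np

def compot_like_factorize(W, k, s, iters=20):
    """Minimal sketch of the alternating closed-form updates (not the paper's
    implementation). Assumes W is already whitened. D is an m x k orthonormal
    dictionary; S is a k x n code with at most s nonzeros per column."""
    m, n = W.shape
    D = np.linalg.svd(W, full_matrices=False)[0][:, :k]  # SVD initialization
    for _ in range(iters):
        # Sparse coding: project onto D, keep the s largest entries per column.
        C = D.T @ W
        top = np.argsort(-np.abs(C), axis=0)[:s]
        S = np.zeros_like(C)
        S[top, np.arange(n)] = C[top, np.arange(n)]
        # Dictionary update: orthogonal Procrustes via one thin SVD of W S^T.
        P, _, Qt = np.linalg.svd(W @ S.T, full_matrices=False)
        D = P @ Qt
    return D, S
```

Each iteration costs one projection, one top-s selection, and one thin SVD of an m×k matrix, which is the source of the speed advantage over per-atom K-SVD/OMP updates.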

Building Blocks (with Sandwich Explanations)

  • 🍞 Hook: Checking glasses before class. 🥬 Whitening

    • What: Rescale directions so all axes are equally trusted.
    • How: Use calibration activations to build a transform that evens out the space; then compress in that space.
    • Why: Without it, you might protect the loudest but not the most meaningful directions. 🍞 Anchor: Like adjusting volume levels so you judge singing quality, not just loudness.
  • 🍞 Hook: Rotating a puzzle piece to fit a spot. 🥬 Orthogonal Procrustes

    • What: Find the best rotation to align two matrices.
    • How: Compute a thin SVD once and set the dictionary as the rotation that best matches the data.
    • Why: Without this, you’d adjust each atom slowly and separately. 🍞 Anchor: One smooth twist instead of many tiny nudges.
  • 🍞 Hook: Taking quick notes in class. 🥬 Sparse Coding via Hard-Thresholding

    • What: Represent each column using only a few largest coefficients.
    • How: Project onto the dictionary and keep the s biggest entries, zero the rest.
    • Why: Without sparsity, storage balloons and noise creeps in. 🍞 Anchor: You jot only key points, not every word.
  • 🍞 Hook: Splitting a pizza where everyone gets what they need. 🥬 One-shot Dynamic Allocation

    • What: Decide layer-wise compression by trimming the smallest singular values globally.
    • How: Normalize each weight’s spectrum, pool them, cut from the bottom up with per-layer guards.
    • Why: Without this, fragile layers might be over-cut while redundant ones get off easy. 🍞 Anchor: The least tasty crust bits go first, evenly across the table.
  • 🍞 Hook: Turning HD video into a lighter file but still crisp. 🥬 Post-Training Quantization (PTQ)

    • What: Store numbers with fewer bits after training.
    • How: Methods like GPTQ approximate the original weights well even at 4 bits.
    • Why: Without it, memory stays large even after factorization. 🍞 Anchor: A good compressor that keeps the picture sharp.
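The whitening building block can also be sketched concretely. The convention below (right-multiplying W by a Cholesky factor of the input Gram matrix) is one common choice and an assumption on our part, not necessarily the paper's exact formulation; the regularizer eps is likewise illustrative:

```python
import numpy as np

def whiten_weight(W, X, eps=1e-6):
    """Calibration-aware whitening sketch (illustrative convention).
    X holds calibration activations (samples x d_in) that feed the
    linear layer with weight W (d_out x d_in)."""
    G = X.T @ X + eps * np.eye(X.shape[1])  # Gram matrix of the inputs
    L = np.linalg.cholesky(G)               # may need an SVD fallback when
                                            # G is ill-conditioned
    W_white = W @ L                         # compress in this "fair" space
    inv_L = np.linalg.inv(L)                # used to dewhiten the factors
    return W_white, inv_L
```

In this space, the Frobenius error of any approximation matches the functional error on the calibration inputs, which is the precise sense in which whitening "makes the space fair."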

03Methodology

Overview: Input → Steps → Output

  • Input: Pretrained Transformer weights, a small calibration set of activations per linear layer, and a target global compression ratio.
  • Steps: (1) Collect calibration activations → (2) Whitening → (3) Initialize orthogonal dictionary → (4) Alternate: closed-form sparse coding + Procrustes dictionary update → (5) Dewhiten and store factors → (6) One-shot dynamic allocation across layers (if enabled) → (7) Optional: apply PTQ like GPTQ.
  • Output: For each matrix W, store A (dewhitened orthogonal dictionary) and a sparse code matrix S (with values and a mask), meeting the memory budget.

Step-by-Step (What, Why, Example)

  1. Collect calibration activations
  • What: Run a few hundred examples through the model to record inputs X to each linear projection.
  • Why: These examples tell us which outputs must be preserved.
  • Example: 256 text sequences of length ~1k tokens from a general corpus.
  2. Whitening
  • 🍞 Hook: Leveling the playing field.
  • 🥬 What: Build a transform that evens out directions using the Gram matrix of X.
  • How: Compute a stable factor (e.g., Cholesky; SVD if needed) and map W to a whitened space.
  • Why: Keeps the compression objective aligned with functional error, not just raw weight error.
  • 🍞 Anchor: Like grading essays after removing name labels so every student is judged fairly.
  3. Initialize an orthogonal dictionary
  • What: Start with k right-angled columns (k ≤ input dimension), e.g., top singular vectors or an orthonormalized random basis.
  • Why: Good starts converge faster and better; SVD init usually wins.
  • Example: For a 2048×8192 up-projection, choose k around the storage budget and set D to the top 2–4×s basis directions.
  4. Sparse coding (closed-form)
  • 🍞 Hook: Packing only essentials.
  • 🥬 What: For each column, project onto D and keep the s biggest entries → a column-sparse code S.
  • Why: Reduces storage and noise; without it, codes get dense and memory savings vanish.
  • 🍞 Anchor: You pack only the most-worn clothes.
  5. Dictionary update via Orthogonal Procrustes (closed-form)
  • 🍞 Hook: Turning a key to fit a lock just right.
  • 🥬 What: Compute M = (whitened W) × S^T, take its thin SVD, and set D = P Q^T.
  • Why: This aligns D to best reconstruct in one shot; without it, iterative per-atom tweaks are slow.
  • 🍞 Anchor: One precise twist instead of many wiggles.
  6. Alternate steps 4–5 for T iterations
  • What: Repeat hard-thresholding then Procrustes (often ~20 rounds suffice with SVD init).
  • Why: Each pass sharpens codes and dictionary; too few passes underfit, too many bring diminishing returns.
  • Example: Accuracy improves quickly in the first 20–100 rounds, then plateaus.
  7. Dewhiten and store
  • What: Map D back to the original space to get A; store A (dense FP16) and S (only nonzeros + mask, FP16 for values, 1-bit mask packed).
  • Why: Inference uses A and S to reconstruct outputs efficiently.
  • Example: Memory accounting includes A (m×k), S (s per column), and the mask.
  8. One-shot dynamic compression allocation (global, optional but recommended)
  • 🍞 Hook: Trimming the tiniest twigs across a forest, not one tree at a time.
  • 🥬 What: Normalize each W by its Frobenius norm, pool all singular values, and remove the smallest until the global budget is met, with per-layer min/max guards and skipping matrices that don’t benefit from factorization.
  • Why: Avoids over-compressing sensitive layers and under-compressing redundant ones; no iterative search needed.
  • 🍞 Anchor: Everyone chips in fairly based on what they can spare.
  9. Optional: Post-training quantization (e.g., GPTQ 4-bit)
  • 🍞 Hook: Zipping files after reorganizing them.
  • 🥬 What: Quantize the final factors to fewer bits while preserving accuracy.
  • Why: Combines structural compression with low-bit storage for extreme savings.
  • 🍞 Anchor: You pack neatly (factorization) and then vacuum-seal (quantization).
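For intuition on the storage accounting in step 7, here is a back-of-the-envelope calculator. The layout (dense FP16 A, FP16 nonzeros of S, packed 1-bit mask) follows the text, but the 2048×8192 layer and the k, s values below are illustrative choices, not the paper's settings:

```python
def compot_memory_bytes(m, n, k, s, value_bytes=2):
    """Storage sketch for one factorized m x n matrix: dense A (m x k) in
    FP16, s nonzero FP16 values per column of S, plus a packed 1-bit
    k x n mask, as described in the text."""
    A = m * k * value_bytes            # dewhitened orthogonal dictionary
    S_vals = s * n * value_bytes       # nonzero code values
    S_mask = (k * n + 7) // 8          # 1 bit per code entry, byte-packed
    return A + S_vals + S_mask

dense_bytes = 2048 * 8192 * 2                               # FP16 baseline
ratio = compot_memory_bytes(2048, 8192, 1024, 96) / dense_bytes
print(f"compression ratio: {ratio:.3f}")   # about 0.20 for these choices
```

Because the mask costs k·n bits regardless of s, very small matrices can end up cheaper left dense, which matches the limitation noted later in the article.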

The Secret Sauce

  • Orthogonality + whitening turns two hard subproblems into two closed-form steps.
  • Union-of-orthogonal-subspaces gives flexibility beyond a single SVD subspace but stays fast and stable.
  • Global pooled singular-value trimming matches where error hurts least, with simple safety guards.
  • Everything is training-free and deterministic—no gradients, no finicky tuning.
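The global allocation rule is simple enough to sketch directly. The version below is a simplified model (the budget is counted in singular values rather than parameter memory, and keep_frac and min_frac are hypothetical knobs standing in for the paper's guards):

```python
import numpy as np

def one_shot_rank_allocation(weights, keep_frac, min_frac=0.05):
    """Sketch of pooled singular-value trimming (simplified budget model).
    Pools Frobenius-normalized singular values across all layers and trims
    the globally smallest, keeping at least min_frac of each layer's
    spectrum as a per-layer safety guard."""
    spectra = {name: np.linalg.svd(W, compute_uv=False) / np.linalg.norm(W)
               for name, W in weights.items()}
    pooled = sorted(v for sv in spectra.values() for v in sv)
    n_drop = len(pooled) - int(np.ceil(keep_frac * len(pooled)))
    cutoff = pooled[n_drop - 1] if n_drop > 0 else -np.inf
    ranks = {}
    for name, sv in spectra.items():
        keep = int((sv > cutoff).sum())              # trim values below cutoff
        floor = max(1, int(np.ceil(min_frac * len(sv))))
        ranks[name] = max(keep, floor)               # per-layer safety guard
    return ranks
```

A nearly low-rank layer contributes many tiny normalized singular values to the pool and is cut deeply, while a full-spectrum layer keeps most of its budget, all decided in one deterministic pass.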

04Experiments & Results

The Test: What and Why

  • They measured zero-shot accuracy on many language benchmarks (PIQA, HellaSwag, LAMBADA, ARC-e/ch, SciQ, RACE, MMLU) and perplexity on corpora like WikiText and C4.
  • They also checked multimodal (vision-language with Qwen-VL) and speech (Whisper) to show generality.
  • Why: Accuracy and perplexity together reveal if the model still “thinks” well after being shrunk.

The Competition

  • Low-rank SVD methods: SVD-LLM and SVD-LLM V2 (strong baselines with calibration-aware tricks and dynamic rank allocation).
  • Sparse dictionary baseline: CoSpaDi (K-SVD/OMP style, accurate but slower and iterative).
  • Training-based rank selection: Dobi-SVD (differentiable truncation), which is not strictly training-free.
  • Structured pruning: ReplaceMe and LLM-Pruner.
  • Quantization: GPTQ alone and combined with factorization.

The Scoreboard (with Context)

  • On Llama3-8B at 0.2 compression (static per-layer), COMPOT beat SVD-LLM and CoSpaDi on average accuracy and perplexity; think of getting an A when others get a B to B+.
  • With dynamic allocation (full COMPOT), it outperformed SVD-LLM V2 (as reproduced) and Dobi-SVD under matched protocols, while remaining training-free.
  • On Qwen3-VL-8B-Instruct (vision-language), SVD-LLM degraded heavily, but COMPOT kept strong scores across MMMU, OCRBench, RealWorldQA, and MMStar—like keeping most of the picture crisp after file-size reduction.
  • On Whisper Large V3 (speech), COMPOT preserved lower word error rate than SVD-LLM under the same compression, especially as compression got stronger.
  • With GPTQ-4bit on top, COMPOT+GPTQ achieved lower perplexity than GPTQ-only and SVD-LLM V2+GPTQ at equal weight memory (2.8 GB), showing the methods add up.

Surprising Findings

  • A simple, global “pool singular values and trim the smallest” allocation—with basic min/max guards—worked better than more complex iterative searches in several settings.
  • SVD-based dictionary initialization converged fast, so only about 20 iterations gave a great accuracy–time trade-off.
  • Orthogonal dictionaries with hard-thresholding provided accuracy close to (and often better than) heavier, overcomplete K-SVD pipelines—at much lower runtime.

What the Numbers Mean Practically

  • At moderate compression (e.g., CR=0.2–0.3), COMPOT gives noticeably higher accuracy than SVD-LLM and lower perplexity, roughly like moving from a B- to a B+/A- across diverse tasks.
  • In edge or server settings, that gap can decide if you can deploy at half the memory with confidence.

A Note on Fairness of Comparisons

  • The paper carefully aligns protocols (datasets, token lengths, scripts) and reports both reproduced and paper-reported numbers when public code isn’t fully available.
  • They also explain why comparing with training-based or remapping-heavy methods can be tricky when the goal is strictly training-free, structural compression.

05Discussion & Limitations

Limitations (Be Specific)

  • Calibration dependence: If the calibration set is tiny or unrepresentative, whitening and importance estimates can mislead compression, hurting rare domains.
  • Whitening assumptions: Cholesky needs a well-conditioned Gram matrix; with small or noisy data you may need SVD-based whitening and mild regularization.
  • Fixed sparsity pattern: Using a set s of nonzeros per column is simple but not always optimal; learning patterns could help but costs more.
  • Storage accounting: You must store A, nonzeros of S, and a mask; at very low dimensions some matrices may be better left dense.

Required Resources

  • A small GPU memory window to run calibration forward passes (hundreds of samples typically suffice).
  • Thin SVDs per iteration for Procrustes on m×k matrices and projections for hard-thresholding—much lighter than K-SVD/OMP, heavier than a single SVD but still practical.
  • Engineering for mask packing and compatibility with your inference backend.

When NOT to Use

  • If you can fully retrain or fine-tune with large datasets and time, learned distillation or training-based compression may outperform.
  • If your use-case is extremely sensitive to worst-case errors in layers flagged as “dense,” a conservative SVD or quantization-only approach might be safer.
  • If calibration data is unavailable or severely mismatched to deployment, stick to safer uniform compression or just quantization.

Open Questions

  • Can we adaptively learn the sparsity pattern (which atoms per column) without losing closed-form speed?
  • How best to mix COMPOT with KV-cache compression and activation quantization for end-to-end gains?
  • Can we build task-specific guards for the allocation step using richer importance signals beyond singular values?
  • What are the best defaults for k/s across diverse architectures and hardware backends?
  • How does COMPOT behave under heavy distribution shift (e.g., code vs. prose vs. math) and can lightweight recalibration fix it?

06Conclusion & Future Work

3-Sentence Summary COMPOT compresses Transformer layers by learning an orthogonal dictionary in a calibration-whitened space and encoding each column with only a few atoms, using closed-form Procrustes and hard-thresholding—no training required. It then spreads a global compression budget across layers in one shot by trimming the smallest normalized singular values with safety guards. Across language, vision-language, and speech, COMPOT delivers a better accuracy–compression trade-off than strong SVD and sparse baselines and pairs well with post-training quantization.

Main Achievement Shifting from a single-subspace SVD view to a union-of-orthogonal-subspaces with closed-form updates, making flexible sparse factorization as practical and fast as classic low-rank methods.

Future Directions

  • Learn or refine sparsity patterns while keeping near-closed-form speed.
  • Tighten the allocation step with richer importance signals and per-task constraints.
  • Co-design with quantization and runtime kernels for even bigger speedups.
  • Explore lightweight “healing” passes after factorization for a small extra boost.

Why Remember This It shows you can keep accuracy high without retraining by combining three simple ideas—whitening, orthogonality, and global allocation—turning a traditionally slow sparse approach into a fast, deterministic, training-free recipe that scales to billion-parameter Transformers.

Practical Applications

  • Deploy LLMs on consumer GPUs or edge devices by compressing encoder/decoder projections with COMPOT and then applying GPTQ 4-bit.
  • Serve chatbots at lower cost by using COMPOT with one-shot dynamic allocation to hit strict memory budgets while preserving accuracy.
  • Compress vision-language models’ language backbones to enable on-device captioning and OCR without cloud access.
  • Shrink speech recognition decoders (e.g., Whisper) for real-time transcription on embedded systems.
  • Create a fast compression pipeline for A/B tests: run COMPOT with SVD initialization, 20 iterations, and global allocation guards.
  • Retrofit existing SVD compression stacks by swapping the SVD step with COMPOT’s orthogonal dictionary + sparse coding.
  • Bundle COMPOT into model release workflows to produce compressed, quantized variants alongside full-precision checkpoints.
  • Use calibration subsets tailored to target domains (e.g., code, math, dialogue) to specialize compressed models without retraining.
  • Adopt the pooled singular-value allocation to quickly rebalance compression when architectures change or layers are added.
  • Integrate early stopping by monitoring relative reconstruction error to reduce compression wall-clock time further.
#Transformer compression #orthogonal dictionary learning #orthogonal Procrustes #sparse coding #whitening #singular values #dynamic rank allocation #post-training quantization #GPTQ #SVD-LLM #union-of-subspaces #low-rank factorization #calibration-aware compression #Procrustes orthogonalization #matrix factorization