
Spectral Condition for $μ$P under Width-Depth Scaling

Intermediate
Chenyu Zheng, Rongzhen Wang, Xinyu Zhang et al. · 2/28/2026
arXiv

Key Summary

  • Big AI models keep getting wider (more neurons per layer) and deeper (more layers), which often makes training unstable and hyperparameters hard to reuse.
  • This paper gives a single, simple “spectral” rule that tells you how big weights and their updates should be when you scale both width and depth.
  • The rule works for residual networks and unifies many previous, separate μP (maximal update parametrization) recipes into one framework.
  • It explains why two-layer residual blocks need a 1/L residual multiplier (stronger shrinkage) while one-layer blocks only need 1/√L, resolving past confusion.
  • Using the rule, you can map directly to practical hyperparameters for many optimizers (like SGD, AdamW, Muon-Kimi) instead of guessing.
  • In GPT-2–style models, the rule keeps feature sizes stable as models grow and lets you transfer learning rates from small to big models without retuning.
  • The theory uses only basic linear algebra and probability (spectral norms, RMS norms), so it’s easier to follow and extend to new optimizers.
  • Experiments show μP from the spectral rule beats standard parameterization (SP) as width and depth increase, with lower loss and steadier behavior.

Why This Research Matters

Large models are expensive to train, and retuning hyperparameters after every size change wastes time and compute. A single spectral rule that keeps features stable and updates effective means you can scale models confidently. It also makes results more predictable across teams and architectures because everyone measures sizes the same way. By unifying past μP recipes and mapping to many optimizers, the framework lowers the barrier to robust scaling. In practice, this helps build better language and generative models faster, with fewer failed runs. The approach is simple enough to extend, so it can keep pace with new optimizers and architectures.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: You know how stacking more and more books on a wobbly shelf can make the whole thing tip over unless you also use stronger brackets? Making AI models both wider (more neurons) and deeper (more layers) is like stacking more books—you need rules for stronger brackets to keep everything stable.

🥬 The Concept: Maximal Update Parametrization (μP)

  • What it is: μP is a way to choose hyperparameters so that a model learns in a steady, size-independent way even as you make it bigger.
  • How it works:
    1. Watch the size of features (the signals moving through layers) and keep them from blowing up or vanishing.
    2. Scale the learning rates and initial weights with model size so that each training step changes features by about the same amount, no matter the size.
    3. Prefer updates that most effectively change useful features (maximize impact without instability).
  • Why it matters: Without μP, training very large models becomes unstable, and the best hyperparameters for a small model won’t work for a big one, wasting tons of compute on retuning.
  • 🍞 Anchor: When you upgrade from a tiny GPT to a much bigger one, μP aims to let you keep almost the same “recipe” for learning rate and initialization—and have training behave similarly.

🍞 Hook: Imagine a video game character walking on stepping stones across a river. If the stones get too far apart (deeper network) or too wide (wider network) without the right balance, the character slips.

🥬 The Concept: Residual Networks

  • What it is: Residual networks add “skip paths” so information can flow straight ahead as well as through the layer’s transform.
  • How it works:
    1. Each block does: output = input + small_change.
    2. The “small_change” is controlled by a multiplier (α) so adding many blocks doesn’t explode the signal.
    3. These skip paths make deep models trainable.
  • Why it matters: Without residual paths and the right α, deeper models either drown in noise or freeze and stop learning.
  • 🍞 Anchor: In Transformers, residual connections let the model stack many attention and MLP blocks without the signal fading away.

🍞 Hook: Think of resizing pizza dough. Making it wider (bigger diameter) and thicker (deeper crust) at the same time needs a new recipe, not just more flour.

🥬 The Concept: Width–Depth Scaling

  • What it is: Growing a model in both width (neurons per layer) and depth (number of layers) together.
  • How it works:
    1. Decide how width n and depth L increase.
    2. Adjust initialization and learning rates so signals stay steady across many layers.
    3. Account for accumulation: more layers means more small changes add up.
  • Why it matters: If you only handle width or only depth, the other can still break training.
  • 🍞 Anchor: GPT family models typically scale both embedding size (width) and number of blocks (depth); the training recipe must respect both.

🍞 Hook: You know how a cookie recipe that works for 6 cookies won’t work for 600 without changing oven time and ingredient amounts?

🥬 The Concept: Hyperparameter (HP) Transferability

  • What it is: The ability to reuse tuned hyperparameters from a small model on a big one and get similar performance.
  • How it works:
    1. Choose size-aware scaling of learning rates, initialization, and multipliers.
    2. Keep feature sizes and update sizes nearly invariant as you scale.
    3. Then the “best base settings” transfer smoothly.
  • Why it matters: Retuning HPs for huge models is extremely expensive in time and money.
  • 🍞 Anchor: Tune LR once on a 256-wide, 4-layer model; apply a formula to reuse it on a 4096-wide, 64-layer model.

The world before: Researchers had μP rules that worked well when only width grew. But modern models (like GPT-2/3/4 style) scale both width and depth. People tried to extend μP using heavy mathematical tools (like Tensor Programs and mean-field theory), but the results were scattered: one rule for SGD here, a different rule for AdamW there, and special tweaks for Transformers versus other nets. That made it hard for practitioners to know what to do and for researchers to generalize to new optimizers.

The problem: As models grow in both directions, two bad things happen: feature norms drift (blow up or vanish) and the “best” learning rate changes unpredictably. That breaks HP transfer, wasting compute on retuning. Also, different residual block designs (one-layer vs two-layer internals) seemed to require different depth scalings (1/√L vs 1/L), which was confusing.

Failed attempts: Architecture-specific formulas fixed one model but didn’t generalize. Optimizer-specific tweaks worked for AdamW but not for Muon or Shampoo. Theory-heavy analyses were hard to reproduce and extend.

The gap: We needed a single, simple, optimizer-agnostic rule that tells us exactly how the sizes (spectral norms) of weights and updates must scale with width and depth—especially in residual networks—and that recovers all those scattered results as special cases.

Real stakes: Stable training and HP transfer save huge amounts of compute and time in real projects—language models, text-to-image, and more. Without a simple rule, teams either overpay for tuning or accept worse models. With it, they can scale confidently.

02 Core Idea

🍞 Hook: Imagine building a long train. Each car (layer) adds a tiny push. If each push is too big, the train speeds out of control; too small, it barely moves. There’s a “just-right” push size that depends on how many cars you have.

🥬 The Concept: Spectral μP Condition

  • What it is: A single rule that says how big weight matrices and their per-step updates should be—measured by RMS operator norm—so features stay steady while learning stays strong, even as both width and depth grow.
  • How it works (two parts):
    1. Initialization: Choose sizes of weights so the signal after each residual block remains O(1). For two-layer residual blocks, the product of the two sublayer norms times the block multiplier α must be Θ(1/L).
    2. Updates: Choose learning rates so each block’s first-order (and, when present, second-order) update contributions sum to Θ(1) across all L blocks, i.e., each block contributes about 1/L.
  • Why it matters: Without this, residual additions stack up and explode; with it, they add like balanced tiny pushes.
  • 🍞 Anchor: If your model has L=100 blocks, set each block’s effective change to about 1/100 so the total change is steady and meaningful.

Three analogies for the same idea:

  1. Water faucets: You have 100 faucets (blocks) filling a bucket (feature change). If each faucet runs full blast, the bucket overflows. The rule says: set each faucet to 1/100 so together they fill the bucket to the perfect level.
  2. Orchestra: 100 violins (blocks). If each plays loudly, it’s noise; if each plays softly at 1/100 volume, together you get a rich, balanced sound.
  3. Road trip: 100 legs of a journey. If each leg is too long, you get exhausted; too short, you don’t arrive. The rule sets each leg’s length so the total is just right.

🍞 Hook: You know how a ruler lets you compare lengths no matter the object? We need a consistent ruler for weight sizes.

🥬 The Concept: RMS Operator Norm (Spectral Size)

  • What it is: A way to measure the size of a matrix transformation fairly across different widths and depths.
  • How it works:
    1. It looks at how much a matrix can stretch a vector, but normalized by dimensions (RMS).
    2. It plays nicely with addition and multiplication (subadditivity and submultiplicativity), letting us bound feature sizes across layers.
    3. Random matrix facts tell us what sizes to expect at initialization.
  • Why it matters: If you don’t measure sizes consistently, your scaling rules won’t transfer.
  • 🍞 Anchor: Think of RMS norm as “one-size-fits-all” shoe sizing for matrices—fair across foot sizes.
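
For concreteness, here is one way this norm can be computed, assuming the usual convention that the RMS norm of a vector in R^d is its Euclidean norm divided by √d, which makes the RMS→RMS operator norm equal to the spectral norm times √(n_in/n_out). The width and seed below are arbitrary illustration choices:

```python
import numpy as np

def rms_norm(x):
    """RMS norm of a vector: ||x||_2 / sqrt(dim)."""
    return np.linalg.norm(x) / np.sqrt(x.size)

def rms_operator_norm(W):
    """RMS->RMS operator norm: the largest factor by which W can stretch
    the RMS norm of a vector. Equals spectral norm * sqrt(n_in / n_out)."""
    n_out, n_in = W.shape
    return np.linalg.norm(W, 2) * np.sqrt(n_in / n_out)

rng = np.random.default_rng(0)
n = 1024
# Gaussian init with std 1/sqrt(n): the spectral norm concentrates near 2,
# so the RMS operator norm is Theta(1) regardless of width.
W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
print(round(rms_operator_norm(W), 2))  # stays ~2 at any width
```

The point of the dimension factors is exactly the “fair ruler” property: under this normalization, the same target size (e.g. Θ(1)) means the same thing for a 256-wide and a 4096-wide matrix.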

Before vs after:

  • Before: Different μP prescriptions for SGD vs AdamW, one-layer vs two-layer blocks, complex to derive.
  • After: One spectral condition that recovers all cases: one-layer blocks imply α ~ 1/√L; two-layer blocks (and beyond) imply α ~ 1/L because second-order update terms appear and must also be controlled.

Why it works (intuition):

  • Residual networks add up many small changes. By linear algebra, the total size of these changes can be bounded using the norms of the parts. If each block contributes about 1/L, then L blocks sum to O(1): stable but not trivial.
  • In two-layer blocks, updates from the two sublayers can multiply, creating “second-order” terms. That extra multiplication means each block must be even smaller (1/L instead of 1/√L) to keep the total controlled and maximally useful.
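
The accumulation bound behind this intuition can be written in one line, using subadditivity and submultiplicativity of the RMS operator norm (a sketch in the linear case, matching the initialization rule below):

```latex
\|h_l\|_R \le \Big(1 + \alpha_l \,\|W^{(2)}_l\|_R \,\|W^{(1)}_l\|_R\Big)\,\|h_{l-1}\|_R
= \Big(1 + \tfrac{c}{L}\Big)\,\|h_{l-1}\|_R ,
\qquad\text{so}\qquad
\|h_L\|_R \le \Big(1 + \tfrac{c}{L}\Big)^{L}\,\|h_0\|_R \le e^{c}\,\|h_0\|_R .
```

So when each block's contribution is Θ(1/L), the compounded growth over L blocks is bounded by a constant, independent of depth.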

🍞 Hook: Imagine a universal adapter that lets you plug any device into any outlet.

🥬 The Concept: Unified Mapping to Hyperparameters Across Optimizers

  • What it is: A recipe to translate the spectral condition into concrete choices of learning rate, initialization scale, and residual multipliers for many optimizers.
  • How it works:
    1. Compute how big an optimizer’s one-step update is in RMS norm.
    2. Set learning rates so each block’s contribution is ~1/L.
    3. Adjust α and initialization so feature norms stay O(1).
  • Why it matters: No need for ad hoc tuning per optimizer; you follow the same logic everywhere.
  • 🍞 Anchor: For Muon-Kimi, the paper shows η_hidden should shrink like 1/√n so that updates land at the right size.

Building blocks of the idea:

  • Scale-invariant features: keep ∥h_l∥_R and ∥Δh_l∥_R around constant size.
  • Residual multiplier α: for two-layer blocks, α ~ 1/L; for one-layer, α ~ 1/√L.
  • First- vs second-order terms: first-order (single updated sublayer) and second-order (two updated sublayers) both must total to O(1) across L blocks.
  • Spectral initialization: choose weight variances so matrix norms are predictable and size-aware.
  • Optimizer mapping: compute ∥ΔW∥_R under each optimizer and pick η so α ∥ΔW∥_R matches the 1/L target per block.

Together, these pieces give a simple, math-light, unified rule that keeps training stable and makes HPs transferable as models grow.

03 Methodology

At a high level: Input → Spectral initialization (size the weights) → Residual scaling (choose α per block) → Optimizer mapping (choose learning rates so per-block updates are 1/L) → Output with stable features and strong learning.

🍞 Hook: Think of following a cooking recipe where each ingredient scales with how many guests you have. If you double the guests (depth) and widen the table (width), you scale salt, flour, and oven time precisely, or the dish fails.

🥬 The Concept: Two-Layer Residual Block (the minimal realistic block)

  • What it is: A residual block whose “main branch” has two linear layers W(1) and W(2), added to a skip connection.
  • How it works:
    1. h_l = h_{l-1} + α_l W(2)_l W(1)_l h_{l-1}.
    2. α_l controls how much the block changes the signal.
    3. Two sublayers mean update terms can multiply (second-order effects).
  • Why it matters: Transformers’ MLP/attention blocks behave like this, so this is the right minimal model to capture real scaling.
  • 🍞 Anchor: Picture two mini-steps inside each block: stretch then rotate (W(1), W(2)), then add back the original vector (skip connection).

Step A: Spectral initialization so features stay O(1)

  • Goal: Ensure ∥h_l∥_R ≈ constant for all l, even as L grows.
  • Rule for input/output layers: α_in ∥W_in∥_R = Θ(1), α_out ∥W_out∥_R = Θ(1).
  • Rule for hidden two-layer blocks: α_l ∥W(2)_l∥_R ∥W(1)_l∥_R = Θ(1/L).
  • Why this step exists: If the product doesn’t shrink with L, the L residual additions make features grow too big.
  • Example with numbers: Suppose L=100 and at init ∥W(1)_l∥_R ≈ ∥W(2)_l∥_R ≈ 1. Then set α_l ≈ 1/100 so α_l × 1 × 1 ≈ 1/100.
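
A minimal numpy sketch of this step, using a purely linear residual stack (an illustration of the accumulation effect, not the paper's full Transformer setup): with α = 1/L the output RMS stays near the input's, while α = 1 lets the L residual additions compound:

```python
import numpy as np

def forward_rms(n, L, alpha, seed=0):
    """Push a unit-RMS input through L linear residual blocks
    h <- h + alpha * W2 @ W1 @ h and return ||h_L||_RMS.
    Weights use std 1/sqrt(n), so each ||W||_RMS is Theta(1)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=n)
    h /= np.linalg.norm(h) / np.sqrt(n)  # start at RMS norm 1
    for _ in range(L):
        W1 = rng.normal(0, 1 / np.sqrt(n), (n, n))
        W2 = rng.normal(0, 1 / np.sqrt(n), (n, n))
        h = h + alpha * (W2 @ (W1 @ h))
    return np.linalg.norm(h) / np.sqrt(n)

n = 256
for L in (4, 16, 64):
    mup = forward_rms(n, L, alpha=1.0 / L)  # spectral rule: alpha = 1/L
    sp = forward_rms(n, L, alpha=1.0)       # no depth correction
    print(L, round(mup, 2), round(sp, 2))   # mup stays ~1; sp grows with L
```

Running this shows the qualitative gap: the α = 1/L column barely moves as L grows, while the uncorrected column grows roughly geometrically with depth.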

Step B: Update sizing so per-step feature change is O(1)

  • Expand one training step and collect terms:
    • Zero-order: propagation of earlier changes (doesn’t depend on the current block’s ΔW).
    • First-order: one of W(1), W(2) is updated (ΔW appears once).
    • Second-order: both W(1) and W(2) are updated (their product appears).
  • Target: The sum over L blocks of each class should be Θ(1), so each block contributes about 1/L.
  • Conditions:
    • Input/output: α_in ∥ΔW_in∥_R = Θ(1), α_out ∥ΔW_out∥_R = Θ(1).
    • Hidden first-order: α_l ∥ΔW(2)_l∥_R ∥W(1)_l∥_R = Θ(1/L) and α_l ∥W(2)_l∥_R ∥ΔW(1)_l∥_R = Θ(1/L).
    • Hidden second-order: α_l ∥ΔW(2)_l∥_R ∥ΔW(1)_l∥_R = Θ(1/L).
  • Why this step exists: Without it, per-step changes either blow up (too big) or vanish (too small), breaking learning.
  • Example: With L=100 and α_l=1/100, if ∥W(1)_l∥_R≈1, we choose ∥ΔW(2)_l∥_R≈1 so α_l × 1 × 1 ≈ 1/100; summing over 100 blocks gives ≈1 total.
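
The first/second-order split comes from multiplying out one step of the block's main branch (skip path suppressed for clarity):

```latex
\big(W^{(2)} + \Delta W^{(2)}\big)\big(W^{(1)} + \Delta W^{(1)}\big) - W^{(2)} W^{(1)}
= \underbrace{\Delta W^{(2)}\, W^{(1)} + W^{(2)}\, \Delta W^{(1)}}_{\text{first-order}}
\;+\; \underbrace{\Delta W^{(2)}\, \Delta W^{(1)}}_{\text{second-order}} .
```

Each hidden condition bounds the RMS operator norm of one of these terms, times α_l, by Θ(1/L), so each class of terms sums to Θ(1) over the L blocks. One-layer blocks have no product of two updates, which is why they escape the second-order constraint and get by with α ~ 1/√L.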

🍞 Hook: If you know how powerful your wrench is (optimizer), you choose how hard to turn (learning rate) so each bolt (block) tightens just enough.

🥬 The Concept: Mapping to Optimizer Hyperparameters (Muon-Kimi shown; others analogous)

  • What it is: Turn the spectral targets into learning rate schedules per layer type.
  • How it works for Muon-Kimi:
    1. Compute the RMS size of one step: ∥ΔW∥_R ≈ Θ(η × size_factor(n_in,n_out)).
    2. For hidden matrices in residual blocks, size_factor ≈ √n_in, so to keep ∥ΔW∥_R ≈ Θ(1), set η_hidden ≈ 1/√n.
    3. For input/output layers, set η ≈ constant so α_in ∥ΔW_in∥_R and α_out ∥ΔW_out∥_R are Θ(1).
  • Why it matters: This directly yields ratio rules like η_hidden ∝ 1/√(width).
  • 🍞 Anchor: Double the width? Then shrink the hidden-layer LR by about 1/√2 to keep updates the same size.
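
In code, the width rule is a one-liner. This is a sketch: the base learning rate below is a placeholder, and real Muon-Kimi implementations carry extra constant factors that this ignores:

```python
import math

def hidden_lr(eta_base, n, n_base):
    """Scale the hidden-layer learning rate so the per-step RMS update
    size stays Theta(1) as width grows: eta proportional to 1/sqrt(width)."""
    return eta_base / math.sqrt(n / n_base)

# Doubling the width shrinks the hidden LR by a factor of 1/sqrt(2).
print(hidden_lr(0.02, 512, 256) / 0.02)  # -> ~0.707
```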

Step C: Ratio-based scaling for easy transfer

  • Define width ratio r_n = n / n_base and depth ratio r_L = L / L_base.
  • Set block multiplier α_hidden = α_base / r_L (two-layer blocks → 1/L).
  • Set init variances so hidden ∥W∥_R ≈ 1; set output appropriately.
  • Set η_hidden = η_base / √r_n; keep η_in, η_out ≈ η_base.
  • Why this step exists: It lets you tune once on a small base model and transfer to any target size by simple formulas.
  • Example: Base (n=256, L=4) to target (n=1024, L=64): r_n=4, r_L=16. Then α_hidden scales by 1/16; η_hidden scales by 1/√4=1/2.
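
The ratio rules above can be collected into one helper. A sketch: the function name and base hyperparameter values are illustrative, not from the paper, and the input/output learning rates are simply held at the base value as described:

```python
import math

def scale_hparams(n, L, n_base=256, L_base=4, eta_base=0.02, alpha_base=1.0):
    """Map base hyperparameters to an (n, L) target via width/depth ratios.
    Two-layer residual blocks: alpha_hidden ~ 1/L, eta_hidden ~ 1/sqrt(n)."""
    r_n = n / n_base   # width ratio
    r_L = L / L_base   # depth ratio
    return {
        "alpha_hidden": alpha_base / r_L,           # residual multiplier shrinks as 1/L
        "eta_hidden": eta_base / math.sqrt(r_n),    # hidden LR shrinks as 1/sqrt(width)
        "eta_in": eta_base,                         # input/output LRs stay at base
        "eta_out": eta_base,
    }

hp = scale_hparams(1024, 64)  # r_n = 4, r_L = 16
print(hp["alpha_hidden"], hp["eta_hidden"])  # alpha scaled by 1/16, LR halved
```

This matches the worked example: going from (n=256, L=4) to (n=1024, L=64) scales α_hidden by 1/16 and η_hidden by 1/√4 = 1/2.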

Secret sauce (what makes it clever):

  • Use RMS operator norms and simple inequalities (subadditivity/multiplicativity) to set tight, easy-to-check bounds.
  • Recognize that second-order terms appear only when blocks have ≥2 sublayers; this explains the jump from 1/√L to 1/L.
  • Turn abstract constraints into plug-and-play HP rules for many optimizers, unifying old results as special cases.

Concrete data walkthrough:

  • Suppose L=100, n=4096. At init, hidden ∥W(1)∥_R ≈ ∥W(2)∥_R ≈ 1, so choose α_l≈1/100.
  • Muon-Kimi hidden step: ∥ΔW∥_R≈η√n ≈ η×64. To hit ∥ΔW∥_R≈1, use η≈1/64≈0.0156 (times a constant from the optimizer).
  • Summing over L blocks: each contributes ≈1/100; total ≈1 across the network, giving stable yet meaningful feature updates.

🍞 Anchor: Like cooking for a crowd, you multiply each ingredient by the guest ratio—but also adjust oven time (depth) so the center cooks right. These formulas are that adjustment.

04 Experiments & Results

🍞 Hook: Think of testing a trampoline. You want to see if it bounces the same no matter how many kids are on it (depth) or how wide it is. If your setup is right, the bounce (feature size) stays in a safe, steady range, and one set of instructions works for any crowd size.

🥬 The Concept: Feature-Norm Stability Test

  • What it is: A quick “coordinate check” that measures the RMS size of features at a specific point in the network as you scale width and depth.
  • How it works:
    1. Train for a few steps while increasing width n or depth L.
    2. Measure ∥h_L∥_R (RMS of the last block’s output) and see if it stays about the same.
    3. Compare μP vs standard parameterization (SP).
  • Why it matters: If ∥h_L∥_R balloons or collapses with scale under SP, but stays steady under μP, your scaling rule is working.
  • 🍞 Anchor: It’s like checking the water level in a fountain: if it shoots too high or too low as you add pumps (layers), you know the control valves (α, η) need fixing.

Setup:

  • Models: GPT-2–style Transformers.
  • Data: OpenWebText, sequence length 1024.
  • Optimizers: Hidden matrix params with Muon-Kimi; others (biases, embeddings) with AdamW.
  • Base model: width n_base=256, depth L_base=4.
  • Scaling: Use spectral μP rules—α_hidden ∝ 1/L, η_hidden ∝ 1/√n—and compare to SP.

The tests and why they matter:

  1. Feature learning stability (10 steps): This isolates the immediate dynamics. If μP holds, ∥h_L∥_R should be roughly constant as n or L grows; SP typically drifts.
  2. HP transfer (300M tokens): Tune a base learning rate once, then apply the ratio formulas to bigger models. If validation loss is best at nearly the same base LR across sizes under μP (but not under SP), transfer works.

Competition (baselines):

  • SP (standard parameterization) with Muon-Kimi and AdamW.
  • μP (this paper’s spectral rules) with the same optimizers.

Scoreboard with context:

  • Feature norms: Under SP, ∥h_L∥_R grows fast with width and depth—like a trampoline that bounces higher and higher, risking a spill. Under μP, ∥h_L∥_R stays almost flat—safe, predictable bounces.
  • Learning-rate transfer: Under SP, the “best” base LR shifts significantly when you change width—like needing a new oven temperature every time you bake a larger batch. Under μP, the best base LR stays almost the same across both width and depth—set it once, scale by the ratios, and you’re good.
  • Loss: As width and depth increase, μP consistently gets lower validation loss than SP. Think of it as μP scoring an A while SP gets a B- when class size increases.

Surprising findings:

  • SP can look okay across small depth ranges if stabilizers like LayerNorm are present—like training wheels hiding a wobbly bike. But when you remove such stabilizers or go to larger depths, SP breaks, while μP remains stable even up to L=256.
  • The spectral μP rules derived from a simplified linear analysis still work well on real GPT-2–style models, supporting the idea that the core scaling behavior is captured by these simple spectral principles.

Takeaway from experiments:

  • μP via the spectral condition turns size scaling from a guessing game into a reliable, repeatable process. It stabilizes features, preserves learning-rate optima across sizes, and improves loss as models grow, compared to SP.

🍞 Anchor: It’s like having a musical score that keeps a 10-person choir and a 100-person choir perfectly in tune with the same conductor cues. μP is those cues for scaling models.

05 Discussion & Limitations

🍞 Hook: Even the best bike needs a helmet and a map. This method is powerful, but it’s important to know where it shines and where you must be careful.

🥬 The Concept: Limits and Assumptions

  • What it is: Clear boundaries on when the spectral μP framework applies as-is.
  • How it works:
    1. The theory is derived for residual networks with fixed-size internal blocks (two-layer analysis shown; extends to k-layer blocks).
    2. It uses RMS operator norms and random-initialization assumptions; it’s most accurate near initialization and early dynamics.
    3. Extensions to nonlinearities, multiple steps, and datasets hold under mild assumptions and are empirically supported but still approximations.
  • Why it matters: Knowing the map helps you avoid overpromising or misusing the rules in unusual architectures.
  • 🍞 Anchor: If you switch from highways to mountain trails (very different architectures), you may need new tires (a recheck of assumptions).

Limitations:

  • Non-residual architectures aren’t directly covered; you’d need an analogous decomposition and may not get the same α ~ 1/L rule.
  • Extremely deep blocks with changing internal widths or exotic components could require re-deriving size factors.
  • The analysis focuses on per-step scaling; aggressive schedules, long training with heavy momentum, or extreme regularization may need careful validation.

Required resources:

  • You need to measure or estimate update magnitudes under your optimizer (often available from known formulas or quick probes).
  • Implementation support to set α, initialization scales, and per-parameter-group learning rates.

When not to use (or to adapt first):

  • Non-residual stacks without skip connections (accumulation behaves differently).
  • Architectures where “block” boundaries are unclear or where updates are dominated by non-matrix ops.
  • Cases with strong external stabilizers (e.g., aggressive normalization) that already fix scaling in a different way; re-check interactions.

Open questions:

  • Best practices for very long training with strong momentum or adaptive schedules under spectral μP.
  • How to optimally combine spectral μP with techniques like weight decay, gradient clipping, and low-rank adapters at massive scales.
  • Extending the framework to sequence-length scaling and mixture-of-experts routing, where “effective depth” can change dynamically.

Overall, the framework is robust, simple, and unifies prior results, but thoughtful application remains important when you step far outside the analyzed setting.

06 Conclusion & Future Work

Three-sentence summary:

  • This paper provides a simple, unified spectral condition that tells us exactly how big weights and their updates should be in residual networks when scaling both width and depth.
  • It explains why two-layer residual blocks require α ~ 1/L (and one-layer blocks α ~ 1/√L), unifying scattered μP results and mapping cleanly to practical hyperparameters across many optimizers.
  • Experiments on GPT-2–style models show the method stabilizes feature sizes, enables hyperparameter transfer, and improves loss compared to standard parameterization.

Main achievement: Turning a complex, optimizer- and architecture-specific problem into a single spectral rule with elementary linear algebra, then converting that rule into plug-and-play hyperparameter formulas that work in practice.

Future directions:

  • Extend the spectral μP toolkit to more architectures (e.g., attention variants, state-space models, MoE) and to training regimes with strong momentum or long schedules.
  • Co-design with normalization and regularization (LayerNorm, weight decay) to get the best of both worlds.
  • Explore sequence-length and data-distribution scaling under the same spectral lens.

Why remember this: It’s the “golden ratio” for width–depth scaling in residual networks—simple to state, easy to use, and powerful enough to unify past recipes and guide training of tomorrow’s larger models.

Practical Applications

  • Tune learning rates once on a small base model and transfer them to larger models using the provided width/depth ratios.
  • Set residual multipliers α_l to 1/L for two-layer (and deeper) residual blocks to prevent feature explosion.
  • Choose initialization variances so hidden-layer RMS operator norms are O(1), enabling stable starts.
  • For Muon-Kimi, set hidden learning rates proportional to 1/√width to hit the right update sizes.
  • Use the feature-norm coordinate check (measure ∥h_L∥_R over 10 steps) to validate correct scaling before long runs.
  • Apply the spectral mapping to other optimizers (e.g., SGD, AdamW, Shampoo) by computing their per-step ∥ΔW∥_R.
  • Adopt ratio-based HP configs (r_n, r_L) in training pipelines to automate HP transfer across model families.
  • When changing block depth (one-layer to two-layer), switch α scaling from 1/√L to 1/L to maintain stability.
  • Combine spectral μP with standard schedules (warmup + cosine decay) after verifying early-step stability.
  • Use the framework to set consistent weight decay magnitudes by matching decay-driven and gradient-driven update scales.
Tags: maximal update parametrization, μP, spectral condition, RMS operator norm, width-depth scaling, residual networks, hyperparameter transfer, Muon-Kimi, AdamW, SGD, Transformer scaling, feature norm stability, residual multiplier, first-order and second-order updates