Spectral Regularization
Key Points
- Spectral regularization controls how much a weight matrix can stretch inputs by constraining its largest singular value (spectral norm).
- It stabilizes training, limits exploding activations/gradients, and provides Lipschitz control for robustness and better generalization.
- Two common approaches are adding a spectral-norm penalty to the loss or projecting/scaling weights to keep the spectral norm below a threshold.
- The spectral norm can be efficiently approximated with power iteration, without computing a full SVD.
- In practice, spectral normalization rescales W by its estimated top singular value at each step to enforce a target Lipschitz constant.
- Compared with L2 weight decay (Frobenius norm), spectral regularization directly limits worst-case amplification rather than average energy.
- Gradients for a spectral penalty can be approximated using the top singular vectors u and v (a subgradient is \(u v^\top\)).
- Compute-accuracy trade-offs depend on the number of power iterations; a small fixed number (1–5) is often sufficient in large models.
Prerequisites
- Linear algebra fundamentals — Understanding matrices, vectors, norms, and matrix–vector multiplication is essential to reason about spectral properties.
- Eigenvalues, singular values, and SVD — Spectral regularization is defined in terms of singular values; knowing SVD clarifies the spectral norm and its vectors.
- Gradient-based optimization — Implementing penalties or projections requires computing gradients and performing iterative updates.
- Regularization in machine learning — Comparing spectral penalties with L2 weight decay and understanding bias–variance trade-offs helps in choosing methods.
- Lipschitz continuity and robustness — The main motivation is bounding the Lipschitz constant; this concept explains stability and robustness benefits.
- Iterative numerical methods — Power iteration is an iterative approximation; understanding convergence and warm starts is useful.
Detailed Explanation
01 Overview
Spectral regularization is a family of techniques that constrain the spectral properties of weight matrices, most notably their spectral norm (largest singular value). In simple terms, it limits how much a linear transformation W can stretch an input vector. This limit is directly tied to the Lipschitz constant of the transformation and therefore to training stability, robustness, and generalization. Spectral regularization can be implemented either as a penalty added to the loss function or as an explicit constraint via projection or normalization of weights. The penalty approach augments the objective with a term like \(\lambda \lVert W \rVert_2\) or \(\lambda \lVert W \rVert_2^2\), while the constraint approach rescales or projects W so its spectral norm does not exceed a target value c. Because computing exact singular values is expensive, practical implementations rely on power iteration to estimate the top singular value and its vectors efficiently. Spectral regularization is widely used in deep learning for stabilizing training of recurrent networks, controlling the Lipschitz constant in Wasserstein GAN critics, and improving robustness against adversarial perturbations, all by ensuring that layers do not excessively amplify inputs or gradients.
02 Intuition & Analogies
Imagine your model’s weight matrix as a stretchy fabric. Pull it in different directions and it stretches by different amounts. The direction where it stretches the most is the most dangerous: a tiny input change along that direction can become a big output change. Spectral regularization is like setting a maximum stretch limit on the fabric so no direction can get outrageously amplified. Another analogy is a volume knob on an amplifier. The spectral norm is the highest volume the amplifier can reach for any song (input vector). If that maximum is too high, some songs will sound distorted or blow the speakers; capping the knob keeps everything under control. In optimization terms, this cap is the Lipschitz constant: it bounds how much the output can change when the input changes. Penalty-based spectral regularization is like charging a fee the louder you go, encouraging the system to keep volumes moderate. Projection- or normalization-based methods are like an automatic limiter that turns the volume down whenever it exceeds a threshold. Compared to traditional L2 weight decay (which punishes overall weight energy), spectral regularization targets the single worst-case direction of amplification. This is why it’s particularly effective for preventing exploding activations in deep or recurrent networks and for enforcing smoothness in models that must be 1-Lipschitz, such as WGAN critics.
03 Formal Definition
Let \(W \in \mathbb{R}^{m \times n}\) with singular values \(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{\min(m,n)} \ge 0\). The spectral norm is \(\lVert W \rVert_2 = \sigma_1 = \max_{x \neq 0} \lVert Wx \rVert_2 / \lVert x \rVert_2\). Spectral regularization either (a) augments the training objective with a penalty, \(\min_W \; \mathcal{L}_{\text{data}}(W) + \lambda \lVert W \rVert_2\), or (b) enforces the hard constraint \(\lVert W \rVert_2 \le c\) by rescaling \(W \leftarrow c\,W / \lVert W \rVert_2\) whenever \(\lVert W \rVert_2 > c\). For the linear map \(f(x) = Wx\), the constraint bounds the Lipschitz constant of \(f\) by \(c\), while the penalty softly discourages large worst-case amplification.
04 When to Use
Use spectral regularization when you need tight control of the worst-case amplification of a layer or model. Typical scenarios include: (1) Stabilizing deep or recurrent networks where activations or gradients can explode; bounding \(\lVert W \rVert_2\) directly limits the step-to-step growth. (2) Enforcing Lipschitz constraints for models such as WGAN critics, where 1-Lipschitzness is required; spectral normalization achieves this by rescaling weights. (3) Improving robustness to input noise and adversarial perturbations by limiting how sensitive outputs can be to small input changes. (4) Encouraging better generalization by reducing the capacity for sharp decision boundaries that depend on extreme amplifications. Choose a penalty when you want a soft preference for small spectral norms that the optimizer can trade off against data fit. Choose projection/normalization when you must obey a hard limit (e.g., \(\lVert W \rVert_2 \le c\)) at all times. In large-scale training, prefer approximate methods (1–5 power iterations per step) to keep overhead manageable, and consider reusing the previous singular vectors as warm starts to reduce iterations.
⚠️Common Mistakes
Common pitfalls include: (1) Computing an exact SVD at every update, which is computationally prohibitive; use power iteration to approximate the top singular value and vectors. (2) Using too few or too many power iterations; too few yields a biased, under-estimated norm (weak constraint), while too many waste compute. Start with 1–3 and warm start from the previous iteration’s vectors. (3) Confusing Frobenius norm regularization with spectral norm regularization; L2 weight decay reduces total weight energy but does not bound worst-case amplification. (4) Forgetting to stop gradients through the normalization step in frameworks that require it; in pure C++ you must conceptually separate the rescaling from gradient computation or be aware that the normalization changes the effective parameterization. (5) Ignoring non-differentiability when the top singular value is not unique; the subgradient \(uv^\top\) is valid only when the top singular value is simple—otherwise, any convex combination of outer products of top singular vectors is a subgradient. (6) Applying the constraint inconsistently across layers with different scales; if you cap some layers but not others, activations may simply shift amplification. (7) Setting c too low and bottlenecking learning; monitor training loss and gradient norms to tune c or \(\lambda\). (8) Failing to renormalize after optimizer steps (for projection methods); always project after each update to maintain the constraint.
Key Formulas
Spectral Norm Definition
\(\lVert W \rVert_2 = \max_{x \neq 0} \dfrac{\lVert Wx \rVert_2}{\lVert x \rVert_2} = \sigma_{\max}(W)\)
Explanation: The spectral norm is the maximum amplification factor of W over all directions x. It equals the largest singular value of W.
Spectral Norm via Eigenvalues
\(\lVert W \rVert_2^2 = \lambda_{\max}(W^\top W)\)
Explanation: The square of the spectral norm is the largest eigenvalue of \(W^\top W\). This connects singular values of W to eigenvalues of a symmetric matrix.
Lipschitz Constant
\(\lVert f(x_1) - f(x_2) \rVert_2 \le L \, \lVert x_1 - x_2 \rVert_2\), and for a linear map \(f(x) = Wx\) the smallest such \(L\) is \(\lVert W \rVert_2\)
Explanation: The Lipschitz constant bounds how much outputs can change per unit input change. For a linear map, it equals the spectral norm.
Regularized Objective
\(\min_W \; \mathcal{L}_{\text{data}}(W) + \lambda \lVert W \rVert_2\) (or \(\lambda \lVert W \rVert_2^2\))
Explanation: Training can penalize large spectral norms by adding a regularization term to the data loss. The penalty can be linear or quadratic in the spectral norm.
Spectral Normalization
\(\bar{W} = c \, \dfrac{W}{\lVert W \rVert_2}\)
Explanation: Rescale W so that its spectral norm is exactly c. This enforces a desired Lipschitz bound per layer.
Power Iteration Updates
\(u \leftarrow \dfrac{Wv}{\lVert Wv \rVert_2}, \qquad v \leftarrow \dfrac{W^\top u}{\lVert W^\top u \rVert_2}\)
Explanation: Alternating between u and v converges to the top singular vectors. The estimate of the top singular value is \(u^\top W v\).
Subgradient of Spectral Norm
\(u_1 v_1^\top \in \partial \lVert W \rVert_2\), where \(u_1, v_1\) are the top singular vectors
Explanation: When the largest singular value is unique, the outer product of its singular vectors is a valid subgradient. It guides gradient-based optimization with a spectral penalty.
Norm Inequalities
\(\lVert W \rVert_2 \le \lVert W \rVert_F \le \sqrt{\operatorname{rank}(W)} \, \lVert W \rVert_2\)
Explanation: The Frobenius norm is an upper bound on the spectral norm. These inequalities relate average energy to worst-case amplification.
Nuclear Norm
\(\lVert W \rVert_* = \sum_i \sigma_i(W)\)
Explanation: The nuclear norm sums all singular values. It is often used to encourage low rank, unlike spectral regularization, which targets only the largest singular value.
Condition Number
\(\kappa(W) = \dfrac{\sigma_{\max}(W)}{\sigma_{\min}(W)}\)
Explanation: A high condition number indicates sensitivity to perturbations. Reducing \(\sigma_{\max}\) helps if \(\sigma_{\min}\) is fixed.
Projected Gradient Step
\(W \leftarrow \Pi_{\{\lVert W \rVert_2 \le c\}}\!\left(W - \eta \, \nabla_W \mathcal{L}(W)\right)\)
Explanation: After a gradient step, project back to the spectral-norm ball of radius c. The exact projection clips every singular value at c; a cheap, widely used surrogate simply rescales W by \(c / \lVert W \rVert_2\) whenever the norm exceeds c.
Singular Value Estimate
\(\hat{\sigma}_{\max} = u^\top W v\)
Explanation: Given approximate top singular vectors u and v, the quantity \(u^\top W v\) estimates the largest singular value. It is cheap to compute once u and v are known.
Complexity Analysis
For an \(m \times n\) weight matrix, one power-iteration round costs \(O(mn)\) (one multiplication by W and one by \(W^\top\)), so k rounds cost \(O(kmn)\). A full SVD costs \(O(mn \min(m, n))\), which is substantially more expensive for large layers. With warm starts, k can usually be kept between 1 and 5, and the normalization or projection step itself is a single \(O(mn)\) rescaling of the weights.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

using Matrix = vector<vector<double>>;
using Vector = vector<double>;

// Multiply matrix W (m x n) by vector x (n)
Vector matVec(const Matrix& W, const Vector& x) {
    size_t m = W.size();
    size_t n = W[0].size();
    Vector y(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += W[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

// Multiply W^T (n x m) by vector u (m)
Vector matTVec(const Matrix& W, const Vector& u) {
    size_t m = W.size();
    size_t n = W[0].size();
    Vector v(n, 0.0);
    for (size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < m; ++i) s += W[i][j] * u[i];
        v[j] = s;
    }
    return v;
}

double dot(const Vector& a, const Vector& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

double norm2(const Vector& x) {
    return sqrt(max(0.0, dot(x, x)));
}

void normalize(Vector& x) {
    double n = norm2(x);
    if (n < 1e-12) return; // avoid divide by zero
    for (double& xi : x) xi /= n;
}

// Power iteration to approximate top singular value and vectors.
// Returns estimated sigma; outputs u and v if provided.
double spectralNormPowerIteration(const Matrix& W, int iters, Vector* u_out, Vector* v_out) {
    size_t m = W.size();
    size_t n = W[0].size();
    // Initialize v with random values
    std::mt19937 rng(42);
    std::normal_distribution<double> dist(0.0, 1.0);
    Vector v(n);
    for (size_t j = 0; j < n; ++j) v[j] = dist(rng);
    normalize(v);

    Vector u(m, 0.0);
    for (int t = 0; t < iters; ++t) {
        u = matVec(W, v);
        normalize(u);
        v = matTVec(W, u);
        normalize(v);
    }
    // Rayleigh quotient estimate of sigma
    Vector Wv = matVec(W, v);
    double sigma = norm2(Wv); // equivalent to u^T W v when u ~ Wv/||Wv||

    if (u_out) *u_out = u;
    if (v_out) *v_out = v;
    return sigma;
}

int main() {
    // Example matrix (m x n)
    size_t m = 4, n = 3;
    Matrix W(m, vector<double>(n, 0.0));
    // Fill W with some values
    double val = 0.1;
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            W[i][j] = val += 0.17; // arbitrary nontrivial entries

    Vector u, v;
    double sigma = spectralNormPowerIteration(W, 20, &u, &v);

    cout << fixed << setprecision(6);
    cout << "Estimated spectral norm (sigma_max): " << sigma << "\n";
    cout << "u (top left singular vector): ";
    for (double x : u) cout << x << ' ';
    cout << "\n";
    cout << "v (top right singular vector): ";
    for (double x : v) cout << x << ' ';
    cout << "\n";
    return 0;
}
```
This program estimates the largest singular value (spectral norm) and its corresponding singular vectors using power iteration. It alternates normalizing Wv and W^T u. The final estimate of σ_max is ||W v||_2, which equals u^T W v when u is aligned with Wv. This avoids the cost of a full SVD.
```cpp
#include <bits/stdc++.h>
using namespace std;

using Matrix = vector<vector<double>>;
using Vector = vector<double>;

Vector matVec(const Matrix& W, const Vector& x) {
    size_t m = W.size(), n = W[0].size();
    Vector y(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += W[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

Vector matTVec(const Matrix& W, const Vector& u) {
    size_t m = W.size(), n = W[0].size();
    Vector v(n, 0.0);
    for (size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < m; ++i) s += W[i][j] * u[i];
        v[j] = s;
    }
    return v;
}

Matrix outer(const Vector& a, const Vector& b) {
    size_t m = a.size(), n = b.size();
    Matrix M(m, vector<double>(n));
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) M[i][j] = a[i] * b[j];
    return M;
}

double dot(const Vector& a, const Vector& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

double norm2(const Vector& x) { return sqrt(max(0.0, dot(x, x))); }

void normalize(Vector& x) {
    double n = norm2(x);
    if (n < 1e-12) return;
    for (double& xi : x) xi /= n;
}

// Power iteration returning sigma and (u, v); u and v are warm-started
// by the caller across invocations.
double spectralNormUV(const Matrix& W, int iters, Vector& u, Vector& v) {
    size_t m = W.size(), n = W[0].size();
    if (v.size() != n) v.assign(n, 0.0);
    if (u.size() != m) u.assign(m, 0.0);
    static std::mt19937 rng(123);
    std::normal_distribution<double> dist(0.0, 1.0);
    if (all_of(v.begin(), v.end(), [](double x){ return x == 0.0; })) {
        for (size_t j = 0; j < n; ++j) v[j] = dist(rng);
        normalize(v);
    }
    for (int t = 0; t < iters; ++t) {
        u = matVec(W, v); normalize(u);
        v = matTVec(W, u); normalize(v);
    }
    Vector Wv = matVec(W, v);
    double sigma = norm2(Wv);
    return sigma;
}

int main() {
    ios::sync_with_stdio(false);
    // Synthetic linear regression: y ~ W x + noise
    size_t in_dim = 5, out_dim = 3, n_samples = 200;
    Matrix X(n_samples, Vector(in_dim));
    Matrix Y(n_samples, Vector(out_dim));
    std::mt19937 rng(7);
    std::normal_distribution<double> nx(0.0, 1.0), nnoise(0.0, 0.1);

    // True weights
    Matrix W_true(out_dim, Vector(in_dim));
    for (size_t i = 0; i < out_dim; ++i)
        for (size_t j = 0; j < in_dim; ++j)
            W_true[i][j] = (i + 1) * 0.4 + (j + 1) * 0.2;

    // Generate data
    for (size_t s = 0; s < n_samples; ++s) {
        for (size_t j = 0; j < in_dim; ++j) X[s][j] = nx(rng);
        Vector y = matVec(W_true, X[s]);
        for (size_t i = 0; i < out_dim; ++i) Y[s][i] = y[i] + nnoise(rng);
    }

    // Model weights (to learn)
    Matrix W(out_dim, Vector(in_dim, 0.0));

    double lr = 0.05;
    double lambda = 0.1;
    int epochs = 200;
    int pow_iters = 2; // small k with warm starts
    Vector u(out_dim, 0.0), v(in_dim, 0.0); // warm-start vectors for power iteration

    for (int ep = 0; ep < epochs; ++ep) {
        // Compute MSE gradient over full dataset
        Matrix grad(out_dim, Vector(in_dim, 0.0));
        double mse = 0.0;
        for (size_t s = 0; s < n_samples; ++s) {
            const Vector& x = X[s];
            const Vector& y = Y[s];
            Vector yhat = matVec(W, x);
            Vector err(out_dim);
            for (size_t i = 0; i < out_dim; ++i) {
                err[i] = yhat[i] - y[i];
                mse += err[i] * err[i];
            }
            Matrix g = outer(err, x); // d/dW of (1/2)||Wx - y||^2 = err * x^T
            for (size_t i = 0; i < out_dim; ++i)
                for (size_t j = 0; j < in_dim; ++j)
                    grad[i][j] += g[i][j];
        }
        mse /= n_samples; // average per sample (without 1/2 for simplicity)
        // Average gradient with factor 2 due to d||e||^2 = 2 e de
        for (size_t i = 0; i < out_dim; ++i)
            for (size_t j = 0; j < in_dim; ++j)
                grad[i][j] *= (2.0 / n_samples);

        // Spectral penalty: lambda * ||W||_2, subgradient approx = lambda * u v^T
        double sigma = spectralNormUV(W, pow_iters, u, v);
        Matrix pen = outer(u, v);
        for (size_t i = 0; i < out_dim; ++i)
            for (size_t j = 0; j < in_dim; ++j)
                grad[i][j] += lambda * pen[i][j];

        // Gradient descent step
        for (size_t i = 0; i < out_dim; ++i)
            for (size_t j = 0; j < in_dim; ++j)
                W[i][j] -= lr * grad[i][j];

        if ((ep + 1) % 50 == 0 || ep == 0) {
            cout << "Epoch " << (ep + 1) << ": MSE=" << mse
                 << ", sigma_max(W)~" << sigma << "\n";
        }
    }

    // Report final spectral norm
    Vector uu, vv;
    double final_sigma = spectralNormUV(W, 10, uu, vv);
    cout << fixed << setprecision(6)
         << "Final estimated sigma_max(W): " << final_sigma << "\n";
    return 0;
}
```
This example learns a linear map W for regression while penalizing its spectral norm. It estimates the top singular vectors via a few power iterations per epoch (warm-started for speed). The subgradient of the penalty is approximated as u v^T and added to the MSE gradient, encouraging a smaller spectral norm while fitting the data.
```cpp
#include <bits/stdc++.h>
using namespace std;

using Matrix = vector<vector<double>>;
using Vector = vector<double>;

Vector matVec(const Matrix& W, const Vector& x) {
    size_t m = W.size(), n = W[0].size();
    Vector y(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += W[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

Vector matTVec(const Matrix& W, const Vector& u) {
    size_t m = W.size(), n = W[0].size();
    Vector v(n, 0.0);
    for (size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < m; ++i) s += W[i][j] * u[i];
        v[j] = s;
    }
    return v;
}

double dot(const Vector& a, const Vector& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

double norm2(const Vector& x) { return sqrt(max(0.0, dot(x, x))); }

void normalize(Vector& x) {
    double n = norm2(x);
    if (n < 1e-12) return;
    for (double& xi : x) xi /= n;
}

double spectralNorm(const Matrix& W, int iters) {
    size_t m = W.size(), n = W[0].size();
    Vector v(n, 0.0), u(m, 0.0);
    std::mt19937 rng(9);
    std::normal_distribution<double> d(0.0, 1.0);
    for (size_t j = 0; j < n; ++j) v[j] = d(rng);
    normalize(v);
    for (int t = 0; t < iters; ++t) {
        u = matVec(W, v); normalize(u);
        v = matTVec(W, u); normalize(v);
    }
    Vector Wv = matVec(W, v);
    return norm2(Wv);
}

// Rescale W so its estimated spectral norm does not exceed c.
void projectSpectralNorm(Matrix& W, double c, int iters = 3) {
    double sigma = spectralNorm(W, iters);
    if (sigma > c && sigma > 1e-12) {
        double scale = c / sigma;
        for (auto& row : W)
            for (double& w : row) w *= scale;
    }
}

int main() {
    size_t m = 6, n = 4;
    Matrix W(m, vector<double>(n));
    std::mt19937 rng(1);
    std::normal_distribution<double> d(0.0, 1.0);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) W[i][j] = d(rng);
    double c = 1.5;

    double before = spectralNorm(W, 15);
    projectSpectralNorm(W, c, 5);
    double after = spectralNorm(W, 15);

    cout << fixed << setprecision(6)
         << "Before projection, sigma_max(W)~" << before << "\n"
         << "After projection, sigma_max(W)~" << after << " (target c=" << c << ")\n";
    return 0;
}
```
This snippet enforces a hard spectral-norm constraint by rescaling W if its estimated spectral norm exceeds c. It demonstrates the projection step commonly used in projected gradient methods and spectral normalization layers.