Loss Landscape Analysis
Key Points
- A loss landscape is the "terrain" of a model's loss as you move through parameter space; valleys are good solutions and peaks are bad ones.
- We analyze landscapes using low-dimensional slices around a trained point to visualize curvature, flatness, and saddles.
- Flat minima often correlate with better generalization; sharp minima can fit training data but fail on unseen data.
- Practical analysis includes grid evaluations, line searches, gradient/Hessian-based curvature estimates, and random-direction projections.
- Correct scaling and normalization of directions are crucial so that axes are comparable and plots are meaningful.
- Loss landscapes are high-dimensional, so any 2D plot is only a projection and can hide or exaggerate features.
- You can prototype analysis in C++ by evaluating loss on grids and along directions and by estimating top Hessian eigenvalues.
- Compute and export CSV data from C++ and plot with external tools (Python/Matplotlib, gnuplot) for clear visualizations.
Prerequisites
- Multivariable calculus (gradients and Hessians) – Loss landscape geometry relies on understanding derivatives and curvature.
- Linear algebra (vectors, matrices, eigenvalues) – Directions, orthonormalization, and Hessian eigen-analysis require these tools.
- Optimization basics (gradient descent) – Landscape analysis interprets and diagnoses the behavior of optimizers.
- Probability and statistics – Loss functions are empirical risks; noise and generalization require statistical thinking.
- Logistic and linear regression – Provide simple, convex examples where landscapes can be computed exactly.
- Numerical stability and scaling – Normalization of directions and safe evaluations prevent misleading plots.
- Data handling and file I/O in C++ – We export CSV grids from C++ for visualization.
- Plotting tools (external) – Visualizing the exported CSV with heatmaps/contours completes the analysis.
Detailed Explanation
Overview
Hook: Imagine hiking in fog across unknown mountains. Your goal is to find the lowest valley, but you can only feel the slope beneath your feet. That's what optimization feels like when training machine learning models. Concept: The loss landscape is the function that maps every possible setting of a model's parameters to a number measuring how badly the model performs. Visualizing or analyzing this surface helps us understand why optimization methods succeed or get stuck, and why some solutions generalize better than others. Because modern models have millions of parameters, the full surface is impossible to draw, but we can still study meaningful 1D and 2D slices, curvature, and local geometry. Example: After training a logistic regression classifier, we can pick two random directions in parameter space and evaluate the loss on a small grid around the found parameters. The resulting heatmap might reveal a wide, flat basin (good) or a sharp pit (risky for generalization).
Intuition & Analogies
Hook: Picture a drone scanning a landscape: smooth rolling hills are easy to traverse, while jagged cliffs and narrow ravines are dangerous and unpredictable. Training a model is like flying that drone using only local slope (the gradient) information. Concept: Flat valleys in the terrain mean the loss doesn't change much if you move a little, so the solution is robust to small perturbations, often implying better generalization. Sharp valleys (or spikes) mean tiny moves can worsen performance drastically; they usually arise from overfitting or overly aggressive training. Saddles are like passes between mountains: the slope is zero, but you can go down in some directions and up in others, confusing simple optimizers. Because we can't see in 1,000,000 dimensions, we use projections: walk along one direction (1D line) or a plane spanned by two directions (2D) and record the loss. Normalizing these directions is like choosing consistent step sizes so that the map scale is fair. Example: Take parameters θ at the end of training. Choose a random unit vector d and plot loss versus α in L(θ + αd). If the curve is shallow near α=0, the minimum is flat; if it spikes quickly, the minimum is sharp. Repeat with two orthonormal directions d1 and d2 to get a 2D heatmap that looks like a basin (flat) or a crater (sharp).
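To make the 1D probe concrete, here is a minimal C++ sketch of evaluating f(α) = L(θ + αd) along one normalized random direction. The loss_at function and the toy quadratic inside it are stand-ins for whatever model and loss you are actually studying; swap in your own evaluation.

#include <bits/stdc++.h>
using namespace std;

// Hypothetical stand-in for your model's loss; replace with your own evaluation.
double loss_at(const vector<double>& theta) {
    // Toy anisotropic quadratic bowl: sharp along dimension 0, flat along dimension 1.
    return 10.0 * theta[0] * theta[0] + 0.1 * theta[1] * theta[1];
}

int main() {
    std::mt19937 rng(7);
    std::normal_distribution<double> N(0.0, 1.0);

    vector<double> theta = {0.0, 0.0};   // pretend these are the trained parameters
    int dim = (int)theta.size();

    // Draw a random direction d and normalize it to unit length
    vector<double> d(dim);
    double s = 0.0;
    for (double& v : d) { v = N(rng); s += v * v; }
    for (double& v : d) v /= sqrt(s);

    // Evaluate f(alpha) = L(theta + alpha * d) along the line and print CSV rows
    for (int i = 0; i <= 40; ++i) {
        double alpha = -1.0 + 2.0 * i / 40.0;
        vector<double> th = theta;
        for (int k = 0; k < dim; ++k) th[k] += alpha * d[k];
        cout << alpha << "," << loss_at(th) << "\n";  // columns: alpha,loss
    }
    return 0;
}

A shallow curve near α = 0 indicates a flat minimum along that direction; a steep rise indicates sharpness.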
Formal Definition
Given a dataset \{(x_i, y_i)\}_{i=1}^{n}, a model f(x;\theta) with parameters \theta \in \mathbb{R}^{d}, and a per-example loss \ell, the loss landscape is the function L : \mathbb{R}^{d} \to \mathbb{R} defined by the empirical risk L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i;\theta), y_i). Analyzing the landscape means studying the geometry of this surface: its critical points (where \nabla L(\theta) = 0), the curvature encoded by the Hessian \nabla^2 L(\theta), and low-dimensional restrictions such as f(\alpha) = L(\theta^{*} + \alpha d).
When to Use
Hook: If training stalls, overfits, or behaves erratically, the shape of the landscape can reveal why. Concept: Use loss landscape analysis to diagnose optimization issues (plateaus, saddles), compare optimizers (SGD vs. Adam), tune regularization and learning rates, and assess robustness. It's also useful in research to understand why certain architectures (e.g., residual networks) are easier to optimize and why flat minima often generalize better. Visual patterns, such as wide basins versus needle-like pits, inform model and hyperparameter choices. Example:
- During hyperparameter tuning, plot 1D loss curves along random directions around the final solution; if curves are spiky, lower the learning rate or increase weight decay.
- When comparing two checkpoints, draw a linear interpolation curve L((1-t)\theta_{A} + t\theta_{B}) to see whether the path is barrier-free (compatible minima) or has a peak (different basins); a sketch of this probe follows below.
- For a small logistic regression or MLP, compute the top Hessian eigenvalue to quantify sharpness numerically.
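Below is a minimal sketch of the checkpoint-interpolation probe from the second bullet. The vectors theta_A and theta_B and the loss_at function are hypothetical placeholders; in practice they would come from two saved checkpoints and your model's loss.

#include <bits/stdc++.h>
using namespace std;

// Hypothetical loss evaluation; substitute your own model's loss.
double loss_at(const vector<double>& theta) {
    double dw = theta[0] - 3.0, db = theta[1] - 2.0;  // toy bowl centered at (3, 2)
    return dw * dw + 0.5 * db * db;
}

int main() {
    // Pretend these are two trained checkpoints (e.g., from different seeds).
    vector<double> theta_A = {2.5, 1.8};
    vector<double> theta_B = {3.6, 2.4};

    // Evaluate L((1-t)*theta_A + t*theta_B) for t in [0, 1].
    // A bump in this curve suggests the two solutions sit in different basins.
    for (int i = 0; i <= 20; ++i) {
        double t = (double)i / 20.0;
        vector<double> th(theta_A.size());
        for (size_t k = 0; k < th.size(); ++k)
            th[k] = (1.0 - t) * theta_A[k] + t * theta_B[k];
        cout << t << "," << loss_at(th) << "\n";  // columns: t,loss
    }
    return 0;
}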
⚠️ Common Mistakes
Hook: Pretty plots can mislead if produced carelessly. Concept: Common pitfalls include (1) unnormalized directions, which distort axes and exaggerate curvature; (2) grids that are too coarse and miss narrow structures; (3) relying solely on training loss while ignoring validation loss; (4) interpreting a single 2D slice as the whole story; (5) stochastic noise from mini-batches masking the true geometry; and (6) parameter symmetries (such as neuron permutations or scale invariances) that create deceptive flatness or ridges. Example:
- If you plot g(\alpha,\beta) without orthonormalizing d1 and d2, the heatmap may look skewed, falsely suggesting anisotropy.
- Using a small batch for evaluation adds noise; recompute the loss on the full dataset for clean plots.
- Two networks with BatchNorm have scale invariances; compare landscapes only after applying appropriate normalization so axes correspond to meaningful perturbation sizes (a simple direction-scaling sketch follows this list).
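One simple remedy for the scale issues above is to rescale a random direction so its norm matches the norm of the parameters it perturbs. This is a whole-vector simplification of the per-filter normalization used for scale-invariant networks; treat it as a sketch, and note that scale_matched_direction is an illustrative helper name, not an established API.

#include <bits/stdc++.h>
using namespace std;

// Rescale a random direction so ||d|| equals ||theta||. This keeps the axis scale
// of a slice tied to the size of the parameters being perturbed.
vector<double> scale_matched_direction(const vector<double>& theta, std::mt19937& rng) {
    std::normal_distribution<double> N(0.0, 1.0);
    vector<double> d(theta.size());
    double nd = 0.0, nt = 0.0;
    for (size_t i = 0; i < d.size(); ++i) { d[i] = N(rng); nd += d[i] * d[i]; }
    for (double t : theta) nt += t * t;
    double scale = sqrt(nt) / sqrt(max(nd, 1e-18));  // ||d|| becomes ||theta||
    for (double& v : d) v *= scale;
    return d;
}

int main() {
    std::mt19937 rng(3);
    vector<double> theta = {0.8, -1.2, 2.5};          // pretend trained parameters
    vector<double> d = scale_matched_direction(theta, rng);
    double nd = 0.0; for (double v : d) nd += v * v;
    cout << "||d|| = " << sqrt(nd) << " (matches ||theta||)\n";
    return 0;
}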
Key Formulas
Empirical risk
L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i;\theta),\, y_i\big)
Explanation: This defines the loss as the average of per-example losses over the training set. It is the primary surface we study during optimization.
Gradient and Hessian
\nabla L(\theta) = \Big(\frac{\partial L}{\partial \theta_1}, \dots, \frac{\partial L}{\partial \theta_d}\Big), \qquad H(\theta) = \nabla^2 L(\theta), \quad H_{jk} = \frac{\partial^2 L}{\partial \theta_j \,\partial \theta_k}
Explanation: The gradient gives the direction of steepest increase of loss, while the Hessian encodes curvature in all directions. They determine local geometry near any point.
Second-order Taylor approximation
L(\theta + \delta) \approx L(\theta) + \nabla L(\theta)^{\top}\delta + \tfrac{1}{2}\,\delta^{\top} H(\theta)\,\delta
Explanation: Near a point, loss changes are approximated by a linear term plus a quadratic curvature term. At minima where the gradient is small, curvature dominates local behavior.
1D/2D slices
f(\alpha) = L(\theta^{*} + \alpha d), \qquad g(\alpha, \beta) = L(\theta^{*} + \alpha d_1 + \beta d_2), \quad \|d_1\| = \|d_2\| = 1, \; d_1^{\top} d_2 = 0
Explanation: These restrict the high-dimensional surface to a line or plane using normalized directions. They make visualization and interpretation feasible.
Spectral sharpness
\lambda_{\max}(H) = \max_{\|v\|=1} v^{\top} H v
Explanation: The largest eigenvalue of the Hessian equals the maximum quadratic curvature along any unit direction. It is a principled measure of sharpness.
\epsilon-sharpness
\max_{\|\delta\| \le \epsilon} \big(L(\theta^{*} + \delta) - L(\theta^{*})\big) \approx \tfrac{1}{2}\,\epsilon^{2}\,\lambda_{\max}(H)
Explanation: The worst-case loss increase within a small ball scales with the top Hessian eigenvalue. This connects curvature to robustness of a solution.
Gram–Schmidt for directions
d_1 = \frac{u_1}{\|u_1\|}, \qquad d_2 = \frac{u_2 - (u_2^{\top} d_1)\, d_1}{\big\|u_2 - (u_2^{\top} d_1)\, d_1\big\|}
Explanation: This orthonormalizes two random directions so that axes in the 2D slice are perpendicular and scaled equally. It avoids distortions in the heatmap.
Linear regression MSE
L(w, b) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - (w x_i + b)\big)^2
Explanation: For 1D inputs, the loss is the mean squared error between true targets and line predictions. Its landscape over slope and intercept is a convex bowl.
Logistic loss
L(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big], \qquad p_i = \sigma(\theta^{\top}\tilde{x}_i), \quad \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \tilde{x}_i = [x_i;\, 1]
Explanation: Binary cross-entropy for logistic regression defines a convex surface in parameters. It is useful for illustrating slices and curvature analytically.
Logistic regression Hessian
H(\theta) = \frac{1}{n}\sum_{i=1}^{n} p_i (1 - p_i)\, \tilde{x}_i \tilde{x}_i^{\top}
Explanation: The Hessian equals a data-weighted covariance where weights are p(1-p). It is positive semi-definite, making the loss convex.
Complexity Analysis
Evaluating the loss once over n examples with d parameters costs O(nd), so a G x G grid or 2D slice costs O(G^2 nd) (e.g., 101 x 101 evaluations in the examples below). Forming the explicit logistic-regression Hessian costs O(nd^2) and each power-iteration step costs O(d^2), which is practical only for small d; for larger models, Hessian-vector products estimate the top eigenvalue without forming H.
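For that matrix-free case, the top eigenvalue can be estimated with power iteration using finite-difference Hessian-vector products, Hv ≈ (∇L(θ + εv) − ∇L(θ − εv)) / (2ε). A minimal sketch follows; grad_at is a hypothetical placeholder (here the gradient of a toy quadratic stands in for your model's gradient routine).

#include <bits/stdc++.h>
using namespace std;

// Hypothetical gradient of a toy quadratic loss L(theta) = 0.5 * theta^T A theta;
// replace with your model's gradient routine.
vector<double> grad_at(const vector<double>& theta) {
    static const vector<vector<double>> A = {{4.0, 1.0}, {1.0, 0.5}};
    vector<double> g(theta.size(), 0.0);
    for (size_t i = 0; i < theta.size(); ++i)
        for (size_t j = 0; j < theta.size(); ++j)
            g[i] += A[i][j] * theta[j];
    return g;
}

// Finite-difference Hessian-vector product: Hv ~= (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps)
vector<double> hvp(const vector<double>& theta, const vector<double>& v, double eps = 1e-4) {
    vector<double> tp = theta, tm = theta;
    for (size_t i = 0; i < theta.size(); ++i) { tp[i] += eps * v[i]; tm[i] -= eps * v[i]; }
    vector<double> gp = grad_at(tp), gm = grad_at(tm), out(theta.size());
    for (size_t i = 0; i < theta.size(); ++i) out[i] = (gp[i] - gm[i]) / (2.0 * eps);
    return out;
}

int main() {
    std::mt19937 rng(11);
    std::normal_distribution<double> N(0.0, 1.0);
    vector<double> theta = {0.3, -0.7};        // point at which sharpness is measured
    vector<double> v(theta.size());
    for (double& x : v) x = N(rng);

    double lambda = 0.0;
    for (int t = 0; t < 100; ++t) {            // power iteration with matrix-free Hv
        vector<double> Hv = hvp(theta, v);
        double nrm = 0.0; for (double x : Hv) nrm += x * x; nrm = sqrt(max(nrm, 1e-18));
        for (size_t i = 0; i < v.size(); ++i) v[i] = Hv[i] / nrm;
        lambda = nrm;                          // converges to lambda_max for this PSD matrix
    }
    cout << "Estimated top Hessian eigenvalue: " << lambda << "\n";  // ~4.27 for this toy A
    return 0;
}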
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Compute Mean Squared Error for y = w*x + b on dataset (x[i], y[i]).
double mse_loss(const vector<double>& x, const vector<double>& y, double w, double b) {
    const int n = (int)x.size();
    double sumsq = 0.0;
    for (int i = 0; i < n; ++i) {
        double pred = w * x[i] + b;
        double diff = y[i] - pred;
        sumsq += diff * diff;
    }
    return sumsq / n;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // 1) Generate synthetic linear data y = 3x + 2 + noise
    int n = 200;
    vector<double> x(n), y(n);
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> ux(-2.0, 2.0);
    std::normal_distribution<double> noise(0.0, 0.4);
    for (int i = 0; i < n; ++i) {
        x[i] = ux(rng);
        y[i] = 3.0 * x[i] + 2.0 + noise(rng);
    }

    // 2) Define a grid over (w, b)
    int G = 101; // grid resolution per axis
    double w_min = 0.0, w_max = 6.0;
    double b_min = -1.0, b_max = 5.0;

    // 3) Evaluate loss on grid and write CSV for plotting
    ofstream out("linear_surface.csv");
    out << "w,b,loss\n";

    double best_w = 0, best_b = 0, best_loss = numeric_limits<double>::infinity();

    for (int i = 0; i < G; ++i) {
        double w = w_min + (w_max - w_min) * i / (G - 1);
        for (int j = 0; j < G; ++j) {
            double b = b_min + (b_max - b_min) * j / (G - 1);
            double L = mse_loss(x, y, w, b);
            out << w << "," << b << "," << L << "\n";
            if (L < best_loss) {
                best_loss = L; best_w = w; best_b = b;
            }
        }
    }
    out.close();

    cerr << fixed << setprecision(6);
    cerr << "Best on grid: w=" << best_w << ", b=" << best_b << ", loss=" << best_loss << "\n";
    cerr << "CSV written to linear_surface.csv (columns: w,b,loss). Plot with your preferred tool.\n";

    return 0;
}
This program creates a simple 1D regression dataset and evaluates the mean squared error on a 2D grid of slope (w) and intercept (b). It writes a CSV file for external visualization, revealing a convex bowl-shaped surface. The grid minimum approximates the optimal least-squares parameters.
#include <bits/stdc++.h>
using namespace std;

struct Example { vector<double> x; int y; };

// Sigmoid function
static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

// Compute logistic (binary cross-entropy) loss given parameters theta = [w1, w2, ..., b]
double logistic_loss(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1; // last element is bias b
    double loss = 0.0;
    for (const auto& e : data) {
        double z = theta.back(); // bias
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        p1 = min(max(p1, 1e-12), 1.0 - 1e-12); // numeric safety
        loss += -(e.y ? log(p1) : log(1.0 - p1));
    }
    return loss / data.size();
}

// Gradient of logistic loss
vector<double> logistic_grad(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1;
    vector<double> g(p + 1, 0.0);
    for (const auto& e : data) {
        double z = theta.back();
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        double diff = (p1 - e.y); // derivative of BCE wrt z
        for (int j = 0; j < p; ++j) g[j] += diff * e.x[j];
        g.back() += diff; // bias term
    }
    for (double& v : g) v /= data.size();
    return g;
}

// Train logistic regression via gradient descent
vector<double> train_logreg(const vector<Example>& data, int p, int iters = 2000, double lr = 0.5) {
    vector<double> theta(p + 1, 0.0); // initialize to zeros
    for (int t = 0; t < iters; ++t) {
        vector<double> g = logistic_grad(data, theta);
        double eta = lr / sqrt(1.0 + t * 0.01); // mild decay
        for (int j = 0; j <= p; ++j) theta[j] -= eta * g[j];
    }
    return theta;
}

// Orthonormalize two random directions using Gram-Schmidt
pair<vector<double>, vector<double>> two_orthonormal_directions(int dim, std::mt19937& rng) {
    std::normal_distribution<double> N(0.0, 1.0);
    vector<double> d1(dim), d2(dim);
    for (int i = 0; i < dim; ++i) { d1[i] = N(rng); d2[i] = N(rng); }
    auto norm = [](const vector<double>& v){ double s = 0; for (double x : v) s += x * x; return sqrt(max(1e-18, s)); };
    auto dot = [](const vector<double>& a, const vector<double>& b){ double s = 0; for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i]; return s; };

    // Normalize d1
    double n1 = norm(d1); for (double& x : d1) x /= n1;
    // Make d2 orthogonal to d1
    double proj = dot(d2, d1);
    for (int i = 0; i < dim; ++i) d2[i] -= proj * d1[i];
    // Normalize d2
    double n2 = norm(d2); for (double& x : d2) x /= n2;
    return {d1, d2};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // 1) Create a separable 2D dataset (two Gaussian blobs)
    std::mt19937 rng(123);
    int n_per_class = 200;
    vector<Example> data; data.reserve(2 * n_per_class);
    normal_distribution<double> A1( 2.0, 1.0), A2( 2.0, 1.0);
    normal_distribution<double> B1(-2.0, 1.0), B2(-2.0, 1.0);
    for (int i = 0; i < n_per_class; ++i) {
        data.push_back({{A1(rng), A2(rng)}, 1});
        data.push_back({{B1(rng), B2(rng)}, 0});
    }

    // 2) Train logistic regression (parameters: w1, w2, b)
    int p = 2;
    vector<double> theta = train_logreg(data, p, 1500, 0.4);
    cerr << fixed << setprecision(6);
    cerr << "Trained loss = " << logistic_loss(data, theta) << "\n";

    // 3) Build two orthonormal directions in R^3 around theta
    auto [d1, d2] = two_orthonormal_directions(p + 1, rng);

    // 4) Evaluate 2D slice g(alpha, beta) = L(theta + alpha*d1 + beta*d2)
    int G = 101;    // grid resolution per axis
    double r = 1.5; // radius along each direction
    ofstream out("logreg_slice.csv");
    out << "alpha,beta,loss\n";
    for (int i = 0; i < G; ++i) {
        double alpha = -r + 2 * r * i / (G - 1);
        for (int j = 0; j < G; ++j) {
            double beta = -r + 2 * r * j / (G - 1);
            vector<double> th = theta;
            for (int k = 0; k <= p; ++k) th[k] += alpha * d1[k] + beta * d2[k];
            double L = logistic_loss(data, th);
            out << alpha << "," << beta << "," << L << "\n";
        }
    }
    out.close();
    cerr << "2D slice written to logreg_slice.csv (columns: alpha,beta,loss).\n";

    return 0;
}
We generate two Gaussian clusters, train a 2D logistic regression, and then probe the landscape by evaluating a 2D slice around the trained parameters along two orthonormal directions. The resulting CSV can be plotted as a heatmap or contour plot, revealing local flatness or sharpness of the found solution.
#include <bits/stdc++.h>
using namespace std;

struct Example { vector<double> x; int y; };

static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

double logistic_loss(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1;
    double loss = 0.0;
    for (const auto& e : data) {
        double z = theta.back();
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        p1 = min(max(p1, 1e-12), 1.0 - 1e-12);
        loss += -(e.y ? log(p1) : log(1.0 - p1));
    }
    return loss / data.size();
}

// Compute the Hessian for logistic regression explicitly:
// H = (1/n) sum p(1-p) a a^T, where a = [x; 1] includes the bias as an extra feature.
vector<vector<double>> logistic_hessian(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1; // number of features (bias excluded)
    int d = p + 1;                 // augmented dimension including bias
    vector<vector<double>> H(d, vector<double>(d, 0.0));
    for (const auto& e : data) {
        // augmented feature a = [x; 1]
        vector<double> a(d);
        for (int j = 0; j < p; ++j) a[j] = e.x[j];
        a[p] = 1.0;
        double z = theta[p];
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        double w = p1 * (1.0 - p1);
        for (int i = 0; i < d; ++i)
            for (int j = 0; j < d; ++j)
                H[i][j] += w * a[i] * a[j];
    }
    double invn = 1.0 / data.size();
    for (auto& row : H) for (double& x : row) x *= invn;
    return H;
}

// Power iteration to estimate the largest eigenvalue of a symmetric matrix H
pair<double, vector<double>> top_eigenpair(const vector<vector<double>>& H, int iters = 100) {
    int d = (int)H.size();
    std::mt19937 rng(777);
    std::normal_distribution<double> N(0.0, 1.0);
    vector<double> v(d);
    for (int i = 0; i < d; ++i) v[i] = N(rng);
    auto norm2 = [](const vector<double>& x){ double s = 0; for (double v : x) s += v * v; return sqrt(max(1e-18, s)); };

    for (int t = 0; t < iters; ++t) {
        vector<double> Hv(d, 0.0);
        for (int i = 0; i < d; ++i) for (int j = 0; j < d; ++j) Hv[i] += H[i][j] * v[j];
        double nrm = norm2(Hv);
        for (int i = 0; i < d; ++i) v[i] = Hv[i] / nrm;
    }
    // Rayleigh quotient as eigenvalue estimate
    double num = 0.0, den = 0.0;
    vector<double> Hv(d, 0.0);
    for (int i = 0; i < d; ++i) for (int j = 0; j < d; ++j) Hv[i] += H[i][j] * v[j];
    for (int i = 0; i < d; ++i) { num += v[i] * Hv[i]; den += v[i] * v[i]; }
    double lambda = num / den;
    return {lambda, v};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create a simple two-blob dataset
    std::mt19937 rng(42);
    int n = 400;
    normal_distribution<double> A1( 2.0, 1.0), A2( 2.0, 1.0);
    normal_distribution<double> B1(-2.0, 1.0), B2(-2.0, 1.0);
    vector<Example> data; data.reserve(n);
    for (int i = 0; i < n / 2; ++i) data.push_back({{A1(rng), A2(rng)}, 1});
    for (int i = 0; i < n / 2; ++i) data.push_back({{B1(rng), B2(rng)}, 0});

    // Train logistic regression quickly (few iterations suffice)
    auto grad = [&](const vector<double>& theta) {
        int p = (int)theta.size() - 1;
        vector<double> g(p + 1, 0.0);
        for (const auto& e : data) {
            double z = theta[p];
            for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
            double p1 = 1.0 / (1.0 + exp(-z));
            double d = (p1 - e.y);
            for (int j = 0; j < p; ++j) g[j] += d * e.x[j];
            g[p] += d;
        }
        for (double& v : g) v /= data.size();
        return g;
    };

    int p = 2;
    vector<double> theta(p + 1, 0.0);
    double lr = 0.4; int iters = 800;
    for (int t = 0; t < iters; ++t) {
        auto g = grad(theta);
        double eta = lr / sqrt(1.0 + t * 0.01);
        for (int j = 0; j <= p; ++j) theta[j] -= eta * g[j];
    }

    // Compute Hessian and estimate sharpness
    auto H = logistic_hessian(data, theta);
    auto [lambda_max, v] = top_eigenpair(H, 100);

    cerr << fixed << setprecision(6);
    cerr << "Trained loss = " << logistic_loss(data, theta) << "\n";
    cerr << "Estimated top Hessian eigenvalue (sharpness) = " << lambda_max << "\n";

    // Optional: predict epsilon-sharpness via the quadratic approximation
    double eps = 0.5;
    double s_eps = 0.5 * eps * eps * lambda_max;
    cerr << "Approx. epsilon-sharpness for eps=0.5 is " << s_eps << "\n";

    return 0;
}
This program trains a small logistic regression model, forms its exact Hessian (feasible at small dimension), and estimates the largest eigenvalue using power iteration. The top eigenvalue quantifies local curvature (sharpness) and predicts the worst-case loss increase within a small radius via the quadratic approximation.