Loss Landscape Analysis
Key Points
- A loss landscape is the "terrain" of a model's loss as you move through parameter space; valleys are good solutions and peaks are bad ones.
- We analyze landscapes using low-dimensional slices around a trained point to visualize curvature, flatness, and saddles.
- Flat minima often correlate with better generalization; sharp minima can fit training data but fail on unseen data.
- Practical analysis includes grid evaluations, line searches, gradient/Hessian-based curvature estimates, and random-direction projections.
- Correct scaling and normalization of directions are crucial so that axes are comparable and plots are meaningful.
- Loss landscapes are high-dimensional, so any 2D plot is only a projection and can hide or exaggerate features.
- You can prototype analysis in C++ by evaluating loss on grids and along directions and by estimating top Hessian eigenvalues.
- Compute and export CSV data from C++ and plot with external tools (Python/Matplotlib, gnuplot) for clear visualizations.
Prerequisites
- Multivariable calculus (gradients and Hessians) – Loss landscape geometry relies on understanding derivatives and curvature.
- Linear algebra (vectors, matrices, eigenvalues) – Directions, orthonormalization, and Hessian eigen-analysis require these tools.
- Optimization basics (gradient descent) – Landscape analysis interprets and diagnoses the behavior of optimizers.
- Probability and statistics – Loss functions are empirical risks; noise and generalization require statistical thinking.
- Logistic and linear regression – Provide simple, convex examples where landscapes can be computed exactly.
- Numerical stability and scaling – Normalization of directions and safe evaluations prevent misleading plots.
- Data handling and file I/O in C++ – We export CSV grids from C++ for visualization.
- Plotting tools (external) – Visualizing the exported CSV with heatmaps/contours completes the analysis.
Detailed Explanation
Overview
Hook: Imagine hiking in fog across unknown mountains. Your goal is to find the lowest valley, but you can only feel the slope beneath your feet. That's what optimization feels like when training machine learning models. Concept: The loss landscape is the function that maps every possible setting of a model's parameters to a number measuring how badly the model performs. Visualizing or analyzing this surface helps us understand why optimization methods succeed or get stuck, and why some solutions generalize better than others. Because modern models have millions of parameters, the full surface is impossible to draw, but we can still study meaningful 1D and 2D slices, curvature, and local geometry. Example: After training a logistic regression classifier, we can pick two random directions in parameter space and evaluate the loss on a small grid around the found parameters. The resulting heatmap might reveal a wide, flat basin (good) or a sharp pit (risky for generalization).
Intuition & Analogies
Hook: Picture a drone scanning a landscape: smooth rolling hills are easy to traverse, while jagged cliffs and narrow ravines are dangerous and unpredictable. Training a model is like flying that drone using only local slope (the gradient) information. Concept: Flat valleys in the terrain mean the loss doesn't change much if you move a little, so the solution is robust to small perturbations, often implying better generalization. Sharp valleys (or spikes) mean tiny moves can worsen performance drastically; they usually arise from overfitting or overly aggressive training. Saddles are like passes between mountains: the slope is zero, but you can go down in some directions and up in others, confusing simple optimizers. Because we can't see in 1,000,000 dimensions, we use projections: walk along one direction (1D line) or a plane spanned by two directions (2D) and record the loss. Normalizing these directions is like choosing consistent step sizes so that the map scale is fair. Example: Take parameters θ at the end of training. Choose a random unit vector d and plot loss versus α in L(θ + αd). If the curve is shallow near α=0, the minimum is flat; if it spikes quickly, the minimum is sharp. Repeat with two orthonormal directions d1 and d2 to get a 2D heatmap that looks like a basin (flat) or a crater (sharp).
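To make the 1D probe concrete, here is a minimal C++ sketch of evaluating f(α) = L(θ + αd) along one normalized random direction. The loss_at function and the toy quadratic inside it are stand-ins for whatever model and loss you are actually studying; swap in your own evaluation.

#include <bits/stdc++.h>
using namespace std;

// Hypothetical stand-in for your model's loss; replace with your own evaluation.
double loss_at(const vector<double>& theta) {
    // Toy anisotropic quadratic bowl: sharp along dimension 0, flat along dimension 1.
    return 10.0 * theta[0] * theta[0] + 0.1 * theta[1] * theta[1];
}

int main() {
    std::mt19937 rng(7);
    std::normal_distribution<double> N(0.0, 1.0);

    vector<double> theta = {0.0, 0.0};   // pretend these are the trained parameters
    int dim = (int)theta.size();

    // Draw a random direction d and normalize it to unit length
    vector<double> d(dim);
    double s = 0.0;
    for (double& v : d) { v = N(rng); s += v * v; }
    for (double& v : d) v /= sqrt(s);

    // Evaluate f(alpha) = L(theta + alpha * d) along the line and print CSV rows
    for (int i = 0; i <= 40; ++i) {
        double alpha = -1.0 + 2.0 * i / 40.0;
        vector<double> th = theta;
        for (int k = 0; k < dim; ++k) th[k] += alpha * d[k];
        cout << alpha << "," << loss_at(th) << "\n";  // columns: alpha,loss
    }
    return 0;
}

A shallow curve near α = 0 indicates a flat minimum along that direction; a steep rise indicates sharpness.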
Formal Definition
Given a dataset \{(x_i, y_i)\}_{i=1}^{n}, a model f(x;\theta) with parameters \theta \in \mathbb{R}^{d}, and a per-example loss \ell, the loss landscape is the function L : \mathbb{R}^{d} \to \mathbb{R} defined by the empirical risk L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i;\theta), y_i). Analyzing the landscape means studying the geometry of this surface: its critical points (where \nabla L(\theta) = 0), the curvature encoded by the Hessian \nabla^2 L(\theta), and low-dimensional restrictions such as f(\alpha) = L(\theta^{*} + \alpha d).
When to Use
Hook: If training stalls, overfits, or behaves erratically, the shape of the landscape can reveal why. Concept: Use loss landscape analysis to diagnose optimization issues (plateaus, saddles), compare optimizers (SGD vs. Adam), tune regularization and learning rates, and assess robustness. It's also useful in research to understand why certain architectures (e.g., residual networks) are easier to optimize and why flat minima often generalize better. Visual patterns, such as wide basins versus needle-like pits, inform model and hyperparameter choices. Example:
- During hyperparameter tuning, plot 1D loss curves along random directions around the final solution; if curves are spiky, lower the learning rate or increase weight decay.
- When comparing two checkpoints, draw a linear interpolation curve L((1-t)\theta_{A} + t\theta_{B}) to see whether the path is barrier-free (compatible minima) or has a peak (different basins); a sketch of this probe follows below.
- For a small logistic regression or MLP, compute the top Hessian eigenvalue to quantify sharpness numerically.
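Below is a minimal sketch of the checkpoint-interpolation probe from the second bullet. The vectors theta_A and theta_B and the loss_at function are hypothetical placeholders; in practice they would come from two saved checkpoints and your model's loss.

#include <bits/stdc++.h>
using namespace std;

// Hypothetical loss evaluation; substitute your own model's loss.
double loss_at(const vector<double>& theta) {
    double dw = theta[0] - 3.0, db = theta[1] - 2.0;  // toy bowl centered at (3, 2)
    return dw * dw + 0.5 * db * db;
}

int main() {
    // Pretend these are two trained checkpoints (e.g., from different seeds).
    vector<double> theta_A = {2.5, 1.8};
    vector<double> theta_B = {3.6, 2.4};

    // Evaluate L((1-t)*theta_A + t*theta_B) for t in [0, 1].
    // A bump in this curve suggests the two solutions sit in different basins.
    for (int i = 0; i <= 20; ++i) {
        double t = (double)i / 20.0;
        vector<double> th(theta_A.size());
        for (size_t k = 0; k < th.size(); ++k)
            th[k] = (1.0 - t) * theta_A[k] + t * theta_B[k];
        cout << t << "," << loss_at(th) << "\n";  // columns: t,loss
    }
    return 0;
}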
⚠️ Common Mistakes
Hook: Pretty plots can mislead if produced carelessly. Concept: Common pitfalls include (1) unnormalized directions, which distort axes and exaggerate curvature; (2) grids that are too coarse and miss narrow structures; (3) relying solely on training loss while ignoring validation loss; (4) interpreting a single 2D slice as the whole story; (5) stochastic noise from mini-batches masking the true geometry; and (6) parameter symmetries (such as neuron permutations or scale invariances) that create deceptive flatness or ridges. Example:
- If you plot g(\alpha,\beta) without orthonormalizing d1 and d2, the heatmap may look skewed, falsely suggesting anisotropy.
- Using a small batch for evaluation adds noise; recompute the loss on the full dataset for clean plots.
- Two networks with BatchNorm have scale invariances; compare landscapes only after applying appropriate normalization so axes correspond to meaningful perturbation sizes (a simple direction-scaling sketch follows this list).
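One simple remedy for the scale issues above is to rescale a random direction so its norm matches the norm of the parameters it perturbs. This is a whole-vector simplification of the per-filter normalization used for scale-invariant networks; treat it as a sketch, and note that scale_matched_direction is an illustrative helper name, not an established API.

#include <bits/stdc++.h>
using namespace std;

// Rescale a random direction so ||d|| equals ||theta||. This keeps the axis scale
// of a slice tied to the size of the parameters being perturbed.
vector<double> scale_matched_direction(const vector<double>& theta, std::mt19937& rng) {
    std::normal_distribution<double> N(0.0, 1.0);
    vector<double> d(theta.size());
    double nd = 0.0, nt = 0.0;
    for (size_t i = 0; i < d.size(); ++i) { d[i] = N(rng); nd += d[i] * d[i]; }
    for (double t : theta) nt += t * t;
    double scale = sqrt(nt) / sqrt(max(nd, 1e-18));  // ||d|| becomes ||theta||
    for (double& v : d) v *= scale;
    return d;
}

int main() {
    std::mt19937 rng(3);
    vector<double> theta = {0.8, -1.2, 2.5};          // pretend trained parameters
    vector<double> d = scale_matched_direction(theta, rng);
    double nd = 0.0; for (double v : d) nd += v * v;
    cout << "||d|| = " << sqrt(nd) << " (matches ||theta||)\n";
    return 0;
}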
Key Formulas
Empirical risk
L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i;\theta),\, y_i\big)
Explanation: This defines the loss as the average of per-example losses over the training set. It is the primary surface we study during optimization.
Gradient and Hessian
\nabla L(\theta) = \Big(\frac{\partial L}{\partial \theta_1}, \dots, \frac{\partial L}{\partial \theta_d}\Big), \qquad H(\theta) = \nabla^2 L(\theta), \quad H_{jk} = \frac{\partial^2 L}{\partial \theta_j \,\partial \theta_k}
Explanation: The gradient gives the direction of steepest increase of loss, while the Hessian encodes curvature in all directions. They determine local geometry near any point.
Second-order Taylor approximation
L(\theta + \delta) \approx L(\theta) + \nabla L(\theta)^{\top}\delta + \tfrac{1}{2}\,\delta^{\top} H(\theta)\,\delta
Explanation: Near a point, loss changes are approximated by a linear term plus a quadratic curvature term. At minima where the gradient is small, curvature dominates local behavior.
1D/2D slices
f(\alpha) = L(\theta^{*} + \alpha d), \qquad g(\alpha, \beta) = L(\theta^{*} + \alpha d_1 + \beta d_2), \quad \|d_1\| = \|d_2\| = 1, \; d_1^{\top} d_2 = 0
Explanation: These restrict the high-dimensional surface to a line or plane using normalized directions. They make visualization and interpretation feasible.
Spectral sharpness
\lambda_{\max}(H) = \max_{\|v\|=1} v^{\top} H v
Explanation: The largest eigenvalue of the Hessian equals the maximum quadratic curvature along any unit direction. It is a principled measure of sharpness.
\epsilon-sharpness
\max_{\|\delta\| \le \epsilon} \big(L(\theta^{*} + \delta) - L(\theta^{*})\big) \approx \tfrac{1}{2}\,\epsilon^{2}\,\lambda_{\max}(H)
Explanation: The worst-case loss increase within a small ball scales with the top Hessian eigenvalue. This connects curvature to robustness of a solution.
Gram–Schmidt for directions
d_1 = \frac{u_1}{\|u_1\|}, \qquad d_2 = \frac{u_2 - (u_2^{\top} d_1)\, d_1}{\big\|u_2 - (u_2^{\top} d_1)\, d_1\big\|}
Explanation: This orthonormalizes two random directions so that axes in the 2D slice are perpendicular and scaled equally. It avoids distortions in the heatmap.
Linear regression MSE
L(w, b) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - (w x_i + b)\big)^2
Explanation: For 1D inputs, the loss is the mean squared error between true targets and line predictions. Its landscape over slope and intercept is a convex bowl.
Logistic loss
L(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big], \qquad p_i = \sigma(\theta^{\top}\tilde{x}_i), \quad \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \tilde{x}_i = [x_i;\, 1]
Explanation: Binary cross-entropy for logistic regression defines a convex surface in parameters. It is useful for illustrating slices and curvature analytically.
Logistic regression Hessian
H(\theta) = \frac{1}{n}\sum_{i=1}^{n} p_i (1 - p_i)\, \tilde{x}_i \tilde{x}_i^{\top}
Explanation: The Hessian equals a data-weighted covariance where weights are p(1-p). It is positive semi-definite, making the loss convex.
Complexity Analysis
Evaluating the loss once over n examples with d parameters costs O(nd), so a G x G grid or 2D slice costs O(G^2 nd) (e.g., 101 x 101 evaluations in the examples below). Forming the explicit logistic-regression Hessian costs O(nd^2) and each power-iteration step costs O(d^2), which is practical only for small d; for larger models, Hessian-vector products estimate the top eigenvalue without forming H.
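For that matrix-free case, the top eigenvalue can be estimated with power iteration using finite-difference Hessian-vector products, Hv ≈ (∇L(θ + εv) − ∇L(θ − εv)) / (2ε). A minimal sketch follows; grad_at is a hypothetical placeholder (here the gradient of a toy quadratic stands in for your model's gradient routine).

#include <bits/stdc++.h>
using namespace std;

// Hypothetical gradient of a toy quadratic loss L(theta) = 0.5 * theta^T A theta;
// replace with your model's gradient routine.
vector<double> grad_at(const vector<double>& theta) {
    static const vector<vector<double>> A = {{4.0, 1.0}, {1.0, 0.5}};
    vector<double> g(theta.size(), 0.0);
    for (size_t i = 0; i < theta.size(); ++i)
        for (size_t j = 0; j < theta.size(); ++j)
            g[i] += A[i][j] * theta[j];
    return g;
}

// Finite-difference Hessian-vector product: Hv ~= (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps)
vector<double> hvp(const vector<double>& theta, const vector<double>& v, double eps = 1e-4) {
    vector<double> tp = theta, tm = theta;
    for (size_t i = 0; i < theta.size(); ++i) { tp[i] += eps * v[i]; tm[i] -= eps * v[i]; }
    vector<double> gp = grad_at(tp), gm = grad_at(tm), out(theta.size());
    for (size_t i = 0; i < theta.size(); ++i) out[i] = (gp[i] - gm[i]) / (2.0 * eps);
    return out;
}

int main() {
    std::mt19937 rng(11);
    std::normal_distribution<double> N(0.0, 1.0);
    vector<double> theta = {0.3, -0.7};        // point at which sharpness is measured
    vector<double> v(theta.size());
    for (double& x : v) x = N(rng);

    double lambda = 0.0;
    for (int t = 0; t < 100; ++t) {            // power iteration with matrix-free Hv
        vector<double> Hv = hvp(theta, v);
        double nrm = 0.0; for (double x : Hv) nrm += x * x; nrm = sqrt(max(nrm, 1e-18));
        for (size_t i = 0; i < v.size(); ++i) v[i] = Hv[i] / nrm;
        lambda = nrm;                          // converges to lambda_max for this PSD matrix
    }
    cout << "Estimated top Hessian eigenvalue: " << lambda << "\n";  // ~4.27 for this toy A
    return 0;
}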
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Compute Mean Squared Error for y = w*x + b on dataset (x[i], y[i]).
double mse_loss(const vector<double>& x, const vector<double>& y, double w, double b) {
    const int n = (int)x.size();
    double sumsq = 0.0;
    for (int i = 0; i < n; ++i) {
        double pred = w * x[i] + b;
        double diff = y[i] - pred;
        sumsq += diff * diff;
    }
    return sumsq / n;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // 1) Generate synthetic linear data y = 3x + 2 + noise
    int n = 200;
    vector<double> x(n), y(n);
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> ux(-2.0, 2.0);
    std::normal_distribution<double> noise(0.0, 0.4);
    for (int i = 0; i < n; ++i) {
        x[i] = ux(rng);
        y[i] = 3.0 * x[i] + 2.0 + noise(rng);
    }

    // 2) Define a grid over (w, b)
    int G = 101; // grid resolution per axis
    double w_min = 0.0, w_max = 6.0;
    double b_min = -1.0, b_max = 5.0;

    // 3) Evaluate loss on grid and write CSV for plotting
    ofstream out("linear_surface.csv");
    out << "w,b,loss\n";

    double best_w = 0, best_b = 0, best_loss = numeric_limits<double>::infinity();

    for (int i = 0; i < G; ++i) {
        double w = w_min + (w_max - w_min) * i / (G - 1);
        for (int j = 0; j < G; ++j) {
            double b = b_min + (b_max - b_min) * j / (G - 1);
            double L = mse_loss(x, y, w, b);
            out << w << "," << b << "," << L << "\n";
            if (L < best_loss) {
                best_loss = L; best_w = w; best_b = b;
            }
        }
    }
    out.close();

    cerr << fixed << setprecision(6);
    cerr << "Best on grid: w=" << best_w << ", b=" << best_b << ", loss=" << best_loss << "\n";
    cerr << "CSV written to linear_surface.csv (columns: w,b,loss). Plot with your preferred tool.\n";

    return 0;
}
This program creates a simple 1D regression dataset and evaluates the mean squared error on a 2D grid of slope (w) and intercept (b). It writes a CSV file for external visualization, revealing a convex bowl-shaped surface. The grid minimum approximates the optimal least-squares parameters.
#include <bits/stdc++.h>
using namespace std;

struct Example { vector<double> x; int y; };

// Sigmoid function
static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

// Compute logistic (binary cross-entropy) loss given parameters theta = [w1, w2, ..., b]
double logistic_loss(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1; // last element is bias b
    double loss = 0.0;
    for (const auto& e : data) {
        double z = theta.back(); // bias
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        p1 = min(max(p1, 1e-12), 1.0 - 1e-12); // numeric safety
        loss += -(e.y ? log(p1) : log(1.0 - p1));
    }
    return loss / data.size();
}

// Gradient of logistic loss
vector<double> logistic_grad(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1;
    vector<double> g(p + 1, 0.0);
    for (const auto& e : data) {
        double z = theta.back();
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        double diff = (p1 - e.y); // derivative of BCE wrt z
        for (int j = 0; j < p; ++j) g[j] += diff * e.x[j];
        g.back() += diff; // bias term
    }
    for (double& v : g) v /= data.size();
    return g;
}

// Train logistic regression via gradient descent
vector<double> train_logreg(const vector<Example>& data, int p, int iters = 2000, double lr = 0.5) {
    vector<double> theta(p + 1, 0.0); // initialize to zeros
    for (int t = 0; t < iters; ++t) {
        vector<double> g = logistic_grad(data, theta);
        double eta = lr / sqrt(1.0 + t * 0.01); // mild decay
        for (int j = 0; j <= p; ++j) theta[j] -= eta * g[j];
    }
    return theta;
}

// Orthonormalize two random directions using Gram-Schmidt
pair<vector<double>, vector<double>> two_orthonormal_directions(int dim, std::mt19937& rng) {
    std::normal_distribution<double> N(0.0, 1.0);
    vector<double> d1(dim), d2(dim);
    for (int i = 0; i < dim; ++i) { d1[i] = N(rng); d2[i] = N(rng); }
    auto norm = [](const vector<double>& v){ double s = 0; for (double x : v) s += x * x; return sqrt(max(1e-18, s)); };
    auto dot = [](const vector<double>& a, const vector<double>& b){ double s = 0; for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i]; return s; };

    // Normalize d1
    double n1 = norm(d1); for (double& x : d1) x /= n1;
    // Make d2 orthogonal to d1
    double proj = dot(d2, d1);
    for (int i = 0; i < dim; ++i) d2[i] -= proj * d1[i];
    // Normalize d2
    double n2 = norm(d2); for (double& x : d2) x /= n2;
    return {d1, d2};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // 1) Create a separable 2D dataset (two Gaussian blobs)
    std::mt19937 rng(123);
    int n_per_class = 200;
    vector<Example> data; data.reserve(2 * n_per_class);
    normal_distribution<double> A1( 2.0, 1.0), A2( 2.0, 1.0);
    normal_distribution<double> B1(-2.0, 1.0), B2(-2.0, 1.0);
    for (int i = 0; i < n_per_class; ++i) {
        data.push_back({{A1(rng), A2(rng)}, 1});
        data.push_back({{B1(rng), B2(rng)}, 0});
    }

    // 2) Train logistic regression (parameters: w1, w2, b)
    int p = 2;
    vector<double> theta = train_logreg(data, p, 1500, 0.4);
    cerr << fixed << setprecision(6);
    cerr << "Trained loss = " << logistic_loss(data, theta) << "\n";

    // 3) Build two orthonormal directions in R^3 around theta
    auto [d1, d2] = two_orthonormal_directions(p + 1, rng);

    // 4) Evaluate 2D slice g(alpha, beta) = L(theta + alpha*d1 + beta*d2)
    int G = 101;    // grid resolution per axis
    double r = 1.5; // radius along each direction
    ofstream out("logreg_slice.csv");
    out << "alpha,beta,loss\n";
    for (int i = 0; i < G; ++i) {
        double alpha = -r + 2 * r * i / (G - 1);
        for (int j = 0; j < G; ++j) {
            double beta = -r + 2 * r * j / (G - 1);
            vector<double> th = theta;
            for (int k = 0; k <= p; ++k) th[k] += alpha * d1[k] + beta * d2[k];
            double L = logistic_loss(data, th);
            out << alpha << "," << beta << "," << L << "\n";
        }
    }
    out.close();
    cerr << "2D slice written to logreg_slice.csv (columns: alpha,beta,loss).\n";

    return 0;
}
We generate two Gaussian clusters, train a 2D logistic regression, and then probe the landscape by evaluating a 2D slice around the trained parameters along two orthonormal directions. The resulting CSV can be plotted as a heatmap or contour plot, revealing local flatness or sharpness of the found solution.
#include <bits/stdc++.h>
using namespace std;

struct Example { vector<double> x; int y; };

static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

double logistic_loss(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1;
    double loss = 0.0;
    for (const auto& e : data) {
        double z = theta.back();
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        p1 = min(max(p1, 1e-12), 1.0 - 1e-12);
        loss += -(e.y ? log(p1) : log(1.0 - p1));
    }
    return loss / data.size();
}

// Compute the Hessian for logistic regression explicitly:
// H = (1/n) sum p(1-p) a a^T, where a = [x; 1] includes the bias as an extra feature.
vector<vector<double>> logistic_hessian(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1; // number of features (bias excluded)
    int d = p + 1;                 // augmented dimension including bias
    vector<vector<double>> H(d, vector<double>(d, 0.0));
    for (const auto& e : data) {
        // augmented feature a = [x; 1]
        vector<double> a(d);
        for (int j = 0; j < p; ++j) a[j] = e.x[j];
        a[p] = 1.0;
        double z = theta[p];
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        double w = p1 * (1.0 - p1);
        for (int i = 0; i < d; ++i)
            for (int j = 0; j < d; ++j)
                H[i][j] += w * a[i] * a[j];
    }
    double invn = 1.0 / data.size();
    for (auto& row : H) for (double& x : row) x *= invn;
    return H;
}

// Power iteration to estimate the largest eigenvalue of a symmetric matrix H
pair<double, vector<double>> top_eigenpair(const vector<vector<double>>& H, int iters = 100) {
    int d = (int)H.size();
    std::mt19937 rng(777);
    std::normal_distribution<double> N(0.0, 1.0);
    vector<double> v(d);
    for (int i = 0; i < d; ++i) v[i] = N(rng);
    auto norm2 = [](const vector<double>& x){ double s = 0; for (double v : x) s += v * v; return sqrt(max(1e-18, s)); };

    for (int t = 0; t < iters; ++t) {
        vector<double> Hv(d, 0.0);
        for (int i = 0; i < d; ++i) for (int j = 0; j < d; ++j) Hv[i] += H[i][j] * v[j];
        double nrm = norm2(Hv);
        for (int i = 0; i < d; ++i) v[i] = Hv[i] / nrm;
    }
    // Rayleigh quotient as eigenvalue estimate
    double num = 0.0, den = 0.0;
    vector<double> Hv(d, 0.0);
    for (int i = 0; i < d; ++i) for (int j = 0; j < d; ++j) Hv[i] += H[i][j] * v[j];
    for (int i = 0; i < d; ++i) { num += v[i] * Hv[i]; den += v[i] * v[i]; }
    double lambda = num / den;
    return {lambda, v};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create a simple two-blob dataset
    std::mt19937 rng(42);
    int n = 400;
    normal_distribution<double> A1( 2.0, 1.0), A2( 2.0, 1.0);
    normal_distribution<double> B1(-2.0, 1.0), B2(-2.0, 1.0);
    vector<Example> data; data.reserve(n);
    for (int i = 0; i < n / 2; ++i) data.push_back({{A1(rng), A2(rng)}, 1});
    for (int i = 0; i < n / 2; ++i) data.push_back({{B1(rng), B2(rng)}, 0});

    // Train logistic regression quickly (few iterations suffice)
    auto grad = [&](const vector<double>& theta) {
        int p = (int)theta.size() - 1;
        vector<double> g(p + 1, 0.0);
        for (const auto& e : data) {
            double z = theta[p];
            for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
            double p1 = 1.0 / (1.0 + exp(-z));
            double d = (p1 - e.y);
            for (int j = 0; j < p; ++j) g[j] += d * e.x[j];
            g[p] += d;
        }
        for (double& v : g) v /= data.size();
        return g;
    };

    int p = 2;
    vector<double> theta(p + 1, 0.0);
    double lr = 0.4; int iters = 800;
    for (int t = 0; t < iters; ++t) {
        auto g = grad(theta);
        double eta = lr / sqrt(1.0 + t * 0.01);
        for (int j = 0; j <= p; ++j) theta[j] -= eta * g[j];
    }

    // Compute Hessian and estimate sharpness
    auto H = logistic_hessian(data, theta);
    auto [lambda_max, v] = top_eigenpair(H, 100);

    cerr << fixed << setprecision(6);
    cerr << "Trained loss = " << logistic_loss(data, theta) << "\n";
    cerr << "Estimated top Hessian eigenvalue (sharpness) = " << lambda_max << "\n";

    // Optional: predict epsilon-sharpness via the quadratic approximation
    double eps = 0.5;
    double s_eps = 0.5 * eps * eps * lambda_max;
    cerr << "Approx. epsilon-sharpness for eps=0.5 is " << s_eps << "\n";

    return 0;
}
This program trains a small logistic regression model, forms its exact Hessian (feasible at small dimension), and estimates the largest eigenvalue using power iteration. The top eigenvalue quantifies local curvature (sharpness) and predicts the worst-case loss increase within a small radius via the quadratic approximation.