
Reproducing Kernel Hilbert Spaces (RKHS)

Key Points

  • An RKHS is a space of functions where evaluating a function at a point equals taking an inner product with a kernel section, which enables the "kernel trick."
  • A positive definite kernel k(x, y) uniquely defines an RKHS and its inner product via the Moore–Aronszajn theorem.
  • Learning in an RKHS often reduces to working with the n×n Gram matrix K, where K_{ij} = k(x_i, x_j); this enables non-linear learning with linear algebra.
  • The Representer Theorem guarantees that many regularized learning problems have solutions of the form f(·) = ∑_i α_i k(x_i, ·).
  • Kernel Ridge Regression (KRR) solves (K + λI)α = y and predicts by f(x) = k(x, X)^⊤ α.
  • Using Cholesky factorization instead of matrix inversion yields numerically stable and typically faster solutions for kernel methods.
  • Random Fourier Features approximate shift-invariant kernels like the RBF, reducing O(n²) memory to O(nD), where D is the feature dimension.
  • Choosing kernel hyperparameters (e.g., the RBF bandwidth γ) and the regularization λ critically affects generalization and conditioning.

Prerequisites

  • →Linear Algebra (vectors, inner products, PSD matrices) — RKHS relies on inner products, Gram matrices, and factorizations like Cholesky.
  • →Calculus and Real Analysis (basic) — Function spaces, continuity, and norms build directly on calculus foundations.
  • →Probability (Gaussians, expectations) — Bochner’s theorem, Random Fourier Features, and Gaussian Processes use probabilistic tools.
  • →Optimization (convexity, regularization) — RKHS learning problems are regularized ERM with convex objectives.
  • →Numerical Linear Algebra — Efficient and stable solutions require solving linear systems without explicit inversion.

Detailed Explanation


01 Overview

Hook: Imagine plotting data points in 2D that are not linearly separable. You wish you could lift them into some higher dimension where a straight line would work. Reproducing Kernel Hilbert Spaces (RKHS) let you do exactly that, without ever computing the lift directly.

Concept: An RKHS is a Hilbert space of functions tied to a positive definite kernel k(x, y). The kernel plays double duty: it defines an implicit feature map φ and the inner product in the function space through k(x, y) = ⟨φ(x), φ(y)⟩. Because we only need k to compute inner products, we can learn flexible, non-linear models using simple linear algebra on the n×n Gram matrix.

Example: In kernel ridge regression, the best-fitting function under a smoothness penalty has the form f(·) = ∑_i α_i k(x_i, ·). The coefficients α come from solving one linear system involving the Gram matrix, avoiding explicit high-dimensional features entirely.
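The "implicit lift" can be made concrete in a small case. As a sketch (the helper names `kernelPoly2`, `lift`, and `dotLifted` are illustrative, not from this page): for the degree-2 homogeneous polynomial kernel on 2D inputs, the kernel value equals an ordinary dot product after the explicit map φ(x) = (x₁², x₂², √2·x₁x₂).

```cpp
#include <array>
#include <cmath>

// Degree-2 homogeneous polynomial kernel: k(x, y) = (x1*y1 + x2*y2)^2.
double kernelPoly2(const std::array<double, 2>& x, const std::array<double, 2>& y) {
    double d = x[0] * y[0] + x[1] * y[1];
    return d * d;
}

// Explicit lift phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2); the kernel equals
// the ordinary dot product of the lifted vectors.
std::array<double, 3> lift(const std::array<double, 2>& x) {
    return { x[0] * x[0], x[1] * x[1], std::sqrt(2.0) * x[0] * x[1] };
}

double dotLifted(const std::array<double, 2>& x, const std::array<double, 2>& y) {
    auto px = lift(x), py = lift(y);
    return px[0] * py[0] + px[1] * py[1] + px[2] * py[2];
}
```

Here the lift is only 3-dimensional, so we could compute it; the point of the kernel trick is that k(x, y) stays cheap even when φ maps into very high (or infinite) dimensions.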

02 Intuition & Analogies

Think of a music equalizer. Raw sound is complicated, but sliding the equalizer knobs emphasizes certain frequencies so the music sounds better. Similarly, many datasets are tangled in their original space. A kernel is like a smart equalizer that implicitly maps inputs into a (possibly infinite-dimensional) space where the data becomes simpler—often linearly separable or easier to approximate. You never see the knobs (the feature map φ); you just hear the result (the kernel value k(x, y) that acts as an inner product there).

Another analogy: suppose you want to measure how similar two recipes are, but comparing ingredients one by one is tedious. Instead, you could ask a chef to give you a single similarity score that encodes their expert judgment of overlap in flavor, technique, and style. That chef's score is your kernel. It saves you from manually enumerating and aligning features.

The "reproducing" part means evaluating a function at x is the same as taking an inner product with a special function anchored at x, k(·, x). So to know f(x), you just correlate f with a kernel bump centered at x. This viewpoint explains why many regularized learning solutions boil down to weighted sums of such bumps around training points, letting us sculpt complex functions by adding or reducing the influence of each point.

03 Formal Definition

Let X be a nonempty set and k: X × X → ℝ a symmetric, positive definite kernel: for any finite {x_1, …, x_n} ⊂ X and any coefficients c_1, …, c_n, we have ∑_{i,j=1}^n c_i c_j k(x_i, x_j) ≥ 0. The Moore–Aronszajn theorem states that there exists a unique Hilbert space H of functions f: X → ℝ with inner product ⟨·, ·⟩_H such that: (1) for every x ∈ X, k(·, x) ∈ H; (2) reproducing property: for all f ∈ H and x ∈ X, f(x) = ⟨f, k(·, x)⟩_H. The kernel determines H and its inner product; conversely, every RKHS has such a kernel. For a dataset X = (x_1, …, x_n), define the Gram matrix K ∈ ℝ^{n×n} by K_{ij} = k(x_i, x_j); K is symmetric positive semidefinite. Many regularized empirical risk minimization problems over H, e.g., minimizing ∑_i ℓ(y_i, f(x_i)) + λ‖f‖²_H, admit solutions of the form f(·) = ∑_{i=1}^n α_i k(x_i, ·) (Representer Theorem).
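The defining quadratic-form condition can be sanity-checked numerically. A minimal sketch (the helper names `quadraticForm` and `rbfQuadFormNonneg` are mine; sampling random 1D points illustrates the property rather than proving it):

```cpp
#include <cmath>
#include <random>
#include <vector>

// Quadratic form sum_{i,j} c_i c_j k(x_i, x_j) for the RBF kernel on 1D inputs.
// Positive definiteness says this is >= 0 for every choice of points and coefficients.
double quadraticForm(const std::vector<double>& xs, const std::vector<double>& c, double gamma) {
    double q = 0.0;
    for (size_t i = 0; i < xs.size(); ++i)
        for (size_t j = 0; j < xs.size(); ++j) {
            double d = xs[i] - xs[j];
            q += c[i] * c[j] * std::exp(-gamma * d * d);  // k(x_i, x_j)
        }
    return q;
}

// Repeat with random points and random coefficients; tolerate tiny rounding error.
bool rbfQuadFormNonneg(int trials, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> N(0.0, 1.0);
    for (int t = 0; t < trials; ++t) {
        std::vector<double> xs(8), c(8);
        for (auto& v : xs) v = N(rng);
        for (auto& v : c)  v = N(rng);
        if (quadraticForm(xs, c, 1.0) < -1e-10) return false;
    }
    return true;
}
```

A single observed negative value would disprove positive definiteness; many nonnegative samples merely fail to falsify it, which is exactly what we expect for a valid kernel like the RBF.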

04 When to Use

  • Nonlinear patterns with small-to-medium datasets (e.g., n up to a few tens of thousands), where flexibility matters but you can still form an n×n Gram matrix.
  • When you need theoretically grounded similarity functions with guarantees (positive definiteness), e.g., SVMs, Gaussian Processes, Kernel Ridge Regression, Kernel PCA, and MMD-based two-sample testing.
  • When features are hard to engineer but a good similarity can be defined (strings, graphs, distributions) via custom kernels.
  • When you want convex training with global optima (e.g., KRR or SVMs) instead of deep non-convex optimization.
  • When interpretability in terms of influential training points is useful; solutions are sums of kernel sections k(·, x_i).
  • Avoid pure kernel methods when n is huge and O(n²) memory or O(n³) time is prohibitive; consider approximations like Random Fourier Features or Nyström in that case.

⚠️Common Mistakes

  • Using kernels that are not positive semidefinite, yielding invalid Gram matrices and unstable training. Test PSD by attempting a Cholesky factorization; add a small jitter to the diagonal if needed.
  • Explicitly inverting (K + λI). Numerical inversion is unstable and slower; solve the linear system via Cholesky or conjugate gradients.
  • Mis-tuning hyperparameters: the RBF bandwidth γ (or σ) and the regularization λ. Too large a γ overfits with spiky, overly local kernels; too small a γ underfits. Cross-validate on a log-scale grid.
  • Ignoring feature scaling. RBF kernels are very sensitive to input scales; standardize features.
  • Omitting regularization (λ = 0) with noisy data, which leads to ill-conditioning and overfitting.
  • Confusing the kernel trick with arbitrary similarity. Not every similarity function is a valid kernel; positive semidefiniteness is required.
  • In kernel PCA, forgetting to center the Gram matrix in feature space, which yields incorrect components.
  • Using too few Random Fourier Features (D too small), giving high-variance approximations; increase D until the kernel approximation error is acceptable.
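The jitter advice above can be seen on the smallest possible failing case. A hedged sketch (assuming an RBF kernel evaluated at two identical points, so the Gram matrix is exactly singular; `choleskyOK` is a stripped-down version of the factorization used in the code examples below):

```cpp
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Minimal Cholesky that reports failure on a non-SPD input.
bool choleskyOK(Mat A) {
    size_t n = A.size();
    Mat L(n, std::vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j <= i; ++j) {
            double s = A[i][j];
            for (size_t k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            if (i == j) {
                if (s <= 0.0) return false;  // zero or negative pivot: not SPD
                L[i][i] = std::sqrt(s);
            } else {
                L[i][j] = s / L[j][j];
            }
        }
    return true;
}

// Duplicated training points make the Gram matrix singular: for x1 == x2 the
// RBF Gram matrix is [[1, 1], [1, 1]], which has a zero eigenvalue. A strict
// Cholesky fails on it, but a tiny diagonal jitter restores factorability.
bool jitterRescues() {
    Mat K = {{1.0, 1.0}, {1.0, 1.0}};
    bool before = choleskyOK(K);                              // fails: zero pivot
    for (size_t i = 0; i < K.size(); ++i) K[i][i] += 1e-8;    // diagonal jitter
    bool after = choleskyOK(K);                               // succeeds
    return !before && after;
}
```

The jitter slightly biases the model (it acts like extra regularization), but 1e-8 on a unit-scale diagonal is usually negligible compared with λ.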

Key Formulas

Reproducing Property

f(x) = ⟨f, k(·, x)⟩_H

Explanation: In an RKHS, evaluating a function at x equals its inner product with the kernel section centered at x. This ties evaluation to geometry in the function space.

Gram Matrix

K_{ij} = k(x_i, x_j)

Explanation: The Gram matrix collects all pairwise kernel evaluations among training points. It is the main object used by kernel algorithms.

Positive Definiteness of Kernel

∑_{i=1}^n ∑_{j=1}^n c_i c_j k(x_i, x_j) ≥ 0

Explanation: A valid kernel yields nonnegative quadratic forms for any coefficients c. Equivalently, every Gram matrix built from the kernel is positive semidefinite.

Representer Theorem Form

f*(·) = ∑_{i=1}^n α_i k(x_i, ·)

Explanation: For many regularized learning problems in an RKHS, the optimal solution is a finite linear combination of kernel sections at training points.

KRR Dual Solution

α = (K + λI)^{-1} y

Explanation: Kernel ridge regression solves a linear system in the Gram matrix to find coefficients. In practice, avoid forming the inverse; solve via Cholesky.

KRR Prediction

f(x) = k(x, X)^⊤ α = ∑_{i=1}^n α_i k(x, x_i)

Explanation: Once α is known, predictions are kernel-weighted sums over training points. Complexity is linear in the number of training samples.

RBF (Gaussian) Kernel

k(x, y) = exp(−γ ‖x − y‖²)

Explanation: A popular, smooth, shift-invariant kernel with bandwidth parameter γ. Small γ spreads influence broadly; large γ makes it very local.

Polynomial Kernel

k(x, y) = (x^⊤ y + c)^p

Explanation: Encodes all monomials up to degree p (and interaction terms) implicitly. The offset c shifts the feature space.
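For intuition, the implicit feature map can be written out for scalar inputs and p = 2 (a worked expansion, valid for c ≥ 0):

```latex
(xy + c)^2 = x^2 y^2 + 2c\,xy + c^2
           = \left\langle \left(x^2,\ \sqrt{2c}\,x,\ c\right),\ \left(y^2,\ \sqrt{2c}\,y,\ c\right) \right\rangle
```

so the kernel computes a dot product over the monomials {x², x, 1} without ever forming them explicitly.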

Mercer Expansion

k(x, y) = ∑_{i=1}^∞ λ_i φ_i(x) φ_i(y),  λ_i ≥ 0

Explanation: A PSD kernel can be expanded in orthonormal eigenfunctions with nonnegative weights. This provides a spectral view of the RKHS.

Bochner Representation

k(x − y) = ∫_{ℝ^d} e^{i w^⊤ (x − y)} p(w) dw

Explanation: Any continuous, shift-invariant PSD kernel is the Fourier transform of a nonnegative measure. Sampling w from p(w) leads to Random Fourier Features.

KRR Objective

min_{f ∈ H} (1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ ‖f‖²_H

Explanation: Balances data fit (least squares) against RKHS norm complexity (smoothness). The solution is unique and given by the dual coefficients.
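Plugging the representer form f = ∑_j α_j k(x_j, ·) into this objective gives a finite-dimensional problem (a standard derivation; here the 1/n factor is absorbed into λ so that it matches the dual solution stated above):

```latex
J(\alpha) = \lVert y - K\alpha \rVert^2 + \lambda\, \alpha^\top K \alpha,
\qquad
\nabla_\alpha J = 2K\big((K + \lambda I)\alpha - y\big) = 0,
```

which is satisfied by α = (K + λI)^{-1} y. Since K + λI is positive definite for λ > 0, the linear system is always solvable, and the fitted function is unique.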

Complexity Analysis

Core RKHS algorithms rely on the n×n Gram matrix K. Building K requires O(n² d) time for d-dimensional inputs and O(n²) memory. Kernel Ridge Regression (KRR) solves (K + λI)α = y; a dense Cholesky factorization costs O(n³) time and O(n²) space, with subsequent triangular solves costing O(n²).

Prediction for a single test point costs O(n d) to form the kernel vector k(x, X) and O(n) for the dot product with α (the kernel evaluations usually dominate). If many test points are needed, total prediction time is O(n m d) for m tests. Numerical stability is enhanced by adding a small jitter ε to the diagonal and using Cholesky instead of explicit inversion, avoiding the extra cost and rounding error of forming an inverse.

The Random Fourier Features (RFF) approximation replaces the kernel with a D-dimensional explicit feature map z(x). Building z(x) costs O(D d), training a linear model is O(n D²) (closed-form ridge) or O(n D) per pass with SGD, and memory drops to O(n D). For fixed accuracy, D typically scales with the intrinsic complexity of the kernel and data; increasing D reduces approximation variance but increases compute.

PSD checks via attempted Cholesky cost O(n³); they are practical for moderate n but not for very large datasets. In summary: exact kernel methods are best for small-to-medium n; approximations like RFF or Nyström are needed when n is large to keep both time and memory manageable.

Code Examples

Kernel utilities: define kernels, build Gram matrix, and test PSD via Cholesky
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;
using Kernel = function<double(const Vec&, const Vec&)>;

// Basic helpers
static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
static double sqL2(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
    return s;
}

// Common kernels
Kernel linearKernel = [](const Vec& x, const Vec& y){ return dot(x, y); };
Kernel polynomialKernel(double c, int p) {
    return [c, p](const Vec& x, const Vec& y){ return pow(dot(x, y) + c, p); };
}
Kernel rbfKernel(double gamma) {
    return [gamma](const Vec& x, const Vec& y){ return exp(-gamma * sqL2(x, y)); };
}

Mat gramMatrix(const vector<Vec>& X, const Kernel& k) {
    size_t n = X.size();
    Mat K(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        K[i][i] = k(X[i], X[i]);
        for (size_t j = i + 1; j < n; ++j) {
            double v = k(X[i], X[j]);
            K[i][j] = v; K[j][i] = v;
        }
    }
    return K;
}

// Cholesky factorization for SPD matrices: A = L L^T
bool cholesky(const Mat& A, Mat& L) {
    size_t n = A.size();
    L.assign(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (size_t k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0 || !isfinite(sum)) return false; // not SPD
                L[i][i] = sqrt(sum);
            } else {
                if (L[j][j] == 0.0) return false;
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

// Solve (L L^T) x = b with forward/back substitution
Vec solveWithCholesky(const Mat& L, const Vec& b) {
    size_t n = L.size();
    Vec y(n, 0.0), x(n, 0.0);
    // Forward: L y = b
    for (size_t i = 0; i < n; ++i) {
        double sum = b[i];
        for (size_t k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    // Backward: L^T x = y
    for (int i = (int)n - 1; i >= 0; --i) {
        double sum = y[i];
        for (size_t k = i + 1; k < n; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}

// PSD test by trying Cholesky, with small jitter if necessary
bool isPSD(Mat K) {
    size_t n = K.size();
    Mat L;
    if (cholesky(K, L)) return true;
    // Try adding jitter progressively
    double eps = 1e-12;
    for (int t = 0; t < 6; ++t) {
        for (size_t i = 0; i < n; ++i) K[i][i] += eps;
        if (cholesky(K, L)) return true;
        eps *= 10.0;
    }
    return false;
}

int main() {
    // Toy 2D dataset
    vector<Vec> X = { {0.0, 0.0}, {1.0, 0.5}, {0.5, 1.5}, {2.0, 2.0} };

    // Define kernels
    auto lin = linearKernel;
    auto poly = polynomialKernel(1.0, 3); // (x^T y + 1)^3
    auto rbf = rbfKernel(0.5);            // gamma = 0.5

    // Build Gram matrices
    Mat K_lin = gramMatrix(X, lin);
    Mat K_poly = gramMatrix(X, poly);
    Mat K_rbf = gramMatrix(X, rbf);

    cout << boolalpha;
    cout << "Linear kernel PSD? " << isPSD(K_lin) << "\n";
    cout << "Polynomial kernel PSD? " << isPSD(K_poly) << "\n";
    cout << "RBF kernel PSD? " << isPSD(K_rbf) << "\n";

    // Print a Gram matrix
    cout << "RBF Gram matrix:\n";
    cout.setf(std::ios::fixed); cout << setprecision(4);
    for (auto& row : K_rbf) {
        for (double v : row) cout << setw(9) << v << ' ';
        cout << '\n';
    }
    return 0;
}

This program defines common kernels (linear, polynomial, RBF), builds the Gram matrix for a small dataset, and checks positive semidefiniteness by attempting a Cholesky factorization (with diagonal jitter for numerical robustness). A successful Cholesky factorization is a practical PSD certificate for Gram matrices.

Time: building K is O(n² d); Cholesky factorization is O(n³). Space: O(n²) for the Gram matrix and O(n²) for the Cholesky factor.
Kernel Ridge Regression (KRR) with Cholesky solver
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;
using Kernel = function<double(const Vec&, const Vec&)>;

static double sqL2(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
    return s;
}
Kernel rbfKernel(double gamma) {
    return [gamma](const Vec& x, const Vec& y){ return exp(-gamma * sqL2(x, y)); };
}

Mat gramMatrix(const vector<Vec>& X, const Kernel& k) {
    size_t n = X.size();
    Mat K(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j <= i; ++j) {
            double v = k(X[i], X[j]);
            K[i][j] = v; K[j][i] = v;
        }
    return K;
}

bool cholesky(const Mat& A, Mat& L) {
    size_t n = A.size();
    L.assign(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (size_t k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0 || !isfinite(sum)) return false;
                L[i][i] = sqrt(sum);
            } else {
                if (L[j][j] == 0.0) return false;
                L[i][j] = sum / L[j][j];
            }
        }
    return true;
}

Vec solveWithCholesky(const Mat& L, const Vec& b) {
    size_t n = L.size();
    Vec y(n, 0.0), x(n, 0.0);
    for (size_t i = 0; i < n; ++i) {        // forward: L y = b
        double s = b[i];
        for (size_t k = 0; k < i; ++k) s -= L[i][k] * y[k];
        y[i] = s / L[i][i];
    }
    for (int i = (int)n - 1; i >= 0; --i) { // backward: L^T x = y
        double s = y[i];
        for (size_t k = i + 1; k < n; ++k) s -= L[k][i] * x[k];
        x[i] = s / L[i][i];
    }
    return x;
}

struct KernelRidgeRegression {
    Kernel k;
    double lambda;
    vector<Vec> Xtrain;
    Vec alpha;
    bool fitted = false;

    KernelRidgeRegression(Kernel kernel, double lam) : k(kernel), lambda(lam) {}

    void fit(const vector<Vec>& X, const Vec& y) {
        Xtrain = X;
        size_t n = X.size();
        // Build K and add regularization to the diagonal
        Mat K = gramMatrix(X, k);
        for (size_t i = 0; i < n; ++i) K[i][i] += lambda;
        // Cholesky and solve
        Mat L;
        bool ok = cholesky(K, L);
        if (!ok) { // add jitter if needed
            for (size_t i = 0; i < n; ++i) K[i][i] += 1e-10;
            ok = cholesky(K, L);
            if (!ok) throw runtime_error("Cholesky failed: kernel may be non-PSD or ill-conditioned");
        }
        alpha = solveWithCholesky(L, y);
        fitted = true;
    }

    double predict_one(const Vec& x) const {
        if (!fitted) throw runtime_error("Model not fitted");
        double s = 0.0;
        for (size_t i = 0; i < Xtrain.size(); ++i) s += alpha[i] * k(x, Xtrain[i]);
        return s;
    }

    vector<double> predict(const vector<Vec>& X) const {
        vector<double> out;
        out.reserve(X.size());
        for (const auto& x : X) out.push_back(predict_one(x));
        return out;
    }
};

int main() {
    // Generate 1D regression data: y = sin(2*pi*x) + noise
    int n = 80;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> U(0.0, 1.0);
    std::normal_distribution<double> N(0.0, 0.1);
    vector<Vec> X; Vec y;
    X.reserve(n); y.reserve(n);
    for (int i = 0; i < n; ++i) {
        double xi = U(rng);
        X.push_back({xi});
        y.push_back(sin(2 * M_PI * xi) + N(rng));
    }

    auto k = rbfKernel(50.0); // gamma; sensitive to input scale -- here x in [0, 1]
    KernelRidgeRegression krr(k, 1e-2);
    krr.fit(X, y);

    // Compute MSE on training data
    double mse = 0.0;
    for (int i = 0; i < n; ++i) {
        double e = krr.predict_one(X[i]) - y[i];
        mse += e * e;
    }
    mse /= n;
    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Training MSE: " << mse << "\n";

    // Show a few predictions
    for (double xq : {0.0, 0.25, 0.5, 0.75, 1.0}) {
        cout << "x=" << setw(5) << xq << " f(x) ~ " << setw(9) << krr.predict_one(Vec{xq}) << '\n';
    }
    return 0;
}

This implements Kernel Ridge Regression using the RBF kernel. Training builds the Gram matrix, adds λ to the diagonal, performs a Cholesky factorization, and solves for α. Predictions are kernel-weighted sums over training points. The example fits a noisy sin curve, showing the power of RKHS-based smooth regression.

Time: training is O(n² d) to build K plus O(n³) for the Cholesky factorization; prediction per query is O(n d) for the kernel vector and O(n) for the dot product. Space: O(n²) for K and O(n) for α.
Random Fourier Features (RFF) for approximating the RBF kernel
#include <bits/stdc++.h>
using namespace std;
using Vec = vector<double>;

// RFF transformer for the RBF kernel: z(x) in R^D with z(x)^T z(y) ≈ k(x,y)
struct RFF {
    int D; int d; double gamma;  // RBF: k(x,y) = exp(-gamma * ||x - y||^2)
    vector<Vec> W;               // D x d Gaussian rows
    vector<double> b;            // D phases in [0, 2*pi]
    std::mt19937 rng;

    RFF(int D_, int d_, double gamma_, uint32_t seed = 42) : D(D_), d(d_), gamma(gamma_), rng(seed) {
        // For RBF with parameter gamma, the spectral density is Gaussian with variance 2*gamma
        double sigma2 = 2.0 * gamma; // w ~ N(0, 2*gamma*I)
        std::normal_distribution<double> N(0.0, sqrt(sigma2));
        std::uniform_real_distribution<double> U(0.0, 2.0 * M_PI);
        W.assign(D, Vec(d, 0.0));
        b.assign(D, 0.0);
        for (int i = 0; i < D; ++i) {
            for (int j = 0; j < d; ++j) W[i][j] = N(rng);
            b[i] = U(rng);
        }
    }

    Vec transform(const Vec& x) const {
        Vec z(D, 0.0);
        double scale = sqrt(2.0 / D);
        for (int i = 0; i < D; ++i) {
            double proj = 0.0;
            for (int j = 0; j < d; ++j) proj += W[i][j] * x[j];
            z[i] = scale * cos(proj + b[i]);
        }
        return z;
    }
};

static double sqL2(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
    return s;
}
static double rbf_exact(const Vec& x, const Vec& y, double gamma) {
    return exp(-gamma * sqL2(x, y));
}

int main() {
    // 3D random points
    int n = 1000; int d = 3; double gamma = 1.0;
    std::mt19937 rng(123);
    std::normal_distribution<double> N(0.0, 1.0);
    vector<Vec> X(n, Vec(d, 0.0));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < d; ++j) X[i][j] = N(rng);

    // Build RFF transformers with varying D
    for (int D : {100, 500, 2000}) {
        RFF rff(D, d, gamma, 7);
        // Estimate kernel approximation error over random pairs
        std::uniform_int_distribution<int> UI(0, n - 1);
        double mae = 0.0; int trials = 2000;
        for (int t = 0; t < trials; ++t) {
            int i = UI(rng), j = UI(rng);
            Vec zx = rff.transform(X[i]);
            Vec zy = rff.transform(X[j]);
            double approx = inner_product(zx.begin(), zx.end(), zy.begin(), 0.0);
            double exact = rbf_exact(X[i], X[j], gamma);
            mae += fabs(approx - exact);
        }
        mae /= trials;
        cout.setf(std::ios::fixed); cout << setprecision(6);
        cout << "D=" << setw(5) << D << " Mean |approx-exact| = " << mae << "\n";
    }
    return 0;
}

This demonstrates Random Fourier Features for the RBF kernel. It samples Gaussian frequencies according to Bochner’s theorem and maps x to z(x) so that z(x)^T z(y) approximates k(x,y). Increasing D improves the approximation while keeping memory linear in D instead of n.

Time: transforming one point is O(D d); building features for n points is O(n D d). Space: O(D d) for W and O(D) for b; storing the transformed data costs O(n D).