
Reproducing Kernel Hilbert Spaces (RKHS)

Key Points

  • An RKHS is a space of functions where evaluating a function at a point equals taking an inner product with a kernel section, which enables the "kernel trick."
  • A positive definite kernel k(x, y) uniquely defines an RKHS and its inner product via the Moore–Aronszajn theorem.
  • Learning in an RKHS often reduces to working with the n×n Gram matrix K, where K_{ij} = k(x_i, x_j); this enables non-linear learning with linear algebra.
  • The Representer Theorem guarantees that many regularized learning problems have solutions of the form f(·) = ∑_i α_i k(x_i, ·).
  • Kernel Ridge Regression (KRR) solves (K + λI)α = y and predicts by f(x) = k(x, X)^⊤ α.
  • Using Cholesky factorization instead of matrix inversion yields numerically stable and typically faster solutions for kernel methods.
  • Random Fourier Features approximate shift-invariant kernels like the RBF, reducing O(n²) memory to O(nD), where D is the feature dimension.
  • Choosing kernel hyperparameters (e.g., the RBF bandwidth γ) and the regularization λ critically affects generalization and conditioning.

Prerequisites

  • →Linear Algebra (vectors, inner products, PSD matrices) — RKHS relies on inner products, Gram matrices, and factorizations like Cholesky.
  • →Calculus and Real Analysis (basic) — Function spaces, continuity, and norms build directly on calculus foundations.
  • →Probability (Gaussians, expectations) — Bochner’s theorem, Random Fourier Features, and Gaussian Processes use probabilistic tools.
  • →Optimization (convexity, regularization) — RKHS learning problems are regularized ERM with convex objectives.
  • →Numerical Linear Algebra — Efficient and stable solutions require solving linear systems without explicit inversion.

Detailed Explanation


01 Overview

Hook: Imagine plotting data points in 2D that are not linearly separable. You wish you could lift them into some higher dimension where a straight line would work. Reproducing Kernel Hilbert Spaces (RKHS) let you do exactly that, without ever computing the lift directly.

Concept: An RKHS is a Hilbert space of functions tied to a positive definite kernel k(x, y). The kernel plays double duty: it defines an implicit feature map φ and the inner product in the function space through k(x, y) = ⟨φ(x), φ(y)⟩. Because we only need k to compute inner products, we can learn flexible, non-linear models using simple linear algebra on the n×n Gram matrix.

Example: In kernel ridge regression, the best-fitting function under a smoothness penalty has the form f(·) = ∑_i α_i k(x_i, ·). The coefficients α come from solving one linear system involving the Gram matrix, avoiding explicit high-dimensional features entirely.
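The "implicit lift" can be made concrete in a small case. As a sketch (the helper names `kernelPoly2`, `lift`, and `dotLifted` are illustrative, not from this page): for the degree-2 homogeneous polynomial kernel on 2D inputs, the kernel value equals an ordinary dot product after the explicit map φ(x) = (x₁², x₂², √2·x₁x₂).

```cpp
#include <array>
#include <cmath>

// Degree-2 homogeneous polynomial kernel: k(x, y) = (x1*y1 + x2*y2)^2.
double kernelPoly2(const std::array<double, 2>& x, const std::array<double, 2>& y) {
    double d = x[0] * y[0] + x[1] * y[1];
    return d * d;
}

// Explicit lift phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2); the kernel equals
// the ordinary dot product of the lifted vectors.
std::array<double, 3> lift(const std::array<double, 2>& x) {
    return { x[0] * x[0], x[1] * x[1], std::sqrt(2.0) * x[0] * x[1] };
}

double dotLifted(const std::array<double, 2>& x, const std::array<double, 2>& y) {
    auto px = lift(x), py = lift(y);
    return px[0] * py[0] + px[1] * py[1] + px[2] * py[2];
}
```

Here the lift is only 3-dimensional, so we could compute it; the point of the kernel trick is that k(x, y) stays cheap even when φ maps into very high (or infinite) dimensions.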

02 Intuition & Analogies

Think of a music equalizer. Raw sound is complicated, but sliding the equalizer knobs emphasizes certain frequencies so the music sounds better. Similarly, many datasets are tangled in their original space. A kernel is like a smart equalizer that implicitly maps inputs into a (possibly infinite-dimensional) space where the data becomes simpler—often linearly separable or easier to approximate. You never see the knobs (the feature map φ); you just hear the result (the kernel value k(x, y) that acts as an inner product there).

Another analogy: suppose you want to measure how similar two recipes are, but comparing ingredients one by one is tedious. Instead, you could ask a chef to give you a single similarity score that encodes their expert judgment of overlap in flavor, technique, and style. That chef's score is your kernel. It saves you from manually enumerating and aligning features.

The "reproducing" part means evaluating a function at x is the same as taking an inner product with a special function anchored at x, k(·, x). So to know f(x), you just correlate f with a kernel bump centered at x. This viewpoint explains why many regularized learning solutions boil down to weighted sums of such bumps around training points, letting us sculpt complex functions by adding or reducing the influence of each point.

03 Formal Definition

Let X be a nonempty set and k: X × X → ℝ a symmetric, positive definite kernel: for any finite {x_1, …, x_n} ⊂ X and any coefficients c_1, …, c_n, we have ∑_{i,j=1}^n c_i c_j k(x_i, x_j) ≥ 0. The Moore–Aronszajn theorem states that there exists a unique Hilbert space H of functions f: X → ℝ with inner product ⟨·, ·⟩_H such that: (1) for every x ∈ X, k(·, x) ∈ H; (2) reproducing property: for all f ∈ H and x ∈ X, f(x) = ⟨f, k(·, x)⟩_H. The kernel determines H and its inner product; conversely, every RKHS has such a kernel. For a dataset X = (x_1, …, x_n), define the Gram matrix K ∈ ℝ^{n×n} by K_{ij} = k(x_i, x_j); K is symmetric positive semidefinite. Many regularized empirical risk minimization problems over H, e.g., minimizing ∑_i ℓ(y_i, f(x_i)) + λ‖f‖²_H, admit solutions of the form f(·) = ∑_{i=1}^n α_i k(x_i, ·) (Representer Theorem).
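The defining quadratic-form condition can be sanity-checked numerically. A minimal sketch (the helper names `quadraticForm` and `rbfQuadFormNonneg` are mine; sampling random 1D points illustrates the property rather than proving it):

```cpp
#include <cmath>
#include <random>
#include <vector>

// Quadratic form sum_{i,j} c_i c_j k(x_i, x_j) for the RBF kernel on 1D inputs.
// Positive definiteness says this is >= 0 for every choice of points and coefficients.
double quadraticForm(const std::vector<double>& xs, const std::vector<double>& c, double gamma) {
    double q = 0.0;
    for (size_t i = 0; i < xs.size(); ++i)
        for (size_t j = 0; j < xs.size(); ++j) {
            double d = xs[i] - xs[j];
            q += c[i] * c[j] * std::exp(-gamma * d * d);  // k(x_i, x_j)
        }
    return q;
}

// Repeat with random points and random coefficients; tolerate tiny rounding error.
bool rbfQuadFormNonneg(int trials, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> N(0.0, 1.0);
    for (int t = 0; t < trials; ++t) {
        std::vector<double> xs(8), c(8);
        for (auto& v : xs) v = N(rng);
        for (auto& v : c)  v = N(rng);
        if (quadraticForm(xs, c, 1.0) < -1e-10) return false;
    }
    return true;
}
```

A single observed negative value would disprove positive definiteness; many nonnegative samples merely fail to falsify it, which is exactly what we expect for a valid kernel like the RBF.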

04 When to Use

  • Nonlinear patterns with small-to-medium datasets (e.g., n up to a few tens of thousands), where flexibility matters but you can still form an n×n Gram matrix.
  • When you need theoretically grounded similarity functions with guarantees (positive definiteness), e.g., SVMs, Gaussian Processes, Kernel Ridge Regression, Kernel PCA, and MMD-based two-sample testing.
  • When features are hard to engineer but a good similarity can be defined (strings, graphs, distributions) via custom kernels.
  • When you want convex training with global optima (e.g., KRR or SVMs) instead of deep non-convex optimization.
  • When interpretability in terms of influential training points is useful; solutions are sums of kernel sections k(·, x_i).
  • Avoid pure kernel methods when n is huge and O(n²) memory or O(n³) time is prohibitive; consider approximations like Random Fourier Features or Nyström in that case.

⚠️Common Mistakes

  • Using kernels that are not positive semidefinite, yielding invalid Gram matrices and unstable training. Test PSD by attempting a Cholesky factorization; add a small jitter to the diagonal if needed.
  • Explicitly inverting (K + λI). Numerical inversion is unstable and slower; solve the linear system via Cholesky or conjugate gradients.
  • Mis-tuning hyperparameters: the RBF bandwidth γ (or σ) and the regularization λ. Too large a γ overfits with spiky, overly local kernels; too small a γ underfits. Cross-validate on a log-scale grid.
  • Ignoring feature scaling. RBF kernels are very sensitive to input scales; standardize features.
  • Omitting regularization (λ = 0) with noisy data, which leads to ill-conditioning and overfitting.
  • Confusing the kernel trick with arbitrary similarity. Not every similarity function is a valid kernel; positive semidefiniteness is required.
  • In kernel PCA, forgetting to center the Gram matrix in feature space, which yields incorrect components.
  • Using too few Random Fourier Features (D too small), giving high-variance approximations; increase D until the kernel approximation error is acceptable.
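The jitter advice above can be seen on the smallest possible failing case. A hedged sketch (assuming an RBF kernel evaluated at two identical points, so the Gram matrix is exactly singular; `choleskyOK` is a stripped-down version of the factorization used in the code examples below):

```cpp
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Minimal Cholesky that reports failure on a non-SPD input.
bool choleskyOK(Mat A) {
    size_t n = A.size();
    Mat L(n, std::vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j <= i; ++j) {
            double s = A[i][j];
            for (size_t k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            if (i == j) {
                if (s <= 0.0) return false;  // zero or negative pivot: not SPD
                L[i][i] = std::sqrt(s);
            } else {
                L[i][j] = s / L[j][j];
            }
        }
    return true;
}

// Duplicated training points make the Gram matrix singular: for x1 == x2 the
// RBF Gram matrix is [[1, 1], [1, 1]], which has a zero eigenvalue. A strict
// Cholesky fails on it, but a tiny diagonal jitter restores factorability.
bool jitterRescues() {
    Mat K = {{1.0, 1.0}, {1.0, 1.0}};
    bool before = choleskyOK(K);                              // fails: zero pivot
    for (size_t i = 0; i < K.size(); ++i) K[i][i] += 1e-8;    // diagonal jitter
    bool after = choleskyOK(K);                               // succeeds
    return !before && after;
}
```

The jitter slightly biases the model (it acts like extra regularization), but 1e-8 on a unit-scale diagonal is usually negligible compared with λ.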

Key Formulas

Reproducing Property

f(x) = ⟨f, k(·, x)⟩_H

Explanation: In an RKHS, evaluating a function at x equals its inner product with the kernel section centered at x. This ties evaluation to geometry in the function space.

Gram Matrix

K_{ij} = k(x_i, x_j)

Explanation: The Gram matrix collects all pairwise kernel evaluations among training points. It is the main object used by kernel algorithms.

Positive Definiteness of Kernel

∑_{i=1}^n ∑_{j=1}^n c_i c_j k(x_i, x_j) ≥ 0

Explanation: A valid kernel yields nonnegative quadratic forms for any coefficients c. Equivalently, every Gram matrix built from the kernel is positive semidefinite.

Representer Theorem Form

f*(·) = ∑_{i=1}^n α_i k(x_i, ·)

Explanation: For many regularized learning problems in an RKHS, the optimal solution is a finite linear combination of kernel sections at training points.

KRR Dual Solution

α = (K + λI)^{-1} y

Explanation: Kernel ridge regression solves a linear system in the Gram matrix to find coefficients. In practice, avoid forming the inverse; solve via Cholesky.

KRR Prediction

f(x) = k(x, X)^⊤ α = ∑_{i=1}^n α_i k(x, x_i)

Explanation: Once α is known, predictions are kernel-weighted sums over training points. Complexity is linear in the number of training samples.

RBF (Gaussian) Kernel

k(x, y) = exp(−γ ‖x − y‖²)

Explanation: A popular, smooth, shift-invariant kernel with bandwidth parameter γ. Small γ spreads influence broadly; large γ makes it very local.

Polynomial Kernel

k(x, y) = (x^⊤ y + c)^p

Explanation: Encodes all monomials up to degree p (and interaction terms) implicitly. The offset c shifts the feature space.
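For intuition, the implicit feature map can be written out for scalar inputs and p = 2 (a worked expansion, valid for c ≥ 0):

```latex
(xy + c)^2 = x^2 y^2 + 2c\,xy + c^2
           = \left\langle \left(x^2,\ \sqrt{2c}\,x,\ c\right),\ \left(y^2,\ \sqrt{2c}\,y,\ c\right) \right\rangle
```

so the kernel computes a dot product over the monomials {x², x, 1} without ever forming them explicitly.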

Mercer Expansion

k(x, y) = ∑_{i=1}^∞ λ_i φ_i(x) φ_i(y),  λ_i ≥ 0

Explanation: A PSD kernel can be expanded in orthonormal eigenfunctions with nonnegative weights. This provides a spectral view of the RKHS.

Bochner Representation

k(x − y) = ∫_{ℝ^d} e^{i w^⊤ (x − y)} p(w) dw

Explanation: Any continuous, shift-invariant PSD kernel is the Fourier transform of a nonnegative measure. Sampling w from p(w) leads to Random Fourier Features.

KRR Objective

min_{f ∈ H} (1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ ‖f‖²_H

Explanation: Balances data fit (least squares) against RKHS norm complexity (smoothness). The solution is unique and given by the dual coefficients.
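Plugging the representer form f = ∑_j α_j k(x_j, ·) into this objective gives a finite-dimensional problem (a standard derivation; here the 1/n factor is absorbed into λ so that it matches the dual solution stated above):

```latex
J(\alpha) = \lVert y - K\alpha \rVert^2 + \lambda\, \alpha^\top K \alpha,
\qquad
\nabla_\alpha J = 2K\big((K + \lambda I)\alpha - y\big) = 0,
```

which is satisfied by α = (K + λI)^{-1} y. Since K + λI is positive definite for λ > 0, the linear system is always solvable, and the fitted function is unique.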

Complexity Analysis

Core RKHS algorithms rely on the n×n Gram matrix K. Building K requires O(n² d) time for d-dimensional inputs and O(n²) memory. Kernel Ridge Regression (KRR) solves (K + λI)α = y; a dense Cholesky factorization costs O(n³) time and O(n²) space, with subsequent triangular solves costing O(n²).

Prediction for a single test point costs O(n d) to form the kernel vector k(x, X) and O(n) for the dot product with α (the kernel evaluations usually dominate). If many test points are needed, total prediction time is O(n m d) for m tests. Numerical stability is enhanced by adding a small jitter ε to the diagonal and using Cholesky instead of explicit inversion, avoiding the extra cost and rounding error of forming an inverse.

The Random Fourier Features (RFF) approximation replaces the kernel with a D-dimensional explicit feature map z(x). Building z(x) costs O(D d), training a linear model is O(n D²) (closed-form ridge) or O(n D) per pass with SGD, and memory drops to O(n D). For fixed accuracy, D typically scales with the intrinsic complexity of the kernel and data; increasing D reduces approximation variance but increases compute.

PSD checks via attempted Cholesky cost O(n³); they are practical for moderate n but not for very large datasets. In summary: exact kernel methods are best for small-to-medium n; approximations like RFF or Nyström are needed when n is large to keep both time and memory manageable.

Code Examples

Kernel utilities: define kernels, build Gram matrix, and test PSD via Cholesky
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;
using Kernel = function<double(const Vec&, const Vec&)>;

// Basic helpers
static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
static double sqL2(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
    return s;
}

// Common kernels
Kernel linearKernel = [](const Vec& x, const Vec& y){ return dot(x, y); };
Kernel polynomialKernel(double c, int p) {
    return [c, p](const Vec& x, const Vec& y){ return pow(dot(x, y) + c, p); };
}
Kernel rbfKernel(double gamma) {
    return [gamma](const Vec& x, const Vec& y){ return exp(-gamma * sqL2(x, y)); };
}

Mat gramMatrix(const vector<Vec>& X, const Kernel& k) {
    size_t n = X.size();
    Mat K(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        K[i][i] = k(X[i], X[i]);
        for (size_t j = i + 1; j < n; ++j) {
            double v = k(X[i], X[j]);
            K[i][j] = v; K[j][i] = v;
        }
    }
    return K;
}

// Cholesky factorization for SPD matrices: A = L L^T
bool cholesky(const Mat& A, Mat& L) {
    size_t n = A.size();
    L.assign(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (size_t k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0 || !isfinite(sum)) return false; // not SPD
                L[i][i] = sqrt(sum);
            } else {
                if (L[j][j] == 0.0) return false;
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

// Solve (L L^T) x = b with forward/back substitution
Vec solveWithCholesky(const Mat& L, const Vec& b) {
    size_t n = L.size();
    Vec y(n, 0.0), x(n, 0.0);
    // Forward: L y = b
    for (size_t i = 0; i < n; ++i) {
        double sum = b[i];
        for (size_t k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    // Backward: L^T x = y
    for (int i = (int)n - 1; i >= 0; --i) {
        double sum = y[i];
        for (size_t k = i + 1; k < n; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}

// PSD test by trying Cholesky, with small jitter if necessary
bool isPSD(Mat K) {
    size_t n = K.size();
    Mat L;
    if (cholesky(K, L)) return true;
    // Try adding jitter progressively
    double eps = 1e-12;
    for (int t = 0; t < 6; ++t) {
        for (size_t i = 0; i < n; ++i) K[i][i] += eps;
        if (cholesky(K, L)) return true;
        eps *= 10.0;
    }
    return false;
}

int main() {
    // Toy 2D dataset
    vector<Vec> X = { {0.0, 0.0}, {1.0, 0.5}, {0.5, 1.5}, {2.0, 2.0} };

    // Define kernels
    auto lin = linearKernel;
    auto poly = polynomialKernel(1.0, 3); // (x^T y + 1)^3
    auto rbf = rbfKernel(0.5);            // gamma = 0.5

    // Build Gram matrices
    Mat K_lin = gramMatrix(X, lin);
    Mat K_poly = gramMatrix(X, poly);
    Mat K_rbf = gramMatrix(X, rbf);

    cout << boolalpha;
    cout << "Linear kernel PSD? " << isPSD(K_lin) << "\n";
    cout << "Polynomial kernel PSD? " << isPSD(K_poly) << "\n";
    cout << "RBF kernel PSD? " << isPSD(K_rbf) << "\n";

    // Print a Gram matrix
    cout << "RBF Gram matrix:\n";
    cout.setf(std::ios::fixed); cout << setprecision(4);
    for (auto& row : K_rbf) {
        for (double v : row) cout << setw(9) << v << ' ';
        cout << '\n';
    }
    return 0;
}

This program defines common kernels (linear, polynomial, RBF), builds the Gram matrix for a small dataset, and checks positive semidefiniteness by attempting a Cholesky factorization (with diagonal jitter for numerical robustness). A successful Cholesky factorization is a practical PSD certificate for Gram matrices.

Time: building K is O(n² d); Cholesky factorization is O(n³). Space: O(n²) for the Gram matrix and O(n²) for the Cholesky factor.
Kernel Ridge Regression (KRR) with Cholesky solver
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;
using Kernel = function<double(const Vec&, const Vec&)>;

static double sqL2(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
    return s;
}
Kernel rbfKernel(double gamma) {
    return [gamma](const Vec& x, const Vec& y){ return exp(-gamma * sqL2(x, y)); };
}

Mat gramMatrix(const vector<Vec>& X, const Kernel& k) {
    size_t n = X.size();
    Mat K(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j <= i; ++j) {
            double v = k(X[i], X[j]);
            K[i][j] = v; K[j][i] = v;
        }
    return K;
}

bool cholesky(const Mat& A, Mat& L) {
    size_t n = A.size();
    L.assign(n, Vec(n, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (size_t k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0 || !isfinite(sum)) return false;
                L[i][i] = sqrt(sum);
            } else {
                if (L[j][j] == 0.0) return false;
                L[i][j] = sum / L[j][j];
            }
        }
    return true;
}

Vec solveWithCholesky(const Mat& L, const Vec& b) {
    size_t n = L.size();
    Vec y(n, 0.0), x(n, 0.0);
    for (size_t i = 0; i < n; ++i) {        // forward: L y = b
        double s = b[i];
        for (size_t k = 0; k < i; ++k) s -= L[i][k] * y[k];
        y[i] = s / L[i][i];
    }
    for (int i = (int)n - 1; i >= 0; --i) { // backward: L^T x = y
        double s = y[i];
        for (size_t k = i + 1; k < n; ++k) s -= L[k][i] * x[k];
        x[i] = s / L[i][i];
    }
    return x;
}

struct KernelRidgeRegression {
    Kernel k;
    double lambda;
    vector<Vec> Xtrain;
    Vec alpha;
    bool fitted = false;

    KernelRidgeRegression(Kernel kernel, double lam) : k(kernel), lambda(lam) {}

    void fit(const vector<Vec>& X, const Vec& y) {
        Xtrain = X;
        size_t n = X.size();
        // Build K and add regularization to the diagonal
        Mat K = gramMatrix(X, k);
        for (size_t i = 0; i < n; ++i) K[i][i] += lambda;
        // Cholesky and solve
        Mat L;
        bool ok = cholesky(K, L);
        if (!ok) { // add jitter if needed
            for (size_t i = 0; i < n; ++i) K[i][i] += 1e-10;
            ok = cholesky(K, L);
            if (!ok) throw runtime_error("Cholesky failed: kernel may be non-PSD or ill-conditioned");
        }
        alpha = solveWithCholesky(L, y);
        fitted = true;
    }

    double predict_one(const Vec& x) const {
        if (!fitted) throw runtime_error("Model not fitted");
        double s = 0.0;
        for (size_t i = 0; i < Xtrain.size(); ++i) s += alpha[i] * k(x, Xtrain[i]);
        return s;
    }

    vector<double> predict(const vector<Vec>& X) const {
        vector<double> out;
        out.reserve(X.size());
        for (const auto& x : X) out.push_back(predict_one(x));
        return out;
    }
};

int main() {
    // Generate 1D regression data: y = sin(2*pi*x) + noise
    int n = 80;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> U(0.0, 1.0);
    std::normal_distribution<double> N(0.0, 0.1);
    vector<Vec> X; Vec y;
    X.reserve(n); y.reserve(n);
    for (int i = 0; i < n; ++i) {
        double xi = U(rng);
        X.push_back({xi});
        y.push_back(sin(2 * M_PI * xi) + N(rng));
    }

    auto k = rbfKernel(50.0); // gamma; sensitive to input scale -- here x in [0, 1]
    KernelRidgeRegression krr(k, 1e-2);
    krr.fit(X, y);

    // Compute MSE on training data
    double mse = 0.0;
    for (int i = 0; i < n; ++i) {
        double e = krr.predict_one(X[i]) - y[i];
        mse += e * e;
    }
    mse /= n;
    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Training MSE: " << mse << "\n";

    // Show a few predictions
    for (double xq : {0.0, 0.25, 0.5, 0.75, 1.0}) {
        cout << "x=" << setw(5) << xq << " f(x) ~ " << setw(9) << krr.predict_one(Vec{xq}) << '\n';
    }
    return 0;
}

This implements Kernel Ridge Regression using the RBF kernel. Training builds the Gram matrix, adds λ to the diagonal, performs a Cholesky factorization, and solves for α. Predictions are kernel-weighted sums over training points. The example fits a noisy sin curve, showing the power of RKHS-based smooth regression.

Time: training is O(n² d) to build K plus O(n³) for the Cholesky factorization; prediction per query is O(n d) for the kernel vector and O(n) for the dot product. Space: O(n²) for K and O(n) for α.
Random Fourier Features (RFF) for approximating the RBF kernel
#include <bits/stdc++.h>
using namespace std;
using Vec = vector<double>;

// RFF transformer for the RBF kernel: z(x) in R^D with z(x)^T z(y) ≈ k(x,y)
struct RFF {
    int D; int d; double gamma;  // RBF: k(x,y) = exp(-gamma * ||x - y||^2)
    vector<Vec> W;               // D x d Gaussian rows
    vector<double> b;            // D phases in [0, 2*pi]
    std::mt19937 rng;

    RFF(int D_, int d_, double gamma_, uint32_t seed = 42) : D(D_), d(d_), gamma(gamma_), rng(seed) {
        // For RBF with parameter gamma, the spectral density is Gaussian with variance 2*gamma
        double sigma2 = 2.0 * gamma; // w ~ N(0, 2*gamma*I)
        std::normal_distribution<double> N(0.0, sqrt(sigma2));
        std::uniform_real_distribution<double> U(0.0, 2.0 * M_PI);
        W.assign(D, Vec(d, 0.0));
        b.assign(D, 0.0);
        for (int i = 0; i < D; ++i) {
            for (int j = 0; j < d; ++j) W[i][j] = N(rng);
            b[i] = U(rng);
        }
    }

    Vec transform(const Vec& x) const {
        Vec z(D, 0.0);
        double scale = sqrt(2.0 / D);
        for (int i = 0; i < D; ++i) {
            double proj = 0.0;
            for (int j = 0; j < d; ++j) proj += W[i][j] * x[j];
            z[i] = scale * cos(proj + b[i]);
        }
        return z;
    }
};

static double sqL2(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) { double d = a[i] - b[i]; s += d * d; }
    return s;
}
static double rbf_exact(const Vec& x, const Vec& y, double gamma) {
    return exp(-gamma * sqL2(x, y));
}

int main() {
    // 3D random points
    int n = 1000; int d = 3; double gamma = 1.0;
    std::mt19937 rng(123);
    std::normal_distribution<double> N(0.0, 1.0);
    vector<Vec> X(n, Vec(d, 0.0));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < d; ++j) X[i][j] = N(rng);

    // Build RFF transformers with varying D
    for (int D : {100, 500, 2000}) {
        RFF rff(D, d, gamma, 7);
        // Estimate kernel approximation error over random pairs
        std::uniform_int_distribution<int> UI(0, n - 1);
        double mae = 0.0; int trials = 2000;
        for (int t = 0; t < trials; ++t) {
            int i = UI(rng), j = UI(rng);
            Vec zx = rff.transform(X[i]);
            Vec zy = rff.transform(X[j]);
            double approx = inner_product(zx.begin(), zx.end(), zy.begin(), 0.0);
            double exact = rbf_exact(X[i], X[j], gamma);
            mae += fabs(approx - exact);
        }
        mae /= trials;
        cout.setf(std::ios::fixed); cout << setprecision(6);
        cout << "D=" << setw(5) << D << " Mean |approx-exact| = " << mae << "\n";
    }
    return 0;
}

This demonstrates Random Fourier Features for the RBF kernel. It samples Gaussian frequencies according to Bochner’s theorem and maps x to z(x) so that z(x)^T z(y) approximates k(x,y). Increasing D improves the approximation while keeping memory linear in D instead of n.

Time: transforming one point is O(D d); building features for n points is O(n D d). Space: O(D d) for W and O(D) for b; storing the transformed data costs O(n D).