Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 8 - LLM Evaluation
Key Summary
- Kernel methods turn simple linear algorithms into powerful non-linear ones. Instead of drawing only straight lines to separate data, they let us curve and bend the boundary by working in a higher-dimensional feature space. This keeps training simple while unlocking complex patterns.
- A feature map, often written as φ(x), transforms the original input into new features. A linear model in this new space becomes a non-linear model back in the original space. This trick helps solve problems where straight-line boundaries fail.
- In a 2D example, a straight line could not separate blue and orange points, but a circle could. By mapping (x1, x2) to features like (x1^2, x2^2, √2 x1 x2), we can create models that behave like circles or other curves in the original space. This shows how feature maps enable non-linear decision boundaries.
- The kernel trick avoids computing φ(x) explicitly. A kernel function k(x, x') equals the dot product of φ(x) and φ(x') in feature space. This lets algorithms use rich features without the heavy cost of building them.
- The polynomial kernel k(x, x') = (x^T x' + c)^d represents all polynomial combinations up to degree d. It saves the cost of explicitly enumerating all polynomial features, which can explode as d grows. With it, we get non-linear power using only the kernel function.
- A worked example: with c=0 and d=2, (x^T x')^2 equals x1^2 x1'^2 + x2^2 x2'^2 + 2 x1 x2 x1' x2'. This matches the dot product of feature vectors [x1^2, x2^2, √2 x1 x2]. So a degree-2 polynomial kernel implicitly computes the same as a specific quadratic feature map.
- Choosing a feature map is part art and part science. Common, well-tested choices include polynomial and Gaussian (RBF) feature families. Domain knowledge and experimentation guide which kernel and parameters to use.
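The degree-2 worked example above can be checked numerically. A minimal sketch (NumPy assumed; the input values are arbitrary illustrative choices):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly2_kernel(x, xp):
    """Degree-2 polynomial kernel with c = 0: k(x, x') = (x . x')^2."""
    return np.dot(x, xp) ** 2

x  = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

lhs = poly2_kernel(x, xp)      # kernel, computed entirely in the original 2D space
rhs = np.dot(phi(x), phi(xp))  # dot product in the explicit 3D feature space

print(lhs, rhs)  # both equal (1*3 + 2*(-1))^2 = 1.0
```

The two numbers agree, which is exactly the claim that the kernel equals the feature-space dot product.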
Why This Lecture Matters
Kernel methods let you keep the comfort and stability of linear algorithms while unlocking the power of non-linear modeling. This is valuable for data scientists and ML engineers who need accuracy on complex patterns but want reliable, well-understood optimization. Instead of inventing a brand-new non-linear method, you can often kernelize an existing linear one by swapping dot products for kernel functions. This approach reduces engineering overhead while achieving sophisticated decision boundaries.

In real projects, kernel methods help when classes are not linearly separable or when relationships involve interactions and curved structures. The polynomial kernel captures feature interactions without enumerating them. The RBF kernel captures smooth, local similarities in an intuitive way: nearby points act alike. The representer theorem guarantees that solutions reduce to learning a finite set of coefficients, which is practical and interpretable in terms of similarity to training data.

Professionally, knowing kernels strengthens your toolbox beyond plain linear models and deep networks. You can quickly prototype powerful models and benchmark performance before moving to heavier solutions. Kernels also connect to advanced topics such as kernel PCA for non-linear dimensionality reduction and Gaussian processes for probabilistic modeling, broadening your capabilities. In the modern ML landscape, where flexibility and reliability both matter, kernel methods remain a cornerstone technique worth mastering.
Lecture Summary
Overview
This lecture explains kernel methods, a family of techniques that transform simple linear learning algorithms into powerful non-linear ones. The central idea is to map inputs into a higher-dimensional feature space where linear models can create curved decision boundaries when viewed back in the original space. This mapping is called a feature map and is typically written as φ(x). Although feature maps can be explicit, the lecture’s core insight is the kernel trick: rather than compute φ(x) directly, we use a kernel function k(x, x') that equals the dot product of φ(x) and φ(x') in the feature space. With this substitution, many linear algorithms can operate as if they had rich non-linear features, without ever building them explicitly.
The lecture starts with motivation: many real-world relationships are non-linear, so linear models are often too simple. Non-linear models can be powerful but are sometimes hard to train and can involve many parameters. Kernel methods bridge this gap: they allow the use of well-understood linear optimization techniques while achieving non-linear behavior through feature maps and kernels. The instructor demonstrates this with a 2D classification example: a straight line cannot separate two classes, but a circle can. By mapping (x1, x2) to a higher-dimensional set of features, and then learning a linear separator in that space, the model can produce a circular boundary in the original space.
Next, the instructor shows specific feature maps and their properties. One quadratic map φ(x) = [x1^2, x2^2, √2 x1 x2] has a special property: the dot product of feature vectors equals (x^T x')^2. This illustrates how some feature maps correspond to simple kernel functions like the degree-2 polynomial kernel. However, not all feature maps are equally helpful: with certain weights, the first map produced degenerate boundaries. Adding original linear terms to the map (φ(x) = [x1^2, x2^2, x1, x2]) allowed a circular decision boundary centered at (1/2, 1/2), showing the importance of expressiveness.
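The augmented map's circular boundary can be verified directly. The weights below are hand-picked for illustration (not learned) so that a linear model in the feature space [x1^2, x2^2, x1, x2] traces a circle centered at (1/2, 1/2), as described above:

```python
import numpy as np

def phi(x):
    """Quadratic map augmented with linear terms: [x1^2, x2^2, x1, x2]."""
    x1, x2 = x
    return np.array([x1**2, x2**2, x1, x2])

# Linear model w . phi(x) + b in feature space. These weights realize the
# circle (x1 - 1/2)^2 + (x2 - 1/2)^2 = r^2, which expands to
# x1^2 + x2^2 - x1 - x2 + (1/2 - r^2) = 0: linear in phi(x).
r = 0.4
w = np.array([1.0, 1.0, -1.0, -1.0])
b = 0.5 - r**2

def decision(x):
    return np.dot(w, phi(x)) + b

on_circle = np.array([0.5 + r, 0.5])  # lies exactly on the boundary
inside    = np.array([0.5, 0.5])      # the circle's centre
outside   = np.array([1.5, 1.5])      # far from the centre

print(decision(on_circle))  # 0: on the boundary
print(decision(inside))     # negative: inside the circle
print(decision(outside))    # positive: outside the circle
```

A linear separator in the richer space thus draws a circle in the original space, which the pure quadratic map [x1^2, x2^2, √2 x1 x2] alone cannot do for circles off the origin.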
The lecture then formalizes the kernel trick. A kernel is any function k(x, x') that acts as an inner product in some (possibly infinite) feature space: k(x, x') = φ(x)^T φ(x'). By replacing every occurrence of a dot product in a linear algorithm with k(x, x'), we can run that algorithm in the implicit feature space. The polynomial kernel k(x, x') = (x^T x' + c)^d is a standard example that corresponds to including all polynomial terms up to degree d without explicitly constructing them. This avoids the computational blow-up that happens when enumerating all polynomial features at high degrees.
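The cost asymmetry can be made concrete: evaluating the polynomial kernel takes one n-dimensional dot product, while an explicit feature map would have to enumerate C(n+d, d) monomials of degree up to d. A small sketch (the dimensions and inputs are arbitrary illustrative choices):

```python
import numpy as np
from math import comb

def poly_kernel(x, xp, c=1.0, d=3):
    """Polynomial kernel: one dot product in the original space, then a power."""
    return (np.dot(x, xp) + c) ** d

n, d = 100, 5
x  = np.random.default_rng(0).normal(size=n)
xp = np.random.default_rng(1).normal(size=n)

k = poly_kernel(x, xp, c=1.0, d=d)  # O(n) work, regardless of d

# Number of monomials of degree <= d in n variables -- what an explicit
# feature map would have to enumerate:
n_features = comb(n + d, d)
print(n_features)  # 96,560,646 features for n=100, d=5
```

Nearly a hundred million implicit features, reached with a single 100-dimensional dot product.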
Next, the lecture covers when a function is a valid kernel. Mercer’s theorem provides the condition: k must be symmetric and positive semi-definite (PSD). In practical terms, given any set of inputs, the resulting kernel (Gram) matrix K with entries K_ij = k(x_i, x_j) must be PSD. The instructor also explains how to check PSD: ensure all eigenvalues of K are non-negative. A determinant-based test (Sylvester’s criterion) also exists, but note its fine print: strictly positive leading principal minors certify positive definiteness, while semi-definiteness requires all principal minors (not only the leading ones) to be non-negative.
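The Gram-matrix check can be sketched as follows; the RBF kernel, the random data, and the jitter size are arbitrary illustrative choices:

```python
import numpy as np

def rbf(x, xp, sigma=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 points in 3 dimensions

# Gram matrix K_ij = k(x_i, x_j); symmetric by construction
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

eigs = np.linalg.eigvalsh(K)  # symmetric matrix => real eigenvalues
print(eigs.min())             # non-negative (up to rounding) for a valid kernel

# If rounding produces tiny negative eigenvalues, add a small jitter term:
K_stable = K + 1e-8 * np.eye(len(X))
print(np.linalg.eigvalsh(K_stable).min())  # strictly positive after the jitter
```

The jitter trick mirrors the stability advice in the takeaways: εI shifts every eigenvalue up by ε without meaningfully changing the model.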
The lecture then surveys common kernels. The Gaussian or radial basis function (RBF) kernel k(x, x') = exp(-||x - x'||^2 / (2σ^2)) depends only on the distance between points and corresponds to an infinite-dimensional feature map. The sigmoid kernel k(x, x') = tanh(α x^T x' + c) behaves like a two-layer neural network. Each kernel has parameters (e.g., σ for RBF, degree d for polynomial) that control model flexibility. Choice of kernel and parameters is guided by domain knowledge and experimentation.
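The RBF kernel's locality, and how σ controls it, can be seen with a few hand-picked points (illustrative values only):

```python
import numpy as np

def rbf(x, xp, sigma):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma**2))

x    = np.array([0.0, 0.0])
near = np.array([0.1, 0.0])
far  = np.array([5.0, 0.0])

print(rbf(x, near, sigma=1.0))  # close to 1: nearby points act alike
print(rbf(x, far,  sigma=1.0))  # close to 0: distant points look dissimilar

# sigma controls locality: a larger sigma makes distant points more similar.
print(rbf(x, far, sigma=10.0))  # noticeably larger than with sigma=1.0
```

Small σ makes the kernel sharply local (risking overfit); large σ smooths similarity over long distances (risking underfit), matching the tuning advice in the takeaways.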
Finally, the lecture introduces the representer theorem, a foundational result that explains why kernel methods work so broadly. For a large class of regularized learning problems, the solution can be written as f(x) = Σ_i α_i k(x, x_i). This reduces the problem of searching over an enormous space of functions to solving for a finite set of coefficients attached to training examples. It also justifies the kernel trick and shows how learning reduces to working with the kernel matrix.
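As one concrete instance, kernel ridge regression has exactly the representer-theorem form: the coefficients α solve a finite linear system, and prediction is a weighted sum of kernel similarities to training points. A minimal 1-D sketch (the data and hyperparameters are arbitrary choices, not from the lecture):

```python
import numpy as np

def rbf(a, b, sigma=0.5):
    """RBF kernel on scalars; broadcasting builds the full Gram matrix."""
    return np.exp(-(a - b) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 30)           # training inputs
y = np.sin(X) + 0.05 * rng.normal(size=30)  # noisy targets

# Kernel ridge regression: by the representer theorem the solution is
# f(x) = sum_i alpha_i k(x, x_i), with alpha = (K + lambda I)^{-1} y.
lam = 1e-3
K = rbf(X[:, None], X[None, :])
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def f(x):
    return rbf(x, X) @ alpha  # weighted kernel similarities to training points

print(f(np.pi / 2))  # close to sin(pi/2) = 1
```

Note that the infinite-dimensional RBF feature space never appears: learning and prediction touch only the 30×30 kernel matrix and 30 coefficients.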
The lecture targets students with some basic machine learning background—familiarity with linear models, dot products, and classification. By the end, learners should understand how to construct non-linear models by swapping dot products for kernels, how to choose and validate kernels, and how the representer theorem structures solutions. The session lays the groundwork for upcoming topics like kernel PCA and Gaussian processes, which build directly on the ideas introduced here.
Key Takeaways
- ✓ Start with the kernel trick: if your linear algorithm uses dot products, try replacing them with a kernel. This often gives you a non-linear version with almost no extra code. It’s a quick win to test whether non-linear structure improves accuracy. Keep your original training pipeline and just switch in the kernel function.
- ✓ Use RBF as a strong default when you expect smooth, local patterns. Tune σ: small σ fits tightly (risking overfit), large σ smooths broadly (risking underfit). Perform cross-validation to pick a good σ. Watch learning curves to diagnose when to increase or decrease σ.
- ✓ Consider polynomial kernels when feature interactions matter globally. Choose degree d carefully: higher d adds flexibility but increases overfitting risk. Use the constant c to balance lower- and higher-order effects. Validate multiple (d, c) settings to find the right complexity.
- ✓ Always check or ensure PSD of your kernel matrix for stability. If eigenvalues show small negative values from rounding, add a small εI to K. Large negative values suggest the kernel or parameters are unsuitable. Stable PSD matrices lead to well-behaved optimization.
- ✓ Leverage the representer theorem to simplify thinking about solutions. Remember that you only need to learn coefficients tied to training points. This turns infinite-dimensional function fitting into finite parameter learning. It also clarifies why prediction reduces to summing kernel similarities.
- ✓ Add lower-order terms to increase feature map expressiveness when needed. If a non-linear map gives degenerate or weak boundaries, include linear terms as well. This can unlock useful shapes like circles or ellipses. Don’t assume higher degree alone fixes expressiveness gaps.
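The first takeaway (swap dot products for a kernel, keep the rest of the algorithm) can be illustrated with a kernel perceptron on data that no straight line separates. This is a toy sketch, not code from the lecture; the dataset, kernel parameters, and epoch count are all arbitrary choices:

```python
import numpy as np

def poly_kernel(x, xp, c=1.0, d=2):
    """Degree-2 polynomial kernel: enough to separate circular classes."""
    return (np.dot(x, xp) + c) ** d

def kernel_perceptron(X, y, kernel, epochs=50):
    """Perceptron with its dot products swapped for a kernel.

    Instead of a weight vector, we learn one coefficient per training
    point, consistent with the representer theorem.
    """
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (K[i] @ (alpha * y)) <= 0:  # misclassified: bump coefficient
                alpha[i] += 1
    return alpha

def predict(x, X, y, alpha, kernel):
    return np.sign(sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, y, X)))

# Toy data no straight line separates: label is +1 inside a circle.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 2))
y = np.where(np.sum(X**2, axis=1) < 0.5, 1.0, -1.0)

alpha = kernel_perceptron(X, y, poly_kernel)
acc = np.mean([predict(x, X, y, alpha, poly_kernel) == yi for x, yi in zip(X, y)])
print(acc)  # training accuracy
```

Only the similarity computation changed relative to the ordinary perceptron; the update rule and prediction structure are untouched, which is the "almost no extra code" point above.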
Glossary
Kernel methods
A set of techniques that let linear algorithms learn non-linear patterns. They work by pretending data lives in a higher-dimensional space where a straight line can separate it. You don’t need to build these extra features yourself because the kernel trick does it for you. This keeps models simpler to train while becoming more powerful.
Feature map (φ)
A function that converts the original input into new, richer features. In the new space, a straight line can become a curve in the original space. Feature maps can be very high-dimensional or even infinite. They make simple models act complex without changing the learning rules much.
Kernel function
A function that gives the dot product of two inputs in the feature space without building the features. It is written as k(x, x') = φ(x)^T φ(x'). If k is valid, there exists some feature map that makes this true. This lets algorithms work with similarities directly.
Kernel trick
The idea of replacing dot products with a kernel function in algorithms. It saves you from computing huge feature vectors. If the algorithm only needs dot products, it can be run in the new space with no extra code for features. This enables non-linear learning with linear tools.
