Stanford CS329H: Machine Learning from Human Preferences | Autumn 2024 | Ethics
Key Summary
- The lecture explains regularization, a method to reduce overfitting by adding a penalty to the cost (loss) function that discourages overly complex models. Overfitting is when a model memorizes noise in the training data and fails to generalize. Regularization keeps model parameters (weights) from growing too large, which helps models generalize better to new data.
- Two main types are covered: L2 regularization (Ridge) and L1 regularization (Lasso). L2 adds a penalty equal to the sum of squared weights, which shrinks weights toward zero but rarely makes them exactly zero. L1 adds a penalty equal to the sum of absolute values of weights, which often pushes some weights exactly to zero, effectively selecting features.
- The penalty strength is controlled by a hyperparameter called lambda (λ), which you choose, not the model. A larger lambda means stronger penalty and smaller weights; a smaller lambda means weaker penalty and larger weights. Picking lambda balances bias and variance: too large raises bias (underfitting), too small raises variance (overfitting).
- The lecture uses Mean Squared Error (MSE) as the base cost function: the average of squared differences between predictions and true values. Regularization adds a penalty term to this cost, so the optimizer minimizes both data error and weight size. This encourages simpler models that still fit the data reasonably well.
- A geometric view helps build intuition: without regularization, you find the very bottom of the loss landscape (the center of nested ellipses). With L2 (Ridge), you add a circular constraint on weights; you pick the point where the lowest ellipse just touches the circle (tangent). Making the circle smaller (larger lambda) forces smaller weights.
- For L1 (Lasso), the constraint is a diamond (sum of absolute values bounded). The lowest ellipse tends to touch the diamond at its corners, which lie on the axes. Touching a corner means one of the weights equals zero, which explains why L1 creates sparse models with some weights exactly zero.
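To make the penalty terms in the summary concrete, here is a minimal NumPy sketch (all values are made up for illustration, not taken from the lecture): it computes the base MSE, then the L2 (Ridge) and L1 (Lasso) penalties scaled by lambda, and adds each to the cost.

```python
import numpy as np

# Toy predictions, targets, and a weight vector (illustrative values only).
y_true = np.array([3.0, 1.5, 2.0, 4.0])
y_pred = np.array([2.8, 1.9, 2.3, 3.6])
w = np.array([0.5, -1.2, 0.0, 2.0])
lam = 0.1  # lambda: penalty strength, a hyperparameter you choose

mse = np.mean((y_pred - y_true) ** 2)   # base cost: mean squared error
l2_penalty = lam * np.sum(w ** 2)       # Ridge: lambda * sum of squared weights
l1_penalty = lam * np.sum(np.abs(w))    # Lasso: lambda * sum of absolute weights

ridge_loss = mse + l2_penalty   # optimizer would minimize fit error + weight size
lasso_loss = mse + l1_penalty
```

Note that the optimizer never sees the penalties separately: it minimizes the single combined number, so any reduction in weight size must be worth the increase in data error it causes.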
Why This Lecture Matters
Regularization is one of the most important tools in a machine learning practitioner’s kit. Data scientists, ML engineers, analysts, and researchers all face the challenge of overfitting, especially with many features or limited data. By adding a penalty to the loss function, regularization keeps models from becoming too complex and chasing noise, which greatly improves performance on unseen data. It helps solve practical problems like unstable predictions, poor test performance, and confusing models with too many irrelevant features.

This knowledge is directly applicable to real work: you can stabilize forecasting models, simplify feature sets for interpretability, and handle correlated inputs gracefully. In domains such as finance, healthcare, and operations, where reliability and trust matter, regularization makes models consistent and safer to deploy. Mastering L1, L2, and Elastic Net lets you tune models to your exact goals—whether that is best predictive accuracy, lean feature sets, or a careful balance of both. Understanding lambda as a hyperparameter and using validation to pick it gives you a repeatable process for building robust models.

From a career perspective, being fluent in regularization signals a solid grasp of the fundamentals of generalization. The technique underpins classic linear models, logistic regression, and modern deep learning (via weight decay). In an industry where data is noisy and high-dimensional, and where interpretability can be critical, regularization is not optional; it is essential. Knowing when and how to apply L1, L2, and Elastic Net will make your models more trustworthy, your analysis more transparent, and your deployments more successful.
Lecture Summary
Overview
This lecture teaches a powerful technique called regularization, which helps machine learning models avoid overfitting. Overfitting happens when a model learns not only the true patterns in the training data but also the random noise. As a result, it performs well on the training set but poorly on new, unseen data. Regularization fixes this by gently punishing complex models, nudging them to be simpler and more stable. The core idea is simple: add a penalty to the cost (loss) function that grows when the model’s parameters (weights) are large.
The lecture focuses on two main types of regularization: L2 regularization (also called Ridge) and L1 regularization (also called Lasso). With L2, you add a term equal to the sum of the squares of the weights, scaled by a hyperparameter called lambda (λ). This makes all weights smaller but rarely exactly zero. With L1, you add a term equal to the sum of the absolute values of the weights, again scaled by lambda. This version can push some weights exactly to zero, which means the model completely ignores those features. Because of this, L1 naturally performs feature selection. There’s also a combined approach called Elastic Net, which mixes L1 and L2 and requires tuning two hyperparameters.
The lecture uses Mean Squared Error (MSE) as the base cost function example. MSE measures how far predictions are from true values, by averaging the squared differences. Regularization adds a penalty term to MSE, so the optimizer tries to minimize both the prediction errors and the size of the weights. This trade-off is controlled by lambda: a bigger lambda means more penalty on large weights and a simpler model, while a smaller lambda means less penalty and a more flexible model. Lambda is a hyperparameter, which means you choose it rather than the model learning it during training.
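The shrinking effect of lambda can be seen directly with Ridge regression, which has a closed-form solution. The sketch below (synthetic data and values, not from the lecture) fits the same data with a weak and a strong penalty and compares the resulting weight sizes:

```python
import numpy as np

# Synthetic data (illustrative only): 50 samples, 3 features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

def ridge_weights(X, y, lam):
    """Closed-form Ridge solution: w = (X^T X + lam * I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_weak = ridge_weights(X, y, lam=0.01)     # weak penalty: close to plain least squares
w_strong = ridge_weights(X, y, lam=100.0)  # strong penalty: weights pulled toward zero

# The overall weight size shrinks as lambda grows.
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

With the weak penalty, the recovered weights sit close to the true ones; with the strong penalty, every weight is pulled toward zero but none lands exactly at zero, which is the characteristic L2 behavior.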
To build strong intuition, the instructor uses a geometric picture with two weights (W1 and W2) so we can draw everything in two dimensions. The loss function looks like a set of nested ellipses centered at the best-fit point. L2 regularization acts like putting a circle around the origin; you must pick the best point inside that circle. The optimal point ends up at the place where a loss ellipse just touches the circle—like a tire kissing a curb. Making the circle smaller (increasing lambda) forces the solution closer to the origin, shrinking the weights. With L1, the constraint is a diamond; the optimal point often lands on a corner of the diamond, which lies on the axes, meaning one weight becomes exactly zero. This explains why L1 encourages sparse solutions.
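The circle-versus-diamond intuition can also be checked algebraically. For a simplified loss that treats each weight independently, 0.5 * (w - v)^2, the L2- and L1-penalized minimizers have well-known closed forms (shrink-by-scaling for Ridge, soft-thresholding for Lasso). This toy sketch, with made-up values, shows L2 shrinking every weight while L1 zeroes the small ones exactly:

```python
import numpy as np

# For the per-weight loss 0.5 * (w - v)^2 plus a penalty, the minimizers are:
#   L2: w = v / (1 + lam)                 -- every weight shrinks, none hit zero
#   L1: w = sign(v) * max(|v| - lam, 0)   -- weights smaller than lam snap to zero

v = np.array([3.0, 0.4, -0.2, -2.5])  # unregularized solution for each weight
lam = 0.5

w_l2 = v / (1 + lam)                                  # Ridge: uniform scaling
w_l1 = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)  # Lasso: soft-thresholding

# L2 leaves all four weights nonzero; L1 zeroes the two small ones,
# matching the "solution lands on a corner of the diamond" picture.
```

This is the algebraic counterpart of the geometry: the diamond's corners sit on the axes, so the L1 solution tends to have coordinates exactly at zero.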
This lecture is for beginners and intermediates who want a practical and intuitive understanding of why and how regularization helps generalization. You should know what a cost function is, what model parameters (weights) are, and have a basic grasp of MSE and the bias-variance tradeoff. If you remember that overfitting means too much variance and not enough bias, regularization will make sense because it increases bias slightly to reduce variance a lot.
After this lecture, you will be able to explain what regularization is, name and describe L1 and L2 (Lasso and Ridge), and understand why L1 produces feature selection while L2 produces smooth shrinkage. You will be able to describe the role of lambda, why it is a hyperparameter, and what happens when you increase or decrease it. You will also be able to visualize regularization using the circle (L2) and diamond (L1) constraints and the tangency with loss ellipses. While the lecture centers on linear models and MSE, the same ideas extend to many other models and loss functions.
The structure of the lecture begins by recalling overfitting and the bias-variance tradeoff. It then introduces the basic idea of regularization as a penalty added to the loss. Next, it defines L2 and L1 regularization precisely, explains lambda as a hyperparameter, and highlights the practical consequences: prediction quality versus feature selection. Finally, it cements understanding with a clear geometric visualization: ellipses for loss, a circle for L2, a diamond for L1, and the solution at the point of tangency inside the constraint. The closing note reminds you that other tools like cross-validation help pick lambda well, and that regularization is a central technique for building models that generalize.
Key Takeaways
- ✓ Use regularization whenever you see signs of overfitting. Start with L2 (Ridge) for a stable, smooth reduction of weights. Expect slightly higher training error but better validation/test performance. This trade-off is usually beneficial in real applications.
- ✓ Choose L1 (Lasso) when you want feature selection and interpretability. L1 can zero out unimportant features, making models simpler to explain. Make sure to standardize features first so the penalty treats them fairly. Inspect which features remain to communicate insights.
- ✓ Try Elastic Net if features are correlated and you still want sparsity. Tune both the overall strength and the L1/L2 mix. Elastic Net often outperforms pure L1 or L2 in correlated settings. It shares weight among similar features while pruning weaker ones.
- ✓ Tune lambda (λ) carefully using a validation set. Too small a λ risks overfitting; too large a λ causes underfitting. Sweep λ across a logarithmic range and pick the value with the lowest validation error. Prefer simpler models if performance is similar.
- ✓ Always evaluate on unseen data to confirm generalization. Training error alone can mislead, especially without regularization. Keep a clean split between training and validation (and test). Track both errors as you adjust λ to see the bias-variance tradeoff.
- ✓ Standardize or normalize features before using L1 or Elastic Net. Different units can bias which features get zeroed. Scaling ensures the penalty is applied fairly to all features. This leads to more reliable feature selection.
- ✓ Monitor model complexity while tuning. For L1, track the number of nonzero coefficients; for L2, track average coefficient size. Plot validation error versus λ to find a sweet spot. Use these plots to explain decisions to stakeholders.
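Putting several of these takeaways together, a minimal lambda sweep might look like the following sketch (synthetic data and a hand-rolled Ridge fit, purely illustrative): train on one split, evaluate each lambda on a held-out split, and keep the value with the lowest validation error.

```python
import numpy as np

# Synthetic data (illustrative): 10 features, only the first 3 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 10))
true_w = np.zeros(10)
true_w[:3] = [1.5, -2.0, 0.8]
y = X @ true_w + 0.5 * rng.normal(size=120)

# Keep a clean split: fit on training data, choose lambda on validation data.
X_tr, y_tr = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

def ridge_fit(X, y, lam):
    # Closed-form Ridge weights: (X^T X + lam * I)^-1 X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

best_lam, best_err = None, np.inf
for lam in np.logspace(-3, 3, 13):  # sweep lambda over a logarithmic range
    w = ridge_fit(X_tr, y_tr, lam)
    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_err:
        best_lam, best_err = lam, val_err
```

In practice you would use cross-validation rather than a single split (and a library implementation such as scikit-learn's `RidgeCV`), but the structure is the same: the model never sees the data that chooses lambda.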
Glossary
Overfitting
When a model memorizes the training data, including noise, and does poorly on new data. It looks great during training but fails to generalize. This often happens when the model is too complex. Regularization helps prevent this. The goal is to balance fit and simplicity.
Bias-Variance Tradeoff
A balance between a model being too simple (high bias) and too flexible (high variance). Adding regularization increases bias a little to reduce variance a lot. Finding the right balance improves new-data performance. This is central to building good models.
Regularization
A technique that adds a penalty to the loss to discourage complex models with large weights. It keeps models simpler and reduces overfitting. Common types are L1 and L2. The penalty strength is set by a hyperparameter called lambda.
Cost Function (Loss)
A number that measures how wrong a model’s predictions are. Training tries to make this number small. Regularization adds an extra part to this number to penalize complexity. Minimizing total loss balances fit and simplicity.
