Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 7 - Agentic LLMs
Key Summary
- This lecture explains L1 regularization, also called LASSO, as a way to prevent overfitting by adding a penalty to the loss that depends on the absolute values of model weights. Overfitting means a model memorizes the training data but fails on new data. By penalizing large weights, L1 helps the model focus on the strongest, most useful features.
- Regularization adds a term to the loss: objective = loss + λ × penalty, where λ (lambda) controls how strong the penalty is. A bigger λ means the model cares more about keeping weights small, even if training loss rises a bit. This creates a trade-off between fitting the data and keeping the model simple.
- The L1 penalty uses the L1 norm: the sum of absolute values of the weights. In contrast, L2 (Ridge) uses the sum of squared weights. This difference leads to very different behavior when optimizing.
- L1 often sets some weights exactly to zero, which is called sparsity. Zero weights mean those features are effectively removed, so L1 performs automatic feature selection. L2 usually makes weights small but not exactly zero, so it shrinks without selecting.
- The geometric picture explains why L1 gives zeros: in 2D, the L1 constraint region is a diamond, which has sharp corners on the axes. When the loss contours touch the diamond at a corner, one weight becomes exactly zero.
- For L2, the constraint region is a circle, which is smooth and has no corners. When the loss contours touch the circle, the point rarely lies exactly on an axis. That’s why L2 seldom produces exact zeros.
- When you have many features and believe most are irrelevant, L1 is a strong choice. It can select a smaller, more meaningful subset of features automatically. This also makes the model easier to interpret.
Why This Lecture Matters
L1 regularization is vital for anyone working with high-dimensional data, like data scientists, ML engineers, and analysts in fields such as text mining, bioinformatics, finance, and marketing. It solves the real problem of too many features by automatically selecting a small, useful subset, making models simpler, faster, and easier to interpret. In real projects, you rarely know which features are irrelevant ahead of time; L1 discovers this during training. By controlling overfitting, L1 helps models perform better on new, unseen data—exactly what matters in production. This knowledge maps directly to daily work: you can use L1 to prune bloated feature sets, communicate which inputs matter, and reduce maintenance costs. It also provides a pathway when stakeholders demand transparent models, as sparse coefficients clearly show what drives predictions. Understanding the geometry (diamond vs circle) and the optimization issues (non-differentiability at zero) prepares you to choose the right method and optimizer, and to explain trade-offs to teammates and managers. In today’s industry, where datasets can have thousands or millions of features, L1 is a core tool that turns complexity into clarity—a valuable skill that strengthens your career and the reliability of your models.
Lecture Summary
Overview
This lecture teaches L1 regularization (also called LASSO), a powerful technique to prevent overfitting by adding a penalty to a model’s loss function that depends on the sum of absolute values of the model’s weights. Overfitting happens when a model fits training data extremely well but performs poorly on new, unseen data. The key idea behind regularization is to balance two goals at once: fit the data well and keep the model simple. L1 regularization creates simplicity by pushing many weights to be exactly zero, which automatically removes unhelpful features and makes the model easier to interpret.
The lecture starts by reminding you why we regularize at all: models with too many degrees of freedom can latch onto noise in the dataset. To fight this, we add a penalty to the loss, giving the objective function: objective = loss + λ × penalty. The parameter λ (lambda) controls how strong the penalty is; higher λ means stronger regularization, which usually shrinks or zeroes more coefficients. The lecture then contrasts two main types of penalties: L2 (Ridge), which uses the sum of squared weights, and L1 (LASSO), which uses the sum of absolute values. This difference might look small at first, but it leads to very different outcomes when optimizing.
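The objective = loss + λ × penalty template can be written out directly. The sketch below is a minimal NumPy illustration; the weight vector, λ value, and placeholder training loss are made up for the example:

```python
import numpy as np

w = np.array([0.0, 3.0, -2.0, 0.5])  # example weight vector
lam = 0.1                            # regularization strength (lambda)

def l1_penalty(w):
    # L1 (LASSO) penalty: sum of absolute values of the weights
    return np.sum(np.abs(w))

def l2_penalty(w):
    # L2 (Ridge) penalty: sum of squared weights
    return np.sum(w ** 2)

def objective(loss, w, lam, penalty):
    # objective = loss + lambda * penalty
    return loss + lam * penalty(w)

data_loss = 1.25  # placeholder training loss, for illustration only
print(objective(data_loss, w, lam, l1_penalty))  # 1.25 + 0.1 * 5.5  = 1.8
print(objective(data_loss, w, lam, l2_penalty))  # 1.25 + 0.1 * 13.25 = 2.575
```

Raising `lam` makes the penalty term dominate, trading training fit for smaller weights, exactly the trade-off described above.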
A central part of the lecture is the geometric intuition. In two dimensions with weights w1 and w2, you can imagine the optimization as moving contour lines (like elevation lines on a map) of the loss function until they just touch a “constraint region.” For L2, the constraint region is a circle (because w1^2 + w2^2 ≤ c). For L1, the constraint region is a diamond (because |w1| + |w2| ≤ c). Circles are smooth, while diamonds have sharp corners that sit right on the axes. When the loss contours press against the diamond, they are more likely to touch at a corner, which makes one weight exactly zero. This geometric picture explains why L1 creates sparse solutions (lots of zeros), while L2 rarely does.
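The corner-vs-smooth picture also shows up algebraically. In one dimension (an illustrative derivation, not taken from the lecture), minimizing ½(w − b)² + λ|w| gives the soft-thresholding operator, which is exactly zero whenever |b| ≤ λ, while the L2 version ½(w − b)² + (λ/2)w² only rescales b:

```python
import numpy as np

def l1_solution(b, lam):
    # Minimizer of 0.5*(w - b)**2 + lam*|w|: the soft-thresholding operator.
    # The kink of |w| at zero (the diamond's corner) makes the solution
    # exactly 0 whenever |b| <= lam.
    return np.sign(b) * max(abs(b) - lam, 0.0)

def l2_solution(b, lam):
    # Minimizer of 0.5*(w - b)**2 + 0.5*lam*w**2: pure shrinkage.
    # The smooth penalty (the circle) scales b toward 0 but never
    # reaches exactly 0 unless b itself is 0.
    return b / (1.0 + lam)

b, lam = 0.3, 0.5  # unregularized solution smaller than lambda
print(l1_solution(b, lam))  # 0.0  (exact zero: the corner wins)
print(l2_solution(b, lam))  # 0.2  (shrunk, but still nonzero)
```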
The lecture then discusses when to use each method. If you have many features and suspect many are irrelevant (common in high-dimensional problems), L1 is very useful because it performs feature selection automatically. If you have a moderate number of features and believe most matter at least a little, L2 is often better, as it shrinks all weights toward zero without turning them off completely. Sometimes you might want both effects: some sparsity and some stability. That is where Elastic Net comes in—a combined penalty that blends L1 and L2, controlled by a mixing hyperparameter.
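The blended Elastic Net penalty can be sketched under one common parameterization (conventions differ between libraries, so treat the exact scaling below as an assumption):

```python
import numpy as np

def elastic_net_penalty(w, lam, alpha):
    # One common parameterization (scaling conventions vary by library):
    #   lam * (alpha * ||w||_1 + (1 - alpha) * 0.5 * ||w||_2^2)
    # alpha = 1 recovers pure L1 (LASSO); alpha = 0 recovers pure L2 (Ridge).
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return lam * (alpha * l1 + (1 - alpha) * 0.5 * l2)

w = np.array([1.0, -2.0, 0.0])
print(elastic_net_penalty(w, lam=0.1, alpha=1.0))  # pure L1: 0.1 * 3.0 = 0.3
print(elastic_net_penalty(w, lam=0.1, alpha=0.0))  # pure L2: 0.1 * 0.5 * 5.0 = 0.25
```

Intermediate values of `alpha` give some sparsity from the L1 part and some stability from the L2 part.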
Next, the talk covers practical pros and cons. Advantages of L1 include built-in feature selection, which leads to simpler, more interpretable models with fewer non-zero coefficients. This can also reduce storage and computation for prediction. Disadvantages include computational considerations: the absolute value function in L1 is not differentiable at zero (it has a sharp corner), so basic gradient descent doesn’t apply directly. Instead, you use methods like subgradient descent or coordinate descent, which can be slower than the smooth-gradient methods commonly used for L2. Another drawback is stability: L1 solutions can change a lot if the data changes slightly, especially when features are highly correlated, because different features may take turns being selected.
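Coordinate descent for LASSO can be sketched in a few lines: each one-dimensional subproblem has a closed-form soft-thresholding solution, which is exactly where the zeros come from. The data, λ, and iteration count below are made up for illustration:

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal step for the absolute value: returns exactly 0 when |x| <= t.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    # Minimizes (1/(2n)) * ||y - X w||^2 + lam * ||w||_1 by cycling through
    # coordinates; each 1D subproblem is solved exactly by soft-thresholding.
    # Illustrative sketch only, not an optimized implementation.
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n  # per-coordinate curvature
    for _ in range(n_iters):
        for j in range(d):
            residual = y - X @ w + X[:, j] * w[j]  # leave coordinate j out
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# Toy data: only the first two of ten features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

w = lasso_coordinate_descent(X, y, lam=0.2)
print(np.round(w, 2))    # the irrelevant coefficients come out exactly 0
```

Note that the zeros are exact, not just small: whenever a coordinate's correlation with the residual falls below λ, the soft-threshold step returns 0 rather than a tiny value.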
The instructor also answers common questions. Why is L1 computationally heavier? Because its penalty is non-differentiable at zero, forcing us to use slower methods like subgradient or coordinate descent. Why not just remove irrelevant features before training? You often don’t know which ones are irrelevant, or they may have tiny effects that are still useful; L1 can discover this automatically and keep small but real signals if λ isn’t overly large. Finally, choosing λ is critical and is typically done by cross-validation: you try different λ values, check validation performance, and pick the best balance.
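With scikit-learn available (an assumption; note that scikit-learn names λ `alpha` and scales the squared loss by 1/(2n)), the cross-validated search over a logarithmic grid might look like:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Toy data: only two of ten features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# Logarithmic grid spanning several orders of magnitude.
alphas = np.logspace(-4, 1, 30)   # scikit-learn calls lambda "alpha"
model = LassoCV(alphas=alphas, cv=5).fit(X, y)

print(model.alpha_)                # lambda chosen by 5-fold cross-validation
print(np.sum(model.coef_ == 0.0))  # how many features that lambda drops
```

After selecting λ this way, the usual last step is to refit on the full training set and confirm performance on a held-out test set, as the summary describes.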
By the end, you should be able to: explain why we regularize, write the L1 objective function, describe how λ controls the trade-off between fit and simplicity, explain geometrically why L1 sets coefficients to zero while L2 does not, decide when to use L1 vs L2 vs Elastic Net, and understand the computational differences and algorithms used for L1. The lecture is designed for beginners to intermediate learners who know basic linear models and loss functions. Prior knowledge helpful here includes linear regression, the idea of fitting a loss function, and basic optimization concepts like gradients. With these ideas, you can apply L1 to build simpler, more interpretable, and better-generalizing models.
Key Takeaways
- Use L1 when you need feature selection: It sets many coefficients to zero, trimming your model to essentials. Start with a wide feature set, standardize features, and tune λ by cross-validation. Watch the validation curve to avoid over- or under-regularizing. Report the final set of non-zero features for interpretability.
- Prefer L2 when all features matter somewhat: It keeps all coefficients but shrinks them. This is more stable when data changes slightly or features are correlated. Tune λ to balance bias and variance. Expect smoother training with standard gradient methods.
- Try Elastic Net when features are correlated: Pure L1 may pick one feature and drop its twins, causing instability. Elastic Net blends L1 and L2 to keep small groups and improve robustness. Tune both α and λ to get a good mix of sparsity and stability. Validate thoroughly to confirm gains.
- Choose λ with cross-validation: Use a logarithmic grid over several orders of magnitude. Plot validation error versus λ to find the minimum or the simplest model within one standard error of the best. Refit on the full training set with the chosen λ. Finally, test on a held-out set.
- Standardize features before L1: Different scales can distort which coefficients are penalized more. Scaling ensures fair comparison across features. This leads to more reliable selection. Always scale on the training set and apply the same transform to validation/test.
- Understand the geometric intuition: L1’s diamond-shaped constraint favors solutions on axes, causing zeros; L2’s circle does not. This picture explains observed behavior in practice. Use it to justify modeling choices to stakeholders. It also helps debug unexpected sparsity or lack of it.
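The standardization takeaway can be sketched in NumPy; the key point is that the mean and standard deviation are computed on the training split only and then reused on the test split (the shapes and data below are made up for the example):

```python
import numpy as np

def standardize(train, test):
    # Fit the mean/std on the training split ONLY, then apply the same
    # transform to the test split -- never re-fit statistics on test data.
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (train - mu) / sigma, (test - mu) / sigma

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

X_train_s, X_test_s = standardize(X_train, X_test)
print(np.allclose(X_train_s.mean(axis=0), 0.0))  # True: zero mean on train
print(np.allclose(X_train_s.std(axis=0), 1.0))   # True: unit variance on train
```

Without this step, a feature measured on a large scale would be penalized differently from the same feature rescaled, distorting which coefficients L1 zeroes out.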
Glossary
L1 regularization
A method that adds the sum of absolute values of model weights to the loss to discourage complex models. It tends to push many weights to exactly zero. This creates simple, easy-to-understand models. It helps stop the model from memorizing noise.
LASSO
Another name for L1 regularization, short for Least Absolute Shrinkage and Selection Operator. It both shrinks and selects by setting some weights to zero. This gives sparse models. It’s useful when many features are not helpful.
L2 regularization
A method that adds the sum of squared weights to the loss. It shrinks all weights toward zero but usually doesn’t make them exactly zero. It’s smoother and easier to optimize. It helps stabilize models.
Ridge regression
The common name for L2 regularization applied to linear regression. It adds a squared-weight penalty to the mean squared error loss. This prevents large coefficients. It’s helpful when all features matter a little.
Regularization
Any technique that limits model complexity to reduce overfitting. It adds a penalty to the loss to discourage overly flexible models. This improves performance on new data. It balances fit and simplicity.
