Stanford CS230 | Autumn 2025 | Lecture 3: Full Cycle of a DL project
Key Summary
- This lecture introduces supervised learning for regression, where the goal is to predict a real number (like house price) from input features (like square footage, bedrooms, and location). You represent each example as a d-dimensional vector x with a target y. Linear regression models this relationship with a straight-line formula: f(x) = w^T x + b. The focus is on learning weights w and bias b that best map inputs to outputs.
- The model parameters are learned by minimizing a loss function, most commonly Mean Squared Error (MSE). MSE averages the squared differences between true values and predictions across all training examples. Squaring makes bigger mistakes count more than small ones. Lower MSE means the model’s predictions are closer to the true values.
- Gradient descent is used to find w and b that minimize MSE. It starts from an initial guess, computes the gradient (the direction of steepest increase), and moves in the opposite direction to reduce the loss. The size of each move is controlled by a learning rate. This process repeats until the loss stops improving significantly.
- Choosing the learning rate is crucial: too large and the loss can bounce around or even blow up; too small and training crawls slowly. A good approach is to start small, increase until the loss stops decreasing reliably, then back off. Learning rate schedules gradually reduce the rate over time to stabilize training. Adaptive optimizers like Adam can adjust effective learning rates per parameter automatically.
- Linear regression has clear advantages: it’s simple, fast, and easy to interpret. It often serves as a strong baseline before trying fancier models. On large datasets, it can be trained efficiently and deployed quickly. If it performs well enough, you might not need anything more complex.
- However, it assumes a linear relationship between inputs and outputs, which is not always true. It’s sensitive to outliers because MSE squares errors, making extreme points dominate the fit. It can also overfit, learning noise in the training data instead of the true signal. Regularization helps by gently restricting the model’s complexity.
Why This Lecture Matters
This lecture’s content is essential for anyone building predictive systems where the output is a number: data analysts estimating prices, forecasters predicting demand, and engineers modeling performance metrics. Linear regression forms a baseline that is fast to train, easy to interpret, and often surprisingly strong, so it saves time by telling you early whether you need more complex methods. Understanding MSE and gradient descent gives you a reusable playbook for optimizing models: define a clear loss, compute gradients, pick a learning rate, and iterate. Learning about learning rate schedules and adaptive methods like Adam prepares you for practical stability and speed when training on real datasets.

Regularization (L1 and L2) directly addresses overfitting, one of the most costly failures in real projects, where a model appears great in development but fails in production. L1 teaches you how to perform feature selection automatically and build sparse, interpretable models; L2 teaches you how to stabilize fits, especially when features correlate, improving robustness. Polynomial regression shows how to capture curves while keeping the simplicity of linear-in-parameter models, making it a pragmatic bridge to non-linear modeling without jumping to complex architectures.

By mastering these tools, you gain the ability to quickly prototype, evaluate, and deploy regression models that generalize. In a career context, these skills are foundational: they appear in interviews, code challenges, and day-to-day modeling tasks. In the broader industry, even advanced systems often start with or benchmark against linear models; being fluent here helps you judge when to move to more sophisticated approaches and how to regularize them properly. Ultimately, this knowledge equips you to make data-driven decisions with confidence, control model complexity, and deliver reliable predictions that matter in business and science.
Lecture Summary
Overview
This lecture teaches the core ideas behind supervised learning for regression, focusing on the classic and foundational method: linear regression. In supervised learning, you are given input–output pairs: each input x is a vector of features (for example, house characteristics like square footage, number of bedrooms, and location), and each output y is the real number you want to predict (such as the house price). The goal is to learn a function f that maps any new input x to a good prediction of y. Linear regression assumes this mapping is a straight line (technically, a hyperplane) in feature space: f(x) = w^T x + b, where w is a vector of weights and b is a bias (intercept).
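The model described above can be sketched in a few lines of NumPy. The feature values, weights, and bias below are made-up illustrative numbers, not parameters learned from real data:

```python
import numpy as np

# Hypothetical house features: [square footage, bedrooms, distance to city (km)]
x = np.array([1500.0, 3.0, 10.0])

# Illustrative weights and bias (in practice these are learned, not hand-picked)
w = np.array([200.0, 10000.0, -500.0])
b = 50000.0

# Linear regression prediction: f(x) = w^T x + b
price = w @ x + b
print(price)  # 200*1500 + 10000*3 - 500*10 + 50000 = 375000.0
```

Each weight has a direct interpretation: here, every extra square foot adds 200 to the predicted price, and every kilometer from the city subtracts 500.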
To measure how well the model predicts, the lecture uses Mean Squared Error (MSE), which averages the squares of the differences between actual values and predicted values. Squaring emphasizes larger mistakes more than smaller ones, which helps the optimization focus on big errors. The parameters w and b are found by minimizing MSE. The tool for this is gradient descent, an iterative algorithm: start with an initial guess for w and b, compute the gradient (the direction that increases the loss fastest), and step in the opposite direction to reduce the loss. The size of each step is determined by a learning rate hyperparameter.
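Putting the loss and the update rule together, a minimal gradient-descent loop might look like the sketch below. The synthetic data assumes true weights [3, -2] and bias 5, and the learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3*x1 - 2*x2 + 5 plus a little noise (assumed ground truth)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 5.0 + 0.01 * rng.normal(size=100)

w = np.zeros(2)   # initial guess for the weights
b = 0.0           # initial guess for the bias
lr = 0.1          # learning rate (step size)

for _ in range(500):
    pred = X @ w + b                 # predictions f(x) = w^T x + b
    err = pred - y                   # residuals
    loss = np.mean(err ** 2)         # MSE
    grad_w = 2 * X.T @ err / len(y)  # d(MSE)/dw
    grad_b = 2 * np.mean(err)        # d(MSE)/db
    w -= lr * grad_w                 # step opposite the gradient
    b -= lr * grad_b

print(w, b)  # should approach [3, -2] and 5
```

Because the gradient points in the direction of steepest increase of the loss, subtracting it moves the parameters downhill toward the minimum.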
The lecture explains that choosing a good learning rate is essential. If it is too large, the updates can overshoot the minimum, causing the loss to bounce or even explode. If it is too small, progress is painfully slow and training takes a long time. Helpful strategies include starting small and increasing until improvement slows, then backing off, using learning rate schedules that gradually reduce the rate during training, or using adaptive methods like Adam that adjust the effective learning rate for each parameter automatically.
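One simple way to implement a schedule is step decay. The `lr_schedule` helper below, along with its decay factor and interval, is a hypothetical choice for illustration, not a recipe from the lecture:

```python
def lr_schedule(initial_lr, step, decay=0.5, every=100):
    """Halve the learning rate every `every` steps (step decay)."""
    return initial_lr * (decay ** (step // every))

print(lr_schedule(0.1, 0))    # 0.1   (first 100 steps)
print(lr_schedule(0.1, 100))  # 0.05  (next 100 steps)
print(lr_schedule(0.1, 250))  # 0.025 (after two decays)
```

Calling this inside the training loop before each update lets early iterations take large steps while later iterations settle gently near the minimum.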
Linear regression is praised for being simple, interpretable, and computationally efficient. It is often recommended as a baseline: try it first, and if it performs sufficiently well, you may not need a more complex model. However, it has limitations. The most important one is that it assumes a linear relationship between features and the target, which is not always realistic. The lecture gives examples of non-linear relationships: electricity demand versus temperature tends to be high at both hot and cold extremes, which looks like a U-shaped curve rather than a straight line; tree height versus age shows rapid growth early and slower growth later, another clearly non-linear pattern. Linear regression is also sensitive to outliers due to the squared-error loss, and it can overfit when it memorizes training noise rather than learning general patterns.
To address overfitting, regularization adds a penalty to the loss that discourages overly complex models. The lecture presents two main types: L1 regularization (lasso) adds the sum of absolute values of the weights to the loss, promoting sparsity (many weights become exactly zero) and thus performing feature selection. L2 regularization (ridge) adds the sum of squared weights, shrinking weights toward zero but rarely turning them off completely, which stabilizes the fit and reduces variance. A hyperparameter lambda controls the strength of this penalty: higher lambda means a stronger push toward simpler models, while lower lambda allows more flexibility.
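The penalized loss can be written as a small helper. The function name `penalized_mse`, its arguments, and the toy numbers are illustrative assumptions; note that by convention the bias is left unpenalized:

```python
import numpy as np

def penalized_mse(w, b, X, y, lam, kind="l2"):
    """MSE plus an L1 or L2 penalty on the weights (bias b is not penalized)."""
    mse = np.mean((X @ w + b - y) ** 2)
    if kind == "l2":                       # ridge: lambda * sum(w_j^2)
        penalty = lam * np.sum(w ** 2)
    else:                                  # lasso: lambda * sum(|w_j|)
        penalty = lam * np.sum(np.abs(w))
    return mse + penalty

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
total = penalized_mse(w, 0.0, X, y, lam=0.1, kind="l2")
print(total)  # MSE 4.25 + L2 penalty 0.05 = 4.3
```

Raising `lam` makes the penalty term dominate, pushing the optimizer toward smaller (ridge) or sparser (lasso) weight vectors.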
The lecture also introduces polynomial regression as an extension that can capture non-linear relationships without abandoning linear regression’s simplicity. The trick is to expand the input features by adding polynomial terms (e.g., x^2, x^3), and then apply standard linear regression on the expanded feature set. This keeps the model linear in the parameters but allows curved fits. Ridge regression (linear regression with L2 regularization) and lasso regression (linear regression with L1 regularization) are highlighted as practical tools to control complexity and improve generalization.
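The feature-expansion trick is itself only a few lines. `poly_features` below is a hypothetical helper name (in practice, scikit-learn's `PolynomialFeatures` performs the same expansion):

```python
import numpy as np

def poly_features(x, degree):
    """Expand a 1-D input into columns [x, x^2, ..., x^degree]."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.array([1.0, 2.0, 3.0])
print(poly_features(x, 3))
# Columns are x, x^2, x^3; ordinary linear regression on these
# columns fits a cubic curve while staying linear in the parameters.
```

Because the model is still linear in w, the same MSE loss and gradient-descent machinery apply unchanged to the expanded features.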
By the end, you understand the full loop of building a basic regression model: define the model form (linear), choose a loss (MSE), minimize it with gradient descent (tuning the learning rate), be aware of outliers and non-linearity, and use regularization to prevent overfitting. With these ideas in place, you are set up to move on to logistic regression for classification problems, where many of the same optimization and hyperparameter principles apply. The lecture’s structure flows from the problem setup (supervised regression), to the model and loss, to optimization via gradient descent, to practical concerns (learning rate, pros/cons, outliers), and finally to extensions (regularization and polynomial features), giving a complete and clear foundation for predictive modeling with linear methods.
Key Takeaways
- ✓ Start with a clear mapping from inputs to outputs: define features x and target y precisely. Clean up obvious outliers since MSE heavily penalizes large errors. Organize your data into a matrix X (rows are examples, columns are features). This structure makes training and debugging smoother.
- ✓ Use the simplest effective model first: linear regression with bias. Its formula f(x) = w^T x + b is interpretable and fast to train. If it performs well, you save time by avoiding unnecessary complexity. If not, you’ll have a clear baseline to beat.
- ✓ Minimize Mean Squared Error to train the model. MSE averages squared residuals, accentuating large mistakes so the model pays attention to them. Monitor MSE each iteration to confirm learning progresses. If MSE increases or oscillates, adjust the learning rate.
- ✓ Implement gradient descent in a tight, efficient loop. Compute predictions, errors, gradients, and then update parameters. Keep track of loss to gauge progress and detect issues quickly. Stop when improvements are consistently tiny.
- ✓ Tune the learning rate carefully; it is the most sensitive hyperparameter in basic training. Too large causes divergence; too small wastes time. Start modestly, then adjust based on the shape of the loss curve. Consider a simple schedule to reduce the rate as training settles.
- ✓ Include a bias term to avoid forcing the fit through the origin. The intercept captures the baseline level of your target. Omitting it can add systematic error across all predictions. Always check that b is learned and makes sense.
- ✓ Regularize when you see signs of overfitting or instability. L2 (ridge) shrinks weights smoothly and helps with correlated features. L1 (lasso) zeros many weights, acting as feature selection. Choose λ by trying a range and watching generalization behavior.
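The takeaways above combine into one compact training loop with a simple stopping rule. The synthetic data (true slope 4, intercept 1), noise level, learning rate, and improvement tolerance are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# One-feature synthetic data: y = 4*x + 1 plus noise (assumed ground truth)
X = rng.normal(size=(200, 1))
y = 4.0 * X[:, 0] + 1.0 + 0.05 * rng.normal(size=200)

w, b, lr, prev_loss = np.zeros(1), 0.0, 0.1, np.inf
for step in range(10_000):
    err = X @ w + b - y              # residuals
    loss = np.mean(err ** 2)         # monitor MSE every iteration
    if prev_loss - loss < 1e-9:      # stop when improvement is consistently tiny
        break
    prev_loss = loss
    w -= lr * 2 * X.T @ err / len(y) # gradient step for the weights
    b -= lr * 2 * np.mean(err)       # gradient step for the bias

print(step, w, b)  # w near 4, b near 1
```

Checking the learned intercept b against the baseline level of y (here about 1) is a quick sanity test that the bias term is doing its job.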
Glossary
Supervised learning
A way to teach a model using examples that include both inputs and the correct outputs. The model learns patterns that connect inputs to outputs. After training, it can predict outputs for new inputs it has never seen. It’s like a teacher giving practice problems with answer keys.
Regression
A type of supervised learning where the output is a real number. It focuses on predicting continuous values instead of categories. Examples include prices, temperatures, or speeds. It answers questions like 'How much?' or 'How many?'
Linear regression
A simple model that predicts a number by drawing a straight line (or flat plane) through data points. It assumes the relationship between inputs and output is linear. The formula is f(x) = w^T x + b. It is easy to train and understand.
Feature
A measurable piece of information used as input to a model. Features are the ingredients that help the model make predictions. They are usually numbers in a vector. Good features carry useful signals about the target.
Target (label)
The output value the model is trying to predict. In regression, it’s a real number. During training, targets are known and guide learning. The model aims to match targets closely.
