Stanford CS329H: Machine Learning from Human Preferences | Autumn 2024 | Mechanism Design
Key Summary
- This lesson explains the core pieces of machine learning: data (X and Y), models f(x;θ), loss functions that measure mistakes, and optimizers that adjust θ to reduce the loss. It divides learning into supervised (with labels), unsupervised (without labels), and reinforcement learning (with rewards). The focus here is on supervised learning, especially regression and classification, plus a short intro to k-means clustering.
- Regression predicts numbers, like house prices, while classification predicts categories, like cat vs. dog. Linear regression models the relationship with a straight line, y = w^T x + b. You choose w and b to make predictions close to the real values, as measured by mean squared error (MSE).
- Mean squared error averages the squared gaps between predictions and true values. Squaring punishes bigger mistakes more than smaller ones, which makes the model pay attention to large errors. Minimizing MSE finds the best-fit line through the scatter of points.
- Gradient descent is how we search for good w and b. It starts with initial guesses and takes steps in the direction that most reduces the loss. A learning rate controls how big each step is; too big can overshoot, while too small makes training slow.
- An example fits a line to two points, (1,1) and (2,2), starting from w=0, b=0 and repeatedly updating with the gradients. As the updates continue, the line rotates and shifts until the MSE stops getting smaller; this point is called convergence.
- If the data is not well described by a straight line, linear regression may fail. In that case, we can move to nonlinear models such as polynomial regression, decision trees, or neural networks, which can curve or split the input space to fit complex patterns.
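The MSE idea above can be sketched in a few lines of Python. The data values here are illustrative, not from the lecture:

```python
# Minimal sketch of mean squared error (MSE) for a linear model
# y_hat = w * x + b, using the lecture's symbols. The data points
# below are made up for illustration.

def mse(xs, ys, w, b):
    """Average of squared gaps between predictions and true values."""
    errors = [(w * x + b - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

xs, ys = [1.0, 2.0, 3.0], [1.2, 1.9, 3.1]
print(mse(xs, ys, 1.0, 0.0))  # a line close to the data: small MSE
print(mse(xs, ys, 0.0, 0.0))  # predicting 0 everywhere: much larger MSE
```

Because the errors are squared, the flat line (w=0, b=0) is penalized far more than the nearly correct line, which is exactly why minimizing MSE pulls the fit toward the data.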
Why This Lecture Matters
Understanding linear regression, logistic regression, and k-means gives you the keys to the machine learning toolbox. These methods are fast, interpretable, and form a strong baseline for many real problems. Product analysts can forecast metrics (regression) or predict conversions (classification) quickly and clearly explain which features matter. Data scientists can segment customers, content, or behaviors with k-means to drive personalization and marketing strategies without any labels. Engineers can deploy lightweight models where speed and transparency are crucial, and use them to benchmark more complex methods. This knowledge solves common problems: how to define your task (regression vs. classification), how to measure success (MSE, cross-entropy, WCSS), and how to actually train models (gradient descent, convergence). It also shows how to handle unlabeled data and choose hyperparameters like learning rate and k with simple, practical heuristics (learning curves, elbow plots, silhouette scores). Mastering these concepts boosts your career because they transfer to advanced models: decision trees, gradient boosting, and neural networks still use the same ideas—models, losses, and optimizers. In a fast-moving industry, being able to ship reliable, interpretable baselines quickly is a superpower that guides better decisions and more ambitious modeling later on.
Lecture Summary
Overview
This lesson builds a solid foundation in basic machine learning by focusing on the simplest and most essential algorithms. It starts by clarifying what a machine learning system is: data as input-output pairs (X and Y), a model f(x;θ) with parameters θ, a loss function to measure how wrong predictions are, and an optimizer to tune the parameters to reduce the loss. The lecture quickly reviews the three main learning settings—supervised learning (with labeled targets), unsupervised learning (with no labels), and reinforcement learning (with rewards)—and then dives deeply into supervised learning, with short but clear coverage of unsupervised learning at the end.
Within supervised learning, two problem types are highlighted: regression and classification. Regression predicts continuous values (like house prices), and the simplest approach is linear regression, which fits a straight line when there is one input, or a plane (more generally, a linear function) when there are multiple inputs (features). The model is f(x; w, b) = w^T x + b. To learn the parameters w and b, you minimize mean squared error (MSE), which averages the squares of the prediction errors. The optimizer of choice here is gradient descent, an iterative method that updates the parameters in steps proportional to how much the loss changes with respect to them (their gradients). An example with two data points, (1,1) and (2,2), demonstrates how gradient descent starts with w=0, b=0 and progressively adjusts them until the loss stops decreasing—this is called convergence.
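The two-point example can be sketched directly. The learning rate and step count below are assumptions for illustration; the lecture does not fix them:

```python
# Sketch of the lecture's gradient-descent example: fit y = w*x + b
# to the two points (1, 1) and (2, 2), starting from w = 0, b = 0.
# Learning rate 0.1 and 2000 steps are assumed values, not from the lecture.

xs, ys = [1.0, 2.0], [1.0, 2.0]
w, b, lr = 0.0, 0.0, 0.1

for step in range(2000):
    n = len(xs)
    # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) with respect to w and b.
    dw = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    db = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
    w -= lr * dw
    b -= lr * db

print(w, b)  # approaches w ≈ 1, b ≈ 0, the exact line through both points
```

Both points lie on y = x, so the loss can reach zero; the loop stops changing meaningfully once the gradients shrink toward zero, which is the convergence behavior the lecture describes.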
The lecture also answers a common question: what if the relationship isn’t a straight line? Linear regression won’t fit well if the pattern is curved or complex. In those cases, we can use nonlinear models such as polynomial regression, decision trees, or neural networks. The point is to understand linear regression thoroughly as a baseline, because it’s interpretable, fast to run, and a building block for more advanced methods.
For classification, the lesson introduces logistic regression. The idea is similar to linear regression, but instead of directly predicting a number, the model produces a probability by passing a linear score through the sigmoid function σ(z)=1/(1+e^(−z)). This gives outputs between 0 and 1 and allows easy thresholding (e.g., classify as 1 if probability ≥ 0.5). The training objective switches to cross-entropy (also called negative log-likelihood), which heavily penalizes confident wrong predictions and encourages correct high-confidence ones. The gradient-based updates look similar to linear regression but use the predicted probabilities in the derivatives. The instructor uses a tiny dataset, (1,0) and (2,1), to show how learning proceeds from initial guesses to sensible probabilities that separate the classes.
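A minimal from-scratch sketch of logistic regression on the lecture's tiny dataset, (1,0) and (2,1). The learning rate and iteration count are assumptions:

```python
import math

# Logistic regression on the two points (x, y) = (1, 0) and (2, 1),
# trained by gradient descent on average cross-entropy.
# Learning rate 0.5 and 5000 steps are illustrative assumptions.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

xs, ys = [1.0, 2.0], [0.0, 1.0]
w, b, lr = 0.0, 0.0, 0.5

for step in range(5000):
    n = len(xs)
    probs = [sigmoid(w * x + b) for x in xs]
    # Gradients of average cross-entropy; note they use the predicted
    # probabilities, unlike the MSE gradients in linear regression.
    dw = sum((p - y) * x for p, x, y in zip(probs, xs, ys)) / n
    db = sum((p - y) for p, y in zip(probs, ys)) / n
    w -= lr * dw
    b -= lr * db

p1, p2 = sigmoid(w * 1 + b), sigmoid(w * 2 + b)
print(p1, p2)  # p1 drifts toward 0 and p2 toward 1 as training proceeds
```

Thresholding at 0.5 then classifies x=1 as class 0 and x=2 as class 1, the "sensible probabilities that separate the classes" the lecture refers to.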
Finally, the lecture introduces unsupervised learning through k-means clustering, which finds structure in data without labels by grouping points into k clusters. K-means alternates between two steps: assignment (send each point to its nearest centroid) and update (move each centroid to the mean of its assigned points). The objective is to minimize the within-cluster sum of squares (WCSS): the total squared distances from points to their centroids. Choosing k is important: the elbow method looks for a bend in the WCSS curve where adding more clusters yields diminishing returns, and the silhouette score compares within-cluster tightness to separation from other clusters, preferring values closer to 1. A 2D point cloud example with k=2 walks through random initialization, assignment, update, and convergence.
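The assignment/update loop can be sketched in plain Python on a small 2D point cloud with k=2. The data and random seed are illustrative assumptions:

```python
import random

# From-scratch k-means (k = 2) on a tiny 2D point cloud, mirroring the
# lecture's assignment/update loop. The six points and the seed are
# made up for illustration.

def dist2(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[j].append(p)
        # Update step: each centroid moves to the mean of its points.
        new = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if new == centroids:  # centroids stopped moving: converged
            break
        centroids = new
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
centroids, clusters = kmeans(points, k=2)
# WCSS: total squared distance from each point to its own centroid.
wcss = sum(dist2(p, c) for c, cl in zip(centroids, clusters) for p in cl)
print(centroids, wcss)
```

Rerunning this for k = 1, 2, 3, … and plotting the resulting WCSS values is exactly the elbow-method procedure: on data like this, WCSS drops sharply from k=1 to k=2 and only marginally afterwards.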
By the end, you will understand how to define problems as regression or classification, select linear or logistic regression as strong baselines, compute losses (MSE and cross-entropy), perform gradient descent updates, and run k-means to discover patterns without labels. You’ll also know how to think about hyperparameters like learning rate and k, plus when to consider moving beyond linear models. This knowledge sets the stage for more advanced algorithms like decision trees and neural networks, which build on the same ideas of models, losses, and optimization.
Key Takeaways
- ✓ Always start by defining the task: regression for numbers, classification for categories, unsupervised for structure. This choice determines the model, loss, and evaluation. Misclassifying the problem leads to the wrong tools and poor results. A clear task definition speeds up everything else.
- ✓ Use linear regression as a baseline when predicting numbers. It’s fast, explainable, and often surprisingly strong. Plot predictions vs. true values and compute MSE to judge fit. If patterns look curved, consider nonlinear features next.
- ✓ Optimize with gradient descent and pick a sensible learning rate. If the loss increases or oscillates, reduce the rate; if progress is slow, increase it a little. Monitor a curve of loss vs. iteration to check stability. Stop when improvements flatten.
- ✓ Scale features before gradient-based training and k-means. Standardization (mean 0, std 1) prevents large-scale features from dominating. It speeds convergence and improves numerical stability. Unscaled features often cause poor fits and misleading distances.
- ✓ Monitor convergence with simple rules. In regression and classification, stop when the loss changes by less than a tiny threshold for several steps. In k-means, stop when centroids barely move or assignments stop changing. This saves time without losing accuracy.
- ✓ For classification, trust the probabilities from logistic regression. Cross-entropy trains the model to be accurate and well calibrated. Choose a threshold that matches your goals (e.g., higher recall vs. higher precision). Evaluate beyond accuracy if classes are imbalanced.
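The standardization step recommended above is a one-liner per feature. A minimal sketch, using the population standard deviation and made-up feature values:

```python
# Standardize a feature column to mean 0, std 1, as recommended before
# gradient-based training and k-means. The house-size and age values
# are illustrative assumptions.

def standardize(column):
    n = len(column)
    mu = sum(column) / n
    sigma = (sum((v - mu) ** 2 for v in column) / n) ** 0.5
    return [(v - mu) / sigma for v in column]

sizes = [800.0, 1200.0, 2000.0]  # e.g., square feet: large-scale feature
ages = [2.0, 10.0, 30.0]         # e.g., years: small-scale feature
print(standardize(sizes))
print(standardize(ages))         # both columns now on a comparable scale
```

After this transform, no single feature dominates the squared-distance computations in k-means or the gradient magnitudes in gradient descent.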
Glossary
Supervised Learning
A type of machine learning where each input has a known correct output (label). The goal is to learn a function that maps inputs to outputs. The model uses many examples to spot patterns that connect X to Y. It is used for tasks like predicting prices or classifying emails. It relies on having labeled training data.
Unsupervised Learning
Learning patterns from data that has no labels. The model tries to group, compress, or discover structure in the inputs. It can reveal hidden clusters or relationships. There is no teacher telling the right answer. It is helpful for exploring data and finding segments.
Reinforcement Learning
A learning setup where an agent takes actions and gets rewards or penalties. The goal is to pick actions that maximize long-term rewards. The agent learns by trial and error. There are no direct labels for each input, only feedback signals. It suits tasks like games and robotics.
Model f(x; θ)
A mathematical function that maps input x to output using parameters θ. Parameters are like tunable dials that shape the model’s behavior. Training adjusts θ so outputs match desired targets. Different models use different shapes, like linear or logistic. It is the core of prediction.
