Stanford CS230 | Autumn 2025 | Lecture 5: Deep Reinforcement Learning
Key Summary
- Logistic regression is a simple method for binary classification that outputs a probability between 0 and 1 for class 1. It takes a weighted sum of input features (w^T x + b) and passes it through the sigmoid function. The sigmoid is an S-shaped curve that squashes any real number into the (0, 1) range.
- The model is linear in its decision boundary, even though it uses a non-linear sigmoid to produce probabilities. The decision boundary is where w^T x + b = 0, which is a straight line (or a flat plane) in feature space. Points on one side are predicted as class 1; on the other side, class 0.
- Training uses a special cost function called log loss (cross-entropy), not mean squared error. Log loss is convex for logistic regression, so gradient descent can find the global minimum. Using MSE makes the optimization non-convex and can trap training in bad local minima.
- Log loss punishes confident wrong predictions heavily and rewards confident correct ones. If the true label is 1, the loss is -log(y_hat); if the true label is 0, the loss is -log(1 - y_hat). This pushes probabilities toward the correct side without saturating too early.
- The training loop follows a repeatable pattern: initialize w and b, compute predictions with sigmoid, compute log loss, compute gradients, and update w and b. Repeat until the loss stops improving (converges). This can be done per example (SGD), in mini-batches, or on the whole dataset (batch GD).
- Gradients are simple and elegant: dL/dz = y_hat - y, so dL/dw = x (y_hat - y) and dL/db = y_hat - y. Here z is w^T x + b, y_hat is sigmoid(z), and y is the true label (0 or 1). These formulas keep the implementation short and fast.
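The gradient formulas in the summary can be sketched and sanity-checked in a few lines of NumPy. This is a minimal illustration for a single training example; the variable values are made up for the demonstration:

```python
import numpy as np

def sigmoid(z):
    # Squash a real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# One training example: features x, true label y, parameters w, b (illustrative values)
x = np.array([0.5, -1.2, 3.0])
y = 1.0
w = np.array([0.1, 0.2, -0.3])
b = 0.05

z = w @ x + b          # linear score z = w^T x + b
y_hat = sigmoid(z)     # predicted probability of class 1

# Gradients from the summary: dL/dw = x * (y_hat - y), dL/db = y_hat - y
dw = x * (y_hat - y)
db = y_hat - y
```

One way to trust these formulas is to compare `dw` against a finite-difference estimate of the loss gradient; the two should agree to several decimal places.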
Why This Lecture Matters
Logistic regression sits at the heart of many real-world decision systems because it is simple, fast, and outputs probabilities you can trust. Product teams use it to quickly ship features like spam filters, risk scores, and click predictions, since the model trains quickly and scales well. Data analysts appreciate its interpretability: the sign and size of each weight explain how features push outcomes, which helps build stakeholder confidence and guide business changes. Engineers can easily integrate it into pipelines and make threshold-based decisions that reflect costs and benefits, such as catching more fraud versus reducing false alerts. Even when you plan to use more complex models later, logistic regression is the right baseline to set expectations, find data issues early, and provide a stable comparison point.

This knowledge also solves common pitfalls in classification. Many beginners try mean squared error and get poor results; understanding why cross-entropy is the right loss prevents wasted time. Grasping convexity and gradient descent makes training predictable and debuggable. Learning to tune thresholds turns raw probabilities into practical actions, matching business goals like high recall or high precision. In career terms, mastering logistic regression proves you understand core ML ideas: modeling probabilities, choosing proper loss functions, optimizing with gradients, and interpreting outputs. These skills transfer directly to more advanced models and are highly valued in industry.
Lecture Summary
Overview
This lesson teaches the core ideas and practical steps of logistic regression, a foundational method for binary classification. Binary classification means you have input features (x) and want to predict whether the output label (y) belongs to class 0 or class 1. Logistic regression works by taking a linear combination of inputs (w^T x + b) and then passing that value through the sigmoid function, which squashes any real number into a probability between 0 and 1. This probability is interpreted as the model's belief that the input belongs to class 1. Despite using the non-linear sigmoid, logistic regression is still a linear model because its decision boundary, the place where the model switches between predicting class 0 and class 1, is a straight line (or a flat plane in higher dimensions).
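These pieces can be sketched in a few lines of NumPy; the function names here are my own, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Probability of class 1 for each row of X: sigmoid(w^T x + b)."""
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    """Hard labels: 1 where the probability reaches the threshold."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# The decision boundary is where w^T x + b = 0, i.e. where the
# probability equals 0.5 -- a straight line (or plane) in feature space.
```

Note that the non-linearity of the sigmoid never bends the boundary: the comparison against 0.5 reduces to checking the sign of the linear score w^T x + b.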
A major focus is on training the model correctly. Instead of mean squared error (MSE), which is common in linear regression, logistic regression uses a different cost function called log loss (also known as cross-entropy). This choice matters a lot: using MSE with a sigmoid leads to a non-convex optimization surface, where gradient descent can get stuck in bad local minima. In contrast, log loss for logistic regression is convex, so gradient descent has a clear path to the global minimum. The lesson breaks down how log loss behaves: it gives low cost for confident correct predictions and very high cost for confident wrong ones, which is exactly what you want when learning probabilities.
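The log-loss described above might be implemented as follows; this is a sketch, and the small clipping constant is a common convention to keep the logarithms finite:

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-15):
    """Mean cross-entropy: -[y*log(y_hat) + (1-y)*log(1-y_hat)].

    y_hat is clipped away from exactly 0 and 1 so log never
    receives an argument of 0.
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```

Evaluating it on a confident correct prediction versus a confident wrong one shows the asymmetry the lesson describes: the wrong one is penalized far more heavily.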
You'll also see how to train the model step by step. First, you initialize weights and bias (often with small random values). Then, for each training example, you compute the linear score z = w^T x + b, apply the sigmoid to get the probability y_hat, and compute the log loss comparing y_hat to the true label y. Next, you compute gradients, the directions in which to change w and b to reduce the loss, and update the parameters. You repeat this process until the loss stops improving significantly, which is called convergence. This can be done with full-batch gradient descent, mini-batch gradient descent, or stochastic gradient descent (one example at a time), depending on data size and speed needs.
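The full loop might look like this as full-batch gradient descent in NumPy; the hyperparameter values are illustrative defaults, not prescriptions from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=1000, tol=1e-6):
    """Full-batch gradient descent for logistic regression."""
    n, d = X.shape
    w = np.zeros(d)        # simple initialization; small random values also work
    b = 0.0
    prev_loss = np.inf
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)                  # predictions for all samples
        p = np.clip(y_hat, 1e-15, 1 - 1e-15)
        loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        grad_w = X.T @ (y_hat - y) / n              # dL/dw averaged over the batch
        grad_b = np.mean(y_hat - y)                 # dL/db averaged over the batch
        w -= lr * grad_w
        b -= lr * grad_b
        if prev_loss - loss < tol:                  # convergence: loss stopped improving
            break
        prev_loss = loss
    return w, b
```

Swapping the full batch for random subsets of rows turns this into mini-batch or stochastic gradient descent without changing the update rule.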
The lesson clearly lays out the strengths and weaknesses of logistic regression. On the plus side, it's simple, easy to implement, fast to train and predict, and provides probability outputs, which are useful for ranking, thresholding, and decision-making under uncertainty. It is also interpretable: each feature's weight shows how that feature nudges the odds of class 1 up or down. On the minus side, it's a linear classifier, so it cannot easily model curved or complex relationships in the data unless you add new features that capture those patterns. If your data is not linearly separable, you may need feature engineering such as polynomial terms or interactions.
Finally, the lesson touches on extending logistic regression beyond two classes. Although the basic model handles two classes, you can adapt it to multi-class problems using strategies like one-vs-all (also called one-vs-rest) or one-vs-one. With one-vs-all, you train one classifier per class against all the others and pick the class with the highest predicted probability at prediction time. The key idea remains: use the sigmoid to produce probabilities, use cross-entropy to train, and use gradient descent to find good weights. By the end, you will understand when logistic regression is a good choice, how to implement it, why cross-entropy is the right loss, and how to interpret its outputs and limitations.
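The one-vs-all strategy can be sketched on top of any binary trainer. Here the binary trainer is a bare-bones gradient-descent loop included so the example is self-contained; all names and hyperparameters are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, lr=0.5, epochs=500):
    """Plain logistic regression on 0/1 labels (full-batch gradient descent)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        err = sigmoid(X @ w + b) - y        # y_hat - y for every sample
        w -= lr * X.T @ err / len(y)
        b -= lr * err.mean()
    return w, b

def one_vs_all_fit(X, y, n_classes):
    """Train one binary classifier per class: that class (1) vs the rest (0)."""
    return [train_binary(X, (y == k).astype(float)) for k in range(n_classes)]

def one_vs_all_predict(X, models):
    """Pick the class whose classifier reports the highest probability."""
    probs = np.column_stack([sigmoid(X @ w + b) for w, b in models])
    return probs.argmax(axis=1)
```

For k classes this trains k binary models; the alternative one-vs-one scheme instead trains one model per pair of classes and votes among them.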
This material is most suitable for beginners who know basic algebra and have seen linear regression before. You don't need advanced math, but it helps to be comfortable with vectors, simple functions, and the idea of minimizing a loss. After working through this, you will be able to build a binary classifier end-to-end: prepare your data, train a logistic regression model with cross-entropy loss, make predictions, choose thresholds, and understand what the model is doing under the hood. You can use these skills directly in real projects like spam detection, medical test prediction, and click-through rate estimation, or as a springboard to more advanced models later.
Key Takeaways
- Start with cross-entropy, not MSE. Cross-entropy is convex for logistic regression and gives stable training, while MSE makes optimization bumpy and unreliable with sigmoid outputs. If your training stalls or behaves oddly, check that you're using the correct loss. This single choice often decides whether learning succeeds.
- Vectorize your code for speed and clarity. Use matrix operations like X @ w to compute z for all samples at once. Compute gradients with X.T @ (y_hat - y) to avoid slow loops. Fewer lines mean fewer bugs and faster training.
- Use stable math for logs and sigmoid. Clip y_hat into [1e-15, 1 - 1e-15] before taking logs to avoid numerical errors. Implement a stable sigmoid for large |z| values to prevent overflow. Stable code saves hours of confusing debugging.
- Tune the learning rate carefully. If loss explodes or oscillates, reduce it; if training crawls, increase it slightly. Try a small set like {0.1, 0.01, 0.001} and pick the fastest that stays stable. Good learning rates speed convergence dramatically.
- Monitor training and stop when it plateaus. Track loss by epoch; if improvements become tiny for several checks, stop. Early stopping saves time and avoids overfitting to noise. It also helps compare different setups fairly.
- Scale features to help optimization. Standardize or normalize features so one doesn't dominate the gradient. This often leads to faster, smoother convergence, and is especially important when features have very different ranges.
- Set thresholds to match goals. 0.5 is a starting point, not a rule. If missing positives is costly, lower the threshold; if false alarms are costly, raise it. Threshold tuning turns raw probabilities into business-aligned actions.
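The numerical-stability advice above (an overflow-safe sigmoid and clipping before logs) might look like this in NumPy; this is one common way to do it, not the only one:

```python
import numpy as np

def stable_sigmoid(z):
    """Overflow-safe sigmoid: never exponentiates a large positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # -z <= 0 here, so exp is safe
    ez = np.exp(z[~pos])                       # z < 0 here, so exp(z) cannot overflow
    out[~pos] = ez / (1.0 + ez)
    return out

def safe_log_loss(y, y_hat, eps=1e-15):
    """Clip probabilities into [eps, 1 - eps] so log never sees 0."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```

The key trick in `stable_sigmoid` is using the algebraically equivalent form e^z / (1 + e^z) for negative z, so the exponential argument is never a large positive number on either branch.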
Glossary
Logistic Regression
A model that predicts the chance that something belongs to class 1. It uses a weighted sum of inputs and a sigmoid function to make a probability. The output is always between 0 and 1, like a percentage. It is simple, fast, and easy to understand. Its decision boundary is linear.
Binary Classification
A problem with only two possible answers: 0 or 1. The model predicts if an input is in the positive group or not. You give the model features, and it returns a probability for class 1. This is common in everyday tasks. Examples include yes/no and on/off decisions.
Feature
A measurable input to the model, like a number or a yes/no flag. Features describe the object you want to classify. The model uses them to make a decision. Good features make learning easier. Poor features make learning hard.
Weight (w)
A number that shows how important a feature is to the prediction. Positive weights push the probability up; negative weights push it down. Each feature has a matching weight. Together they form a weighted sum. We learn weights during training.
