Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 4 - LLM Training
Key Summary
- This lecture explains how we train neural networks by minimizing a loss function using optimization methods. It starts with gradient descent and stochastic gradient descent (SGD), showing how we update parameters by stepping opposite to the gradient. Mini-batches make training faster and add helpful noise that can escape bad spots in the loss landscape called local minima.
- Momentum speeds up learning by accumulating a velocity that keeps moving in useful directions. If gradients point the same way across steps, momentum takes larger strides; if gradients flip, it slows down. This helps smooth zig-zag paths and speeds convergence on long, sloped valleys.
- RMSprop adapts the learning rate per parameter using a moving average of squared gradients. Parameters with big gradients get smaller steps; parameters with small gradients get larger steps. This balances learning across the network and improves stability.
- Adam combines momentum’s first-moment tracking with RMSprop’s second-moment tracking. It also applies bias correction to fix early-step underestimation. With defaults beta1=0.9, beta2=0.999, and epsilon=1e-8, Adam generally works well with minimal tuning.
- Learning rate decay schedules make training start fast and then carefully fine-tune. Step decay reduces the learning rate by a fixed factor every set number of epochs. Exponential decay shrinks it steadily over time, while cosine annealing smoothly lowers it following a cosine curve.
- Regularization fights overfitting, which is when a model memorizes training data and fails to generalize. L2 (weight decay) penalizes large weights to keep the model simpler and more stable. L1 encourages sparsity by pushing many weights to be exactly zero.
Why This Lecture Matters
Training neural networks well is central to almost every modern AI application, from image classification to language modeling. If you understand optimizers like Momentum, RMSprop, and Adam, you can make models learn faster, more stably, and with less trial-and-error. Learning rate schedules enable you to sprint early and fine-tune late, often achieving higher accuracy without extra data.

Regularization methods such as L1, L2, and Dropout directly tackle overfitting, which is one of the biggest obstacles to real-world deployment; they help models perform reliably on new data. Batch normalization further stabilizes and speeds training, allowing you to use more aggressive learning rates and deeper architectures. This knowledge is useful for roles like machine learning engineer, data scientist, and research scientist, as well as software engineers adding ML features to products. It solves practical problems like unstable training runs, models that memorize but don’t generalize, and long tuning cycles with little progress. Applying these techniques can reduce compute costs, shorten development time, and improve system reliability.

In your projects, start with Adam, pair it with a sensible learning rate decay (cosine annealing is a strong choice), add L2 and dropout if overfitting appears, and use batch norm to improve stability. Monitor training and validation curves to guide decisions rather than guessing. Mastering these essentials strengthens your career foundation, since nearly all deep learning work depends on getting training right. In today’s industry, where models are larger and data is abundant, efficient and robust training is a competitive advantage that directly impacts performance and time-to-market.
Lecture Summary
01 Overview
This lecture teaches you how to effectively train artificial neural networks by walking through core optimization algorithms, learning rate strategies, regularization methods, and stability techniques used in practice. It begins with a short recap of gradient descent (GD) and stochastic gradient descent (SGD), which update model parameters by moving in the opposite direction of the gradient of a loss function. You will see why mini-batches are the practical way to compute gradients and how the noise they introduce can help a model escape poor local minima. Then it moves into more advanced optimizers—Momentum, RMSprop, and Adam—explaining how each one improves training speed and stability by smoothing or adapting learning steps.
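The Adam update rule described above can be sketched in a few lines of NumPy. This is an illustrative toy, not code from the lecture: `adam_step` and the quadratic objective are made-up names, but the update itself follows the standard first-moment/second-moment form with the defaults quoted later (beta1=0.9, beta2=0.999, epsilon=1e-8).

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-style first moment plus
    RMSprop-style second moment, with bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (moving mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (moving mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)              # bias correction fixes early-step underestimation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy use: minimize f(theta) = theta^2, whose gradient is 2*theta.
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
```

After a couple thousand steps, `theta` sits very close to the minimum at 0; setting beta1=0 recovers an RMSprop-like update, and dropping the second moment recovers plain momentum, which is a useful way to see how Adam combines the two.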
Next, the lecture covers learning rate decay schedules. These schedules start with a relatively large learning rate so training can make fast progress early on, and then reduce the learning rate over time to refine the solution. You’ll learn three common methods: step decay (reduce by a fixed factor every few epochs), exponential decay (shrink continuously by a constant rate per epoch), and cosine annealing (smoothly decreasing the rate following a cosine curve from a maximum to a minimum). Each method provides a different shape of decrease, and the idea is to achieve both quick initial learning and careful fine-tuning near the end.
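The three schedules can be written as small functions of the epoch number. This is a minimal sketch; the function names and default constants (`drop`, `every`, `k`) are illustrative choices, not values given in the lecture.

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Step decay: multiply the rate by `drop` every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def exponential_decay(lr0, epoch, k=0.05):
    """Exponential decay: shrink continuously at a constant rate per epoch."""
    return lr0 * math.exp(-k * epoch)

def cosine_annealing(lr_max, lr_min, epoch, total_epochs):
    """Cosine annealing: follow a cosine curve from lr_max down to lr_min."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos)
```

All three start at the same initial rate but trace different shapes: step decay drops in plateaus, exponential decay is a smooth curve with no discrete jumps, and cosine annealing decreases slowly at first, fastest in the middle, and gently flattens out at `lr_min`.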
Regularization is another big topic in this lecture. Regularization is crucial when a model might memorize the training data and not generalize well to new data, a problem known as overfitting. You’ll learn L2 regularization (also called weight decay), which encourages small weights by adding a penalty on the sum of squared weights, and L1 regularization, which encourages sparsity by penalizing the sum of absolute values of weights. The lecture also explains Dropout, a method where you randomly set some neurons’ outputs to zero during training. Dropout acts as a kind of ensemble, preventing any single neuron from dominating and improving generalization; at test time, no neurons are dropped.
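These ideas can be sketched directly. The L2 penalty below is the standard sum-of-squares term added to the loss; the dropout function uses the common "inverted dropout" variant, which rescales activations at training time so that, as the lecture notes, nothing needs to be dropped or rescaled at test time. Function names and the default `lam`/`p_drop` values are illustrative.

```python
import numpy as np

def l2_penalty(weights, lam=1e-4):
    """L2 (weight decay) term added to the loss: lam * sum of squared weights."""
    return lam * sum(np.sum(w ** 2) for w in weights)

def dropout(x, p_drop=0.5, training=True):
    """Inverted dropout: zero each activation with probability p_drop during
    training and rescale the survivors by 1/(1 - p_drop); identity at test time."""
    if not training:
        return x
    mask = (np.random.rand(*x.shape) >= p_drop) / (1.0 - p_drop)
    return x * mask
```

An L1 penalty would be the same shape with `np.sum(np.abs(w))` in place of the squared term; its non-smooth gradient is what pushes many weights to exactly zero.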
To improve training stability and speed, the lecture introduces batch normalization. Batch normalization normalizes the activations of each layer using the mean and variance computed over the current mini-batch. After normalization, it applies a learned scale (gamma) and shift (beta), typically starting at gamma=1 and beta=0. This process allows the model to stabilize intermediate values in the network, often enabling larger learning rates and faster convergence.
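The training-time forward pass of batch normalization can be sketched as below. This covers only the per-batch computation described above; a full implementation would also track running statistics for use at test time, which is omitted here.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.
    x: (batch, features); gamma, beta: (features,)."""
    mu = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance per feature
    return gamma * x_hat + beta              # learned scale and shift

# With gamma=1 and beta=0 (the typical initialization), the output is
# simply the normalized activations.
x = np.random.randn(64, 8) * 3.0 + 5.0
out = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
```

Regardless of the input's original scale and offset, each output feature ends up with mean ≈ 0 and standard deviation ≈ 1, which is what keeps intermediate values stable as depth grows.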
The lecture closes with practical, experience-based tips. Choose an architecture appropriate to the task and the amount of data: deeper and wider networks are more powerful but require more data and regularization. Preprocess inputs by normalizing or scaling features and removing obvious outliers because neural networks are sensitive to feature scales. Initialize weights properly—randomly and with methods like Xavier or He initialization—to maintain healthy signal flow. Pick a reasonable learning rate and consider adding a decay schedule. Monitor training and validation losses to detect overfitting, visualize weights to catch issues early, and always maintain a representative validation set for tuning hyperparameters like learning rate and regularization strength.
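The Xavier and He initializations mentioned above amount to drawing weights from a zero-mean Gaussian with a variance tied to the layer's fan-in (and, for Xavier, fan-out). A minimal sketch, using the common Gaussian forms of both schemes:

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot initialization: variance 2/(fan_in + fan_out),
    commonly used with tanh or sigmoid activations."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_in, fan_out) * std

def he_init(fan_in, fan_out):
    """He initialization: variance 2/fan_in, commonly used with ReLU."""
    std = np.sqrt(2.0 / fan_in)
    return np.random.randn(fan_in, fan_out) * std

W = he_init(1000, 500)
```

Both keep the scale of activations roughly constant from layer to layer, which is the "healthy signal flow" the lecture refers to; all-zero or overly large initial weights break this.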
Who is this lecture for? It is aimed at beginners to intermediate learners who have basic familiarity with machine learning and neural networks (e.g., knowing what a loss function is and what a model’s parameters are). You do not need deep math to benefit, but comfort with gradients and the idea of optimization helps. After completing the lecture, you will be able to pick and configure optimizers like Momentum, RMSprop, and Adam; design and apply learning rate decay schedules; use regularization techniques such as L1, L2, and Dropout; apply batch normalization; and follow a practical training workflow with data preprocessing, weight initialization, monitoring, and validation.
The lecture is structured as follows: it starts with a recap of basic gradient-based training and mini-batch SGD. It then introduces three optimizers—Momentum, RMSprop, and Adam—explaining their update rules and intuitions. Next come learning rate decay strategies, including step, exponential, and cosine annealing, with their formulas and roles. After that, it dives into regularization: L2, L1, and Dropout, with the effects each has on weights and generalization. Then it explains batch normalization and why it improves training stability and performance, including typical initialization of its parameters. Finally, it wraps up with a checklist of practical training tips: architecture selection, preprocessing, initialization, learning rate choice, monitoring, weight visualization, and using a proper validation set.
Key Takeaways
- ✓ Start with a clear training loop: preprocess data, initialize weights well, pick a strong optimizer (often Adam), and choose a sensible initial learning rate. Use mini-batches to balance speed and gradient quality. Track both training and validation metrics every epoch. Save the best model based on validation performance.
- ✓ Use momentum to smooth and accelerate learning, especially when gradients zig-zag. A momentum coefficient around 0.9 is a common starting point. Watch for overshooting if the learning rate is high. If training oscillates, slightly reduce the learning rate or momentum.
- ✓ Adopt RMSprop or Adam when gradients vary widely across parameters. They adapt per-parameter learning rates automatically. Set epsilon to a small positive value (e.g., 1e−8) for stability. If training is unstable, try reducing the base learning rate before changing other settings.
- ✓ Adam is a robust default: beta1=0.9, beta2=0.999, epsilon=1e−8 often work well. Begin with a base learning rate around 1e−3 and adjust based on loss curves. If loss plateaus early, try a slightly higher rate or a better schedule. If loss spikes, lower the learning rate.
- ✓ Always use a learning rate schedule: step, exponential, or cosine annealing. Cosine annealing provides smooth decay and often better final accuracy. Step decay is simple and effective for staged training. Exponential decay is a good smooth alternative if step timing is unclear.
- ✓ Apply L2 regularization (weight decay) to discourage overly large weights, improving stability and generalization. Start with small values like 1e−4 to 1e−5. If overfitting persists, increase L2 or add dropout. Balance regularization to avoid underfitting.
Glossary
Loss function
A loss function is a number that tells us how wrong the model’s predictions are compared to the true answers. Lower loss means better predictions. It’s the target we try to minimize during training. By adjusting the model’s parameters to reduce this number, we improve the model. Without it, we wouldn’t know how to change the model to get better.
Parameters (theta)
Parameters are the adjustable numbers inside a neural network, like the weights and biases of its connections. They decide how input signals are transformed as they pass through layers. During training, we change these numbers to make the loss smaller. Good parameter values lead to good predictions. Too many parameters can overfit if we lack regularization.
Gradient descent (GD)
Gradient descent is a method to reduce loss by moving parameters in the direction that lowers it the most. We compute the gradient (slope) of the loss and step opposite to it. The step size is set by the learning rate. Repeating this moves us downhill on the loss landscape. If steps are too big, we can overshoot; too small, and progress is slow.
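A tiny numeric illustration of this definition, minimizing f(θ) = θ² (gradient 2θ), shows both behaviors: a moderate learning rate walks downhill to the minimum, while a rate that is too big makes each step overshoot further than the last. The `gd` helper is illustrative, not from the lecture.

```python
def gd(theta, lr, steps):
    """Plain gradient descent on f(theta) = theta**2, whose gradient is 2*theta."""
    for _ in range(steps):
        theta = theta - lr * 2 * theta   # step opposite to the gradient
    return theta

converged = gd(5.0, lr=0.1, steps=50)   # shrinks by factor 0.8 per step -> near 0
diverged = gd(5.0, lr=1.1, steps=50)    # each step overshoots; magnitude grows
```

With lr=0.1 each step multiplies θ by (1 − 2·0.1) = 0.8, so it converges; with lr=1.1 the factor is −1.2, so the iterate flips sign and grows without bound.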
Stochastic Gradient Descent (SGD)
SGD updates parameters using the gradient computed from a small random batch of data rather than the entire dataset. This makes each step faster and adds noise that can help escape bad local minima. It’s the standard way to train large models. The learning rate still controls step size. Batch size impacts gradient noise and stability.
