Learning Rate Schedules
Key Points
- Learning rate schedules control how fast a model learns over time by changing the learning rate across iterations or epochs.
- Step decay reduces the learning rate suddenly at fixed milestones, like walking down a staircase in speed.
- Cosine annealing smoothly decreases the learning rate along a cosine curve, often yielding better convergence in practice.
- Warmup starts training with a small learning rate and gradually increases it to avoid unstable early updates.
- You can combine warmup with step decay or cosine annealing to get both stability at the start and good final convergence.
- Implementations are lightweight: computing the scheduled learning rate per step is O(1) time and space.
- Choosing schedule hyperparameters (initial LR, decay factor, step size, cosine period, warmup length) matters more than the compute cost.
- Always log and checkpoint the schedule state to ensure reproducibility and correct resumption of training.
Prerequisites
- Gradient and derivative — Learning rate scales gradient-based updates; understanding derivatives explains why step sizes matter.
- Stochastic gradient descent (SGD) — Schedules plug into SGD-like update rules as \eta_t, directly affecting parameter updates.
- Epochs, steps, and batches — Schedules are functions of time; knowing how steps relate to epochs prevents unit mismatches.
- Momentum and adaptive optimizers — Different optimizers interact differently with LR schedules; tuning must consider these interactions.
- Overfitting and generalization — Schedules are partly chosen to improve generalization via late-stage small updates.
- Floating-point precision — Very small learning rates may underflow or become numerically ineffective; awareness prevents errors.
- Basic trigonometry — Cosine annealing uses the cosine function; understanding its shape aids intuition and implementation.
- Piecewise functions — Warmup + main schedules are implemented as piecewise definitions with time offsets.
Detailed Explanation
01 Overview
Hook: Imagine trying to learn a dance. If you start too fast, you trip; if you never slow down, you never master the fine details. Learning rate schedules are your choreography for speed: start, slow down, and finish at the right tempo.

Concept: A learning rate (LR) schedule is a rule that changes the LR during training instead of keeping it constant. This matters because high LRs help you explore quickly at the beginning, while lower LRs help you carefully refine at the end. In practice, schedules are simple functions of the training step or epoch number that return the LR to use now. Popular choices include step decay (drop the LR at milestones), cosine annealing (smoothly curve the LR down), and warmup (ramp up from a small value to the target LR early on).

Example: Suppose you train for 90 epochs. You might start with LR 0.1 and drop it by a factor of 10 at epochs 30 and 60 (step decay), or instead slide it smoothly from 0.1 down to 0.0001 by the end (cosine annealing). If your model is unstable at the start, you add 5 epochs of warmup from 0 to 0.1 before applying the decay. By adjusting only a few numbers, you can often train faster, converge better, and reduce sensitivity to other hyperparameters.
02 Intuition & Analogies
Hook: Think of boiling pasta. You crank the heat to get the water boiling (fast progress), but once it's rolling, you lower the flame so the pot doesn't overflow (stable refinement).

Concept: The learning rate is like the heat under the pot, and a schedule is the thermostat program: how you adjust the heat over time. A high LR early means big, bold steps that cover ground quickly; a low LR late means tiny, precise steps that don't overshoot.

- Step decay: like descending a staircase. You walk quickly along a landing (constant LR), then deliberately step down to a lower level (a sudden LR drop) to be more careful.
- Cosine annealing: like easing off the gas pedal smoothly as you approach a stop sign; the velocity follows a gentle curve so you don't jerk the car.
- Warmup: like lifting weights. You don't start with your max; you warm up gradually to avoid injury. Training behaves similarly because gradients can be volatile at the start.

Example: With warmup + cosine, you gently increase speed for a short time (warmup), cruise briefly, and then smoothly decelerate along a cosine path to a near-stop (tiny LR). The combined effect is fewer early crashes (numerical instability or divergence) and better late-stage polishing (improved generalization), much like a well-timed drive that avoids sudden braking or acceleration.
03 Formal Definition
A learning rate schedule is a function \eta : \{0, 1, 2, \dots\} \to \mathbb{R}_{>0} that maps the training step (or epoch) t to the step size \eta_t used in the update \theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t). A constant LR is the trivial schedule \eta_t = \eta_0; step decay, cosine annealing, and warmup are specific choices of this function, given in Key Formulas below.
04 When to Use
Use step decay when you want simplicity and predictable behavior, especially for problems where validation accuracy improves in phases (e.g., classical image classification with known milestone epochs). It works well when domain practice has established good milestones and decay factors.

Choose cosine annealing when you prefer smooth transitions that can lead to better minima and fewer oscillations, especially in modern deep networks. It is robust when you are unsure about exact decay times but know the total budget (epochs or steps).

Apply warmup whenever early training is unstable: very deep networks, large batch sizes, layer norm statistics that are not yet stable, or aggressive initial LRs. Warmup prevents large, destructive updates at the start.

Combine warmup + cosine as a safe, effective default in many settings; add warm restarts (cosine cycles) if you want periodic exploration that can escape sharp minima.

If training is short and the compute budget is fixed, define the schedule over the total number of steps (linear or cosine) so the LR reaches its floor by the end. If training is long and you checkpoint/resume, make the schedule step-based (not epoch-based) or carefully restore the schedule state.
⚠️ Common Mistakes
- Dropping the LR too soon or too much: Over-aggressive decay can stall learning. Validate with a learning rate range test or tune decay factors (e.g., \gamma = 0.1 vs. 0.2).
- Using wrong units: Mixing epochs and steps causes off-by-factor errors. Decide if t counts steps or epochs and stay consistent.
- Forgetting warmup for large batches: Large batches need smaller effective LR at the start. Without warmup, losses may explode or plateau.
- Not restoring schedule state on resume: Schedules with milestones, cosine cycles, or warmup need the correct t on resume; otherwise, LR jumps unexpectedly.
- Clipping LR below machine precision: If \eta_{\min} is too tiny, updates become no-ops. Set a reasonable floor (e.g., 1e-6 to 1e-8) or stop decay.
- Ignoring optimizer interaction: Adaptive optimizers (Adam/AdamW) already scale updates; too aggressive decay can double-dampen. Tune base LR and schedule accordingly.
- Hard steps causing instability with momentum: Step decay paired with high momentum can cause temporary overshoot. Consider smoother decay or reduce momentum near drops.
- Overfitting to a schedule: If validation improves only right after LR drops, you might be overfitting to milestone timing. Prefer smoother schedules or auto-tuned schedulers when possible.
Key Formulas
SGD Update
\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t)
Explanation: Parameters move opposite the gradient, scaled by the current learning rate. The schedule determines \eta_t at each step t.
Step Decay
\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor}
Explanation: The learning rate drops by a factor \gamma every s steps (or epochs). The floor function counts how many drops have occurred.
Milestone Step Decay
\eta_t = \eta_0 \cdot \gamma^{\sum_i \mathbb{1}[t \ge m_i]}
Explanation: The rate decays by \gamma each time t crosses a milestone m_i. The indicator sum counts how many milestones have been passed.
Cosine Annealing
\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T_{\max}}\right)
Explanation: The learning rate follows a half-cosine from \eta_{\max} at t = 0 to \eta_{\min} at t = T_{\max}. This produces a smooth, jerk-free decay.
Linear Warmup
\eta_t = \eta_{\text{start}} + \frac{t}{T_{\text{warm}}}(\eta_{\text{target}} - \eta_{\text{start}}), \quad 0 \le t < T_{\text{warm}}
Explanation: During warmup, the LR increases linearly from a small start value to the target LR. After warmup, switch to the main schedule.
Warmup + Main Schedule (Piecewise)
\eta_t = \begin{cases} \eta^{\text{warm}}_t & t < T_{\text{warm}} \\ \eta^{\text{main}}_{t - T_{\text{warm}}} & t \ge T_{\text{warm}} \end{cases}
Explanation: Compose warmup and a main schedule by using warmup first, then feeding the offset time t - T_{\text{warm}} into the main schedule. This is a common implementation pattern.
Cosine With Warm Restarts
\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t'}{T_i}\right), \quad t' = t - \sum_{j < i} T_j
Explanation: Time is partitioned into cycles of lengths T_i; within each cycle the LR follows a cosine, then restarts to \eta_{\max}. This can help escape sharp minima.
Exponential Decay
\eta_t = \eta_0 \, e^{-\lambda t}
Explanation: The learning rate decays continuously and smoothly at a rate controlled by \lambda. Useful when you want a simple monotone decrease without steps.
Linear Decay
\eta_t = \eta_0 \left(1 - \frac{t}{T}\right)
Explanation: The LR decreases linearly from \eta_0 to 0 over T steps, reaching zero exactly at step T. It is simple and predictable.
Polynomial Decay
\eta_t = \eta_0 \left(1 - \frac{t}{T}\right)^p
Explanation: Generalizes linear decay (p = 1). The exponent p controls curvature: p > 1 front-loads the decay (fast early, flat late), while p < 1 back-loads it. Choose p based on the desired shape.
Inverse Time Decay
\eta_t = \frac{\eta_0}{1 + \lambda t}
Explanation: The LR decreases inversely with time, giving a long tail of small but nonzero steps. This form is common in convex optimization theory.
Complexity Analysis
Evaluating any of the schedules above at step t is O(1) time and O(1) space: each is a closed-form function of t. Warm restarts with k cycles needs at most O(k) work to locate the current cycle, or O(1) with a precomputed prefix sum of cycle lengths. This cost is negligible next to a gradient step; as the Key Points note, the real cost lies in tuning the schedule's hyperparameters, not in computing it.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Step Decay scheduler: eta_t = eta0 * gamma^{floor(t / step_size)}
struct StepDecay {
    double eta0;      // initial learning rate
    double gamma;     // decay factor (0 < gamma < 1)
    int step_size;    // drop every step_size steps
    StepDecay(double eta0_, double gamma_, int step_size_)
        : eta0(eta0_), gamma(gamma_), step_size(step_size_) {}
    double operator()(int t) const {
        int k = t / step_size;  // number of decays so far
        // pow with a small integer exponent is fine; precompute if desired
        return eta0 * pow(gamma, k);
    }
};

// Minimize f(x) = x^2 with gradient 2x using SGD-like updates
int main() {
    const double x_init = 10.0;     // start far from the optimum (0)
    const int T = 100;              // total steps
    StepDecay sched(0.2, 0.1, 30);  // eta0=0.2, drop x10 every 30 steps

    double x = x_init;
    for (int t = 0; t < T; ++t) {
        double lr = sched(t);
        double grad = 2.0 * x;  // derivative of x^2
        x -= lr * grad;         // SGD update: x := x - lr * grad
        if (t % 10 == 0) {
            cout << "t=" << t << ", lr=" << lr << ", x=" << x
                 << ", f(x)=" << x * x << "\n";
        }
    }
    cout << "Final x=" << x << ", f(x)=" << x * x << "\n";
    return 0;
}
```
This program uses a step decay learning rate to minimize the simple function f(x)=x^2. The learning rate starts at 0.2 and drops by a factor of 10 every 30 steps. You can see faster progress early on and finer adjustments after each drop, mimicking common deep learning training regimens where performance jumps occur after milestone decays.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct LinearWarmup {
    double eta_start, eta_target;
    int T_warm;
    LinearWarmup(double s, double t, int Tw)
        : eta_start(s), eta_target(t), T_warm(Tw) {}
    double operator()(int t) const {
        if (T_warm <= 0 || t >= T_warm) return eta_target;
        double alpha = static_cast<double>(t) / T_warm;
        return eta_start + alpha * (eta_target - eta_start);
    }
};

struct CosineAnnealing {
    double eta_max, eta_min;
    int T_max;
    CosineAnnealing(double mx, double mn, int T)
        : eta_max(mx), eta_min(mn), T_max(T) {}
    double operator()(int t) const {
        int tt = min(t, T_max);
        double cosv = cos(M_PI * (static_cast<double>(tt) / max(1, T_max)));
        return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosv);
    }
};

// Compose: use warmup for the first T_warm steps, then cosine (with time offset)
struct WarmupThenCosine {
    LinearWarmup warm;
    CosineAnnealing anneal;  // named `anneal` to avoid shadowing std::cos
    int T_warm;
    WarmupThenCosine(LinearWarmup w, CosineAnnealing c)
        : warm(w), anneal(c), T_warm(w.T_warm) {}
    double operator()(int t) const {
        if (t < T_warm) return warm(t);
        return anneal(t - T_warm);  // offset so the cosine starts at its peak
    }
};

int main() {
    // Warm up to 0.1 in 20 steps, then cosine down to 1e-4 over 200 steps
    WarmupThenCosine sched(
        LinearWarmup(0.0, 0.1, 20),
        CosineAnnealing(0.1, 1e-4, 200)
    );

    double x = 10.0;  // minimize f(x) = x^2
    const int T_total = 250;
    for (int t = 0; t < T_total; ++t) {
        double lr = sched(t);
        double grad = 2.0 * x;
        x -= lr * grad;
        if (t % 25 == 0) {
            cout << "t=" << t << ", lr=" << lr << ", x=" << x
                 << ", f(x)=" << x * x << "\n";
        }
    }
    cout << "Final x=" << x << ", f(x)=" << x * x << "\n";
    return 0;
}
```
This example composes a linear warmup with cosine annealing. The LR ramps from 0 to 0.1 over 20 steps, then smoothly decays toward 1e-4 following a cosine curve over 200 steps. The warmup stabilizes early updates; the cosine provides a smooth, effective decay for later fine-tuning.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Cosine with warm restarts: cycles of length T_i; within each cycle the LR
// follows a cosine from eta_max down to eta_min, then restarts.
struct CosineWithRestarts {
    double eta_max, eta_min;
    vector<int> cycle_lengths;  // e.g., {50, 50, 100} or geometric growth
    CosineWithRestarts(double mx, double mn, vector<int> L)
        : eta_max(mx), eta_min(mn), cycle_lengths(move(L)) {}
    double operator()(int t) const {
        int acc = 0;
        for (size_t i = 0; i < cycle_lengths.size(); ++i) {
            int Ti = cycle_lengths[i];
            if (t < acc + Ti) {
                int tprime = t - acc;  // position within the current cycle
                double cosv = cos(M_PI * (static_cast<double>(tprime) / max(1, Ti)));
                return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosv);
            }
            acc += Ti;
        }
        // Beyond the provided cycles, stay at eta_min (or repeat the last cycle)
        return eta_min;
    }
};

int main() {
    CosineWithRestarts sched(0.1, 1e-4, {50, 50, 100});  // three cycles

    // Print the LR curve to visualize the cycles
    for (int t = 0; t < 220; ++t) {
        double lr = sched(t);
        if (t % 10 == 0) {
            cout << "t=" << t << ", lr=" << lr << "\n";
        }
    }
    return 0;
}
```
This demo shows cosine annealing with warm restarts. The learning rate follows a cosine within each cycle and then restarts to the maximum at the next cycle boundary. This can re-expand the step size periodically to escape sharp minima while still decaying within each cycle.