How I Study AI - Learn AI Papers & Lectures the Easy Way
Algorithm · Intermediate

Learning Rate Schedules

Key Points

  • Learning rate schedules control how fast a model learns over time by changing the learning rate across iterations or epochs.
  • Step decay reduces the learning rate suddenly at fixed milestones, like walking down a staircase in speed.
  • Cosine annealing smoothly decreases the learning rate using a cosine curve, often yielding better convergence in practice.
  • Warmup starts training with a small learning rate and gradually increases it to avoid unstable early updates.
  • You can combine warmup with step decay or cosine annealing to get both stability at the start and good final convergence.
  • Implementations are lightweight: computing the scheduled learning rate per step is O(1) time and space.
  • Choosing schedule hyperparameters (initial LR, decay factor, step size, cosine period, warmup length) matters more than the compute cost.
  • Always log and checkpoint the schedule state to ensure reproducibility and correct resumption of training.

Prerequisites

  • Gradient and derivative — Learning rate scales gradient-based updates; understanding derivatives explains why step sizes matter.
  • Stochastic gradient descent (SGD) — Schedules plug into SGD-like update rules as \( \eta_t \), directly affecting parameter updates.
  • Epochs, steps, and batches — Schedules are functions of time; knowing how steps relate to epochs prevents unit mismatches.
  • Momentum and adaptive optimizers — Different optimizers interact differently with LR schedules; tuning must consider these interactions.
  • Overfitting and generalization — Schedules are partly chosen to improve generalization via late-stage small updates.
  • Floating-point precision — Very small learning rates may underflow or become numerically ineffective; awareness prevents errors.
  • Basic trigonometry — Cosine annealing uses the cosine function; understanding its shape aids intuition and implementation.
  • Piecewise functions — Warmup + main schedules are implemented as piecewise definitions with time offsets.

Detailed Explanation


01 Overview

Hook: Imagine trying to learn a dance. If you start too fast, you trip; if you never slow down, you never master the fine details. Learning rate schedules are your choreography for speed: start, slow down, and finish at the right tempo.

Concept: A learning rate (LR) schedule is a rule that changes the LR during training instead of keeping it constant. This matters because high LRs help you explore quickly at the beginning, while lower LRs help you carefully refine at the end. In practice, schedules are simple functions of the training step or epoch number that return the LR to use now. Popular choices include step decay (drop the LR at milestones), cosine annealing (smoothly curve the LR down), and warmup (ramp up from a small LR to the target LR early on).

Example: Suppose you train for 90 epochs. You might start with LR 0.1 and drop it by a factor of 10 at epochs 30 and 60 (step decay), or instead slide it smoothly from 0.1 down to 0.0001 along a cosine curve by the end. If your model is unstable at the start, add 5 epochs of warmup from 0 to 0.1 before applying the decay. By adjusting only a few numbers, you can often train faster, converge better, and reduce sensitivity to other hyperparameters.

02 Intuition & Analogies

Hook: Think of boiling pasta. You crank the heat to get the water boiling (fast progress), but once it’s rolling, you lower the flame so the pot doesn’t overflow (stable refinement).

Concept: The learning rate is like the heat under the pot, and a schedule is the thermostat program for tweaking the heat over time. A high LR early means big, bold steps that cover ground quickly; a low LR late means tiny, precise steps that don’t overshoot.

Analogy 1 (Step Decay): It’s like descending a staircase: walk fast on a landing (constant LR), then intentionally step down to a lower level (sudden LR drop) to be more careful.

Analogy 2 (Cosine Annealing): Think of easing off the gas pedal smoothly as you approach a stop sign; the velocity follows a gentle curve (cosine) so you don’t jerk the car.

Analogy 3 (Warmup): When lifting weights, you don’t start with your max; you warm up gradually to avoid injury. Training behaves similarly because gradients can be volatile at the start.

Example: With warmup + cosine, you gently increase speed for a short time (warmup), cruise briefly, and then smoothly decelerate along a cosine path to a near-stop (tiny LR). The combined effect is fewer early crashes (numerical instability or divergence) and better late-stage polishing (improved generalization), much like a well-timed drive that avoids sudden braking or acceleration.

03 Formal Definition

Let \( t \) denote the discrete training time index (step or epoch). A learning rate schedule is a function \( \eta_t : \mathbb{N} \to \mathbb{R}^+ \) that returns the LR at time \( t \). Parameter updates (e.g., SGD) follow \( w_{t+1} = w_t - \eta_t \, \nabla L(w_t; \xi_t) \), where \( \xi_t \) is the sampled minibatch. Step decay defines a base learning rate \( \eta_0 \), a decay factor \( \gamma \in (0,1) \), and a step size \( s \) (in steps or epochs), producing \( \eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor} \). Milestone-based variants use a set of integers \( \{m_i\} \) and \( \eta_t = \eta_0 \cdot \gamma^{\sum_i \mathbf{1}\{t \ge m_i\}} \). Cosine annealing defines upper and lower bounds \( \eta_{\max}, \eta_{\min} \) and a horizon \( T_{\max} \), with \( \eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\big(1 + \cos(\pi t / T_{\max})\big) \) for \( 0 \le t \le T_{\max} \). Warmup is a short initial phase of length \( T_{\mathrm{warm}} \) that ramps \( \eta_t \) from \( \eta_{\mathrm{start}} \) to a target (often \( \eta_0 \)) via a simple rule (commonly linear). Schedules can be composed piecewise: use warmup for \( t \le T_{\mathrm{warm}} \) and then switch to step decay or cosine for \( t > T_{\mathrm{warm}} \). Variants like cosine with warm restarts partition time into cycles and reset \( t \) at cycle boundaries to re-expand the learning rate periodically.

04 When to Use

Use step decay when you want simplicity and predictable behavior, especially for problems where validation accuracy improves in phases (e.g., classical image classification training with known milestone epochs). It works well when domain practice has established good milestones and decay factors.

Choose cosine annealing when you prefer smooth transitions that can lead to better minima and fewer oscillations, especially in modern deep networks. It is robust when you are unsure about exact decay times but know the total budget (epochs or steps).

Apply warmup whenever early training is unstable: very deep networks, large batch sizes, layer norm statistics not yet stable, or aggressive initial LRs. Warmup prevents large destructive updates at the start.

Combine warmup + cosine for a safe, effective default in many settings; add warm restarts (cosine cycles) if you want periodic exploration that can escape sharp minima.

If training is short and the compute budget is fixed, use schedules defined over the total steps (linear or cosine) so the LR reaches its floor by the end. If training is long and you checkpoint/resume, ensure your schedule is step-based (not epoch-based) or that you carefully restore the schedule state.

⚠️Common Mistakes

  • Dropping the LR too soon or too much: Over-aggressive decay can stall learning. Validate with a learning rate range test or tune decay factors (e.g., \( \gamma = 0.1 \) vs. \( 0.2 \)).
  • Using wrong units: Mixing epochs and steps causes off-by-factor errors. Decide if t counts steps or epochs and stay consistent.
  • Forgetting warmup for large batches: Large batches need smaller effective LR at the start. Without warmup, losses may explode or plateau.
  • Not restoring schedule state on resume: Schedules with milestones, cosine cycles, or warmup need the correct t on resume; otherwise, LR jumps unexpectedly.
  • Clipping the LR below machine precision: If \( \eta_{\min} \) is too tiny, updates become no-ops. Set a reasonable floor (e.g., 1e-6 to 1e-8) or stop the decay.
  • Ignoring optimizer interaction: Adaptive optimizers (Adam/AdamW) already scale updates; too aggressive decay can double-dampen. Tune base LR and schedule accordingly.
  • Hard steps causing instability with momentum: Step decay paired with high momentum can cause temporary overshoot. Consider smoother decay or reduce momentum near drops.
  • Overfitting to a schedule: If validation improves only right after LR drops, you might be overfitting to milestone timing. Prefer smoother schedules or auto-tuned schedulers when possible.

Key Formulas

SGD Update

\( w_{t+1} = w_t - \eta_t \nabla L(w_t; \xi_t) \)

Explanation: Parameters move opposite the gradient, scaled by the current learning rate. The schedule determines \( \eta_t \) at each step \( t \).

Step Decay

\( \eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor} \)

Explanation: The learning rate drops by a factor \( \gamma \) every \( s \) steps (or epochs). The floor function counts how many drops have occurred.

Milestone Step Decay

\( \eta_t = \eta_0 \cdot \gamma^{\sum_{i=1}^{k} \mathbf{1}\{t \ge m_i\}} \)

Explanation: The rate decays by \( \gamma \) each time \( t \) crosses a milestone \( m_i \). The indicator counts how many milestones have been passed.

Cosine Annealing

\( \eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\big(1 + \cos(\pi t / T_{\max})\big) \)

Explanation: The learning rate follows a half-cosine from \( \eta_{\max} \) at \( t = 0 \) to \( \eta_{\min} \) at \( t = T_{\max} \). This produces a smooth, jerk-free decay.

Linear Warmup

\( \eta_t = \eta_{\mathrm{start}} + \frac{t}{T_{\mathrm{warm}}}(\eta_{\mathrm{target}} - \eta_{\mathrm{start}}), \quad 0 \le t \le T_{\mathrm{warm}} \)

Explanation: During warmup, increase the LR linearly from a small start to the target LR. After warmup, switch to the main schedule.

Warmup + Main Schedule (Piecewise)

\( \eta_t = \begin{cases} \eta_{\mathrm{warm}}(t), & t \le T_{\mathrm{warm}} \\ \eta_{\mathrm{main}}(t - T_{\mathrm{warm}}), & t > T_{\mathrm{warm}} \end{cases} \)

Explanation: Compose warmup and a main schedule by using warmup first, then feeding the offset time into the main schedule. This is a common implementation pattern.

Cosine With Warm Restarts

\( \eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\big(1 + \cos(\pi t' / T_i)\big), \quad t' = t - \sum_{j<i} T_j \)

Explanation: Time is partitioned into cycles of lengths \( T_i \); within each cycle the LR follows a cosine, then restarts to \( \eta_{\max} \). This can help escape sharp minima.

Exponential Decay

\( \eta_t = \eta_0 e^{-\lambda t} \)

Explanation: The learning rate decays continuously and smoothly at a rate controlled by \( \lambda \). Useful when you want a simple monotone decrease without steps.

Linear Decay

\( \eta_t = \eta_0 \left(1 - \frac{t}{T}\right) \)

Explanation: The LR decreases linearly from \( \eta_0 \) to 0 over \( T \) steps. It is simple and reaches zero exactly at step \( T \).

Polynomial Decay

\( \eta_t = \eta_0 \left(1 - \frac{t}{T}\right)^p \)

Explanation: Generalizes linear decay; \( p > 1 \) makes the LR drop faster at first and flatten out near the end, while \( p < 1 \) drops slowly at first and quickly at the end. Choose \( p \) based on the desired curvature.

Inverse Time Decay

\( \eta_t = \frac{\eta_0}{1 + \lambda t} \)

Explanation: The LR decreases inversely with time, offering a long tail of small but nonzero steps. It is common in convex optimization theory.

Complexity Analysis

Computing a learning rate from a schedule is constant-time and constant-space per step. For step decay, we evaluate a floor division and a power with a small integer exponent (often implemented as repeated multiplication or precomputed), which is O(1) time and O(1) space. Cosine annealing requires a cosine evaluation and a few arithmetic operations, still O(1) per step with negligible extra memory. Linear warmup is a simple affine computation, also O(1). Over T training steps, the total overhead to produce all scheduled learning rates is O(T) time and O(1) space beyond the optimizer state. In practice, the computational cost of schedules is tiny compared to forward/backward passes in neural networks.

However, there are hidden bookkeeping considerations. First, schedules defined in epochs vs. steps must map correctly; converting epochs to steps multiplies T by steps-per-epoch but does not change the asymptotic complexity. Second, milestone lookups should be O(1) or O(log M), where M is the number of milestones; storing milestones in a sorted vector and advancing a pointer as t increases achieves amortized O(1). Cosine with warm restarts requires determining the current cycle; maintaining the cycle index and the remaining steps per cycle keeps this O(1). Finally, checkpoints should store the current time index and any cycle counters so training can resume without recomputing history. Overall, schedule evaluation does not bottleneck training; the main concerns are numerical stability and correct bookkeeping rather than runtime complexity.

Code Examples

Step Decay SGD on a 1D Quadratic (toy optimization)
#include <bits/stdc++.h>
using namespace std;

// Step Decay scheduler: eta_t = eta0 * gamma^{floor(t / step_size)}
struct StepDecay {
    double eta0;    // initial learning rate
    double gamma;   // decay factor (0 < gamma < 1)
    int step_size;  // drop every step_size steps
    StepDecay(double eta0_, double gamma_, int step_size_)
        : eta0(eta0_), gamma(gamma_), step_size(step_size_) {}
    double operator()(int t) const {
        int k = t / step_size;  // number of decays so far
        // pow with a small integer exponent is fine; precompute if desired
        return eta0 * pow(gamma, k);
    }
};

// Minimize f(x) = x^2 with gradient 2x using SGD-like updates
int main() {
    // Hyperparameters
    const double x_init = 10.0;     // start far from the optimum (0)
    const int T = 100;              // total steps
    StepDecay sched(0.2, 0.1, 30);  // eta0=0.2, drop x10 every 30 steps

    double x = x_init;
    for (int t = 0; t < T; ++t) {
        double lr = sched(t);
        double grad = 2.0 * x;  // derivative of x^2
        x -= lr * grad;         // SGD update: x := x - lr * grad
        if (t % 10 == 0) {
            cout << "t=" << t << ", lr=" << lr << ", x=" << x
                 << ", f(x)=" << x * x << "\n";
        }
    }
    cout << "Final x=" << x << ", f(x)=" << x * x << "\n";
    return 0;
}

This program uses a step decay learning rate to minimize the simple function f(x)=x^2. The learning rate starts at 0.2 and drops by a factor of 10 every 30 steps. You can see faster progress early on and finer adjustments after each drop, mimicking common deep learning training regimens where performance jumps occur after milestone decays.

Time: O(T) over T steps (each step is O(1)). Space: O(1).
Cosine Annealing with Linear Warmup (toy optimization)
#include <bits/stdc++.h>
using namespace std;

struct LinearWarmup {
    double eta_start, eta_target;
    int T_warm;
    LinearWarmup(double s, double t, int Tw)
        : eta_start(s), eta_target(t), T_warm(Tw) {}
    double operator()(int t) const {
        if (T_warm <= 0) return eta_target;
        if (t >= T_warm) return eta_target;
        double alpha = static_cast<double>(t) / max(1, T_warm);
        return eta_start + alpha * (eta_target - eta_start);
    }
};

struct CosineAnnealing {
    double eta_max, eta_min;
    int T_max;
    CosineAnnealing(double mx, double mn, int T)
        : eta_max(mx), eta_min(mn), T_max(T) {}
    double operator()(int t) const {
        int tt = min(t, T_max);  // clamp so the LR stays at eta_min past T_max
        double cosv = cos(M_PI * (static_cast<double>(tt) / max(1, T_max)));
        return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosv);
    }
};

// Compose: use warmup for the first T_warm steps, then cosine (with time offset).
// The member is named `cosine` (not `cos`) to avoid shadowing std::cos.
struct WarmupThenCosine {
    LinearWarmup warm;
    CosineAnnealing cosine;
    int T_warm;
    WarmupThenCosine(LinearWarmup w, CosineAnnealing c)
        : warm(w), cosine(c), T_warm(w.T_warm) {}
    double operator()(int t) const {
        if (t <= T_warm) return warm(t);
        return cosine(t - T_warm);  // feed the offset time into the main schedule
    }
};

int main() {
    // Settings: warm up to 0.1 in 20 steps, then cosine down to 1e-4 over 200 steps
    WarmupThenCosine sched(
        LinearWarmup(0.0, 0.1, 20),
        CosineAnnealing(0.1, 1e-4, 200)
    );

    double x = 10.0;  // minimize f(x) = x^2
    const int T_total = 250;
    for (int t = 0; t < T_total; ++t) {
        double lr = sched(t);
        double grad = 2.0 * x;
        x -= lr * grad;
        if (t % 25 == 0) {
            cout << "t=" << t << ", lr=" << lr << ", x=" << x
                 << ", f(x)=" << x * x << "\n";
        }
    }
    cout << "Final x=" << x << ", f(x)=" << x * x << "\n";
    return 0;
}

This example composes a linear warmup with cosine annealing. The LR ramps from 0 to 0.1 over 20 steps, then smoothly decays toward 1e-4 following a cosine curve over 200 steps. The warmup stabilizes early updates; the cosine provides a smooth, effective decay for later fine-tuning.

Time: O(T) over T steps (each step computes a cosine and a few arithmetic ops). Space: O(1).
Cosine Annealing With Warm Restarts (SGDR-style) scheduler demo
#include <bits/stdc++.h>
using namespace std;

// Cosine with warm restarts: cycles of length T_i; within a cycle use cosine
// from eta_max down to eta_min, then restart at the next cycle boundary.
struct CosineWithRestarts {
    double eta_max, eta_min;
    vector<int> cycle_lengths;  // e.g., {50, 50, 100} or geometric growth
    CosineWithRestarts(double mx, double mn, vector<int> L)
        : eta_max(mx), eta_min(mn), cycle_lengths(move(L)) {}
    double operator()(int t) const {
        int acc = 0;
        for (size_t i = 0; i < cycle_lengths.size(); ++i) {
            int Ti = cycle_lengths[i];
            if (t < acc + Ti) {
                int tprime = t - acc;  // time since the current cycle began
                double cosv = cos(M_PI * (static_cast<double>(tprime) / max(1, Ti)));
                return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosv);
            }
            acc += Ti;
        }
        // Beyond the provided cycles, stay at eta_min (or repeat the last cycle)
        return eta_min;
    }
};

int main() {
    CosineWithRestarts sched(0.1, 1e-4, {50, 50, 100});  // three cycles

    // Print the LR curve to visualize the cycles
    for (int t = 0; t < 220; ++t) {
        double lr = sched(t);
        if (t % 10 == 0) {
            cout << "t=" << t << ", lr=" << lr << "\n";
        }
    }
    return 0;
}

This demo shows cosine annealing with warm restarts. The learning rate follows a cosine within each cycle and then restarts to the maximum at the next cycle boundary. This can re-expand the step size periodically to escape sharp minima while still decaying within each cycle.

Time: O(T + C) to print T steps with C cycles (each step is O(1) if you track the current cycle; the naive loop shown is O(C) per query but can be optimized). Space: O(C) to store the cycle lengths.
#learning rate schedules · #step decay · #cosine annealing · #warmup · #sgdr · #cosine restarts · #linear warmup · #polynomial decay · #exponential decay · #stochastic gradient descent · #optimizer hyperparameters · #training stability · #deep learning · #epoch vs step · #schedule composition