Lion Optimizer
Key Points
- Lion (Evolved Sign Momentum) is a first-order, sign-based optimizer discovered through automated program search.
- It updates parameters using only the sign of an exponential moving average of gradients instead of their magnitudes.
- Lion keeps just one momentum buffer, so it uses substantially less memory than Adam while retaining strong empirical performance.
- Each step uses two momentum coefficients (\beta_1 for the interpolated update direction, \beta_2 for refreshing the stored momentum) and applies decoupled weight decay like AdamW.
- Because it steps in sign directions, Lion is largely insensitive to the scale of gradient magnitudes and can be robust to noisy training.
- Getting the order of the momentum updates wrong, or mixing coupled L2 regularization with decoupled weight decay, breaks the algorithm's behavior.
- Per iteration, Lion has O(d) time and O(d) memory, where d is the number of parameters, similar to SGD with momentum.
- Lion is a good default for large models when memory is tight but you still want adaptive-like stability without second-moment state.
Prerequisites
- Gradient Descent — Lion is a variant of gradient-based optimization and builds on the idea of using gradients to update parameters.
- Momentum (Exponential Moving Average) — Understanding EMA and how it stabilizes noisy gradients is essential to Lion’s sign-of-momentum update.
- Adam/AdamW Basics — Comparing Lion to Adam/AdamW clarifies the differences in state, adaptation, and decoupled weight decay.
- Vectorized Elementwise Operations — Lion applies sign and scaling elementwise to parameter vectors.
- Regularization and Weight Decay — Lion employs decoupled weight decay; knowing how this differs from L2 regularization prevents implementation errors.
Detailed Explanation
01 Overview
Lion is a modern optimization algorithm for training machine learning models that was discovered through program search rather than hand-designed. Unlike Adam, which rescales steps by an estimate of gradient variance, Lion largely ignores gradient magnitudes and instead steps in the direction of the sign of a smoothed gradient (a momentum term). This makes the algorithm simpler and more memory-efficient because it keeps only a single first-moment buffer rather than both first and second moments. A typical Lion step interpolates between the stored momentum and the current gradient (coefficient \beta_1) to get an update direction, applies decoupled weight decay, moves parameters by the learning rate times the elementwise sign of that interpolation, and then updates the stored momentum from its previous value with a second coefficient \beta_2. In practice, Lion has shown strong empirical results on large-scale deep learning tasks (e.g., vision transformers and language models), while having a smaller memory footprint than Adam or AdamW. Conceptually, it sits between SGD with momentum (which uses raw momentum magnitudes) and signSGD (which uses only gradient signs): it takes the sign of momentum, a smoothed direction that tends to be more stable than raw gradient signs.
02 Intuition & Analogies
Imagine hiking to the bottom of a valley on a foggy day. You can feel which way is downhill (the direction), but your altimeter is noisy, so you don’t trust exactly how steep it is. If you only follow the direction of the slope—without overreacting to how intense the slope seems—you’ll still keep moving toward the valley, but in a more stable, less jittery way. That’s Lion: it trusts the direction (the sign) more than the magnitude. However, to avoid zig-zagging because of noise, Lion first smooths the perceived slope over time using momentum. Think of momentum as averaging your last few compass readings to get a more reliable direction. By stepping in the sign of this averaged direction, you avoid being fooled by sudden gusts (noisy gradients). The two momentum coefficients in Lion act like two dials: one determines the direction you step right now (short-term smoothing), and the other sets up your memory for the next step (slightly longer-term smoothing). Decoupled weight decay is like a gentle elastic band pulling you toward the origin independently of the gradient signal—this keeps your parameters from drifting too far, improving generalization. Because Lion doesn’t rely on second-moment estimates (variance of gradients), it saves memory and avoids some of the over-smoothing that can happen with Adam’s adaptive denominator. The trade-off is that you lose per-parameter scaling, but the sign-based step can be surprisingly robust across layers and parameter scales.
03 Formal Definition
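Written out, following the published Lion pseudocode, one step at iteration t (learning rate \eta_t, decoupled weight decay \lambda) is:

\begin{aligned}
c_t &= \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t &&\text{(interpolated update direction)}\\
w_t &= w_{t-1} - \eta_t \bigl(\operatorname{sign}(c_t) + \lambda\, w_{t-1}\bigr) &&\text{(sign step with decoupled decay)}\\
m_t &= \beta_2\, m_{t-1} + (1 - \beta_2)\, g_t &&\text{(stored momentum update)}
\end{aligned}

with sign applied elementwise and typical defaults \beta_1 = 0.9, \beta_2 = 0.99. Note that c_t is used only for the step direction; the stored momentum m_t is computed from m_{t-1}, not from c_t.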
04 When to Use
Use Lion when memory is constrained but you still want an optimizer that is more robust than vanilla SGD with momentum. It works well for large-scale deep networks (e.g., vision transformers, CNNs, and language models) where Adam/AdamW are commonly used, but you want to reduce optimizer state from two buffers (m and v) to one buffer (m). Lion is particularly helpful when gradient magnitudes are noisy or vary widely across layers—its sign-based update dampens sensitivity to such scale differences. If you rely on mixed precision (FP16/BF16) training, Lion’s reduced state can also improve throughput and memory headroom for larger batch sizes or models. Prefer Lion over Adam when you suspect second-moment adaptation is over-smoothing updates or causing sluggish convergence late in training. Prefer Adam/AdamW when per-parameter adaptive scaling is crucial (e.g., very sparse gradients with highly uneven curvature) or when you need established convergence guarantees under certain conditions. In convex or well-conditioned problems, SGD with momentum might suffice; Lion often shines in complex, high-dimensional, non-convex settings where robust directional updates and lower memory are attractive.
⚠️ Common Mistakes
- Getting the order of operations wrong. In Lion, you compute the interpolated momentum c_t with \beta_{1}, then perform decoupled weight decay and the sign-based parameter update, and only then refresh the stored momentum from m_{t-1} with \beta_{2}. Swapping these steps, or overwriting the momentum buffer with the interpolation, changes the algorithm's behavior.
- Using coupled L2 regularization instead of decoupled weight decay. Adding \lambda w to the gradient (coupled) is not the same as multiplying parameters by (1 - \eta \lambda); because the update is sign-based, a coupled penalty can be swallowed entirely by the sign. Lion’s standard form assumes decoupled decay.
- Forgetting that the update is sign-based. Normalizing or clipping gradients after applying the sign has no effect; clip before the momentum updates if needed. Also be aware that gradients near zero produce zero steps, and excessive gradient smoothing can stall learning.
- Copying learning rates and betas from Adam by default. Lion typically uses \beta_{1} \approx 0.9 and \beta_{2} \approx 0.99, but its learning rate is usually smaller than AdamW’s, with a correspondingly larger weight decay. Always tune LR and weight decay.
- Not zero-initializing momentum, or reusing stale buffers across parameter shape changes. The momentum buffer must match parameter shapes and be reset appropriately.
- Applying weight decay with an overly large product \eta \lambda \geq 1, which can zero out or invert parameter signs in a single step. Keep \eta \lambda well below 1.
Key Formulas
Pre-step Momentum
c_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t
Explanation: Interpolate between the stored momentum and the current gradient using coefficient beta1. This smoothed vector defines the step direction via its sign; it is not written back to the momentum buffer.
Decoupled Weight Decay
w \leftarrow (1 - \eta \lambda)\, w
Explanation: Shrink parameters independently of the gradient update by multiplying with (1 - lr * weight_decay). This improves generalization and mirrors AdamW-style decay.
Sign-based Parameter Update
w_t = w_{t-1} - \eta\, \operatorname{sign}(c_t)
Explanation: Move parameters by the learning rate in the elementwise direction of the beta1-interpolated momentum c_t. Magnitudes do not affect the step size; only the sign does.
Post-step Momentum Refresh
m_t = \beta_2\, m_{t-1} + (1 - \beta_2)\, g_t
Explanation: After stepping, refresh the stored momentum from its previous value with a potentially different coefficient beta2 to set up memory for the next iteration.
Elementwise Sign
\operatorname{sign}(x) = \begin{cases} +1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases}
Explanation: Defines the direction-only map used in Lion, applied elementwise. Zero entries lead to zero movement in that coordinate for that step.
EMA Template
m_t = \beta\, m_{t-1} + (1 - \beta)\, g_t
Explanation: Generic exponential moving average used for momentum. Recent gradients are weighted more heavily than older ones.
Mean Squared Error
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
Explanation: A standard supervised learning loss used in linear regression examples. Its gradient is used to train weights.
Per-Step Complexity
\text{Time: } O(d) \quad \text{Memory: } O(d)
Explanation: Each Lion step touches every parameter a constant number of times and keeps one momentum buffer of the same size. This is similar to SGD with momentum and uses half the optimizer state of Adam.
Complexity Analysis
Per iteration, Lion performs a constant number of elementwise passes over the d parameters, so the time cost is O(d), the same order as SGD with momentum. Optimizer state is a single momentum buffer of d entries (O(d) memory), versus two buffers (first and second moments) for Adam/AdamW; this halving of state is the main practical memory saving.
Code Examples
#include <bits/stdc++.h>
using namespace std;

struct Lion {
    double lr;         // learning rate
    double beta1;      // coefficient for the interpolated update direction
    double beta2;      // coefficient for the stored momentum refresh
    double wd;         // decoupled weight decay
    vector<double> m;  // momentum buffer

    Lion(size_t dim, double lr_ = 1e-3, double beta1_ = 0.9, double beta2_ = 0.99, double wd_ = 0.0)
        : lr(lr_), beta1(beta1_), beta2(beta2_), wd(wd_), m(dim, 0.0) {}

    static inline double sgn(double x) {
        return (x > 0.0) - (x < 0.0); // returns +1, 0, or -1 as double
    }

    void step(vector<double>& w, const vector<double>& g) {
        double factor = max(0.0, 1.0 - lr * wd); // guard against extreme settings
        for (size_t i = 0; i < w.size(); ++i) {
            // 1) pre-step momentum interpolation (beta1); direction only, not stored
            double c = beta1 * m[i] + (1.0 - beta1) * g[i];
            // 2) decoupled weight decay (AdamW-style)
            if (wd != 0.0) w[i] *= factor;
            // 3) parameter update with the sign of the interpolated momentum
            w[i] -= lr * sgn(c);
            // 4) post-step momentum refresh (beta2), from the previous momentum
            m[i] = beta2 * m[i] + (1.0 - beta2) * g[i];
        }
    }
};

// Rosenbrock function and its gradient in 2D: f(x,y) = (1-x)^2 + 100(y - x^2)^2
static inline double rosenbrock(const vector<double>& w) {
    double x = w[0], y = w[1];
    return (1 - x) * (1 - x) + 100.0 * (y - x * x) * (y - x * x);
}

static inline vector<double> rosenbrock_grad(const vector<double>& w) {
    double x = w[0], y = w[1];
    double dx = 2.0 * (x - 1.0) - 400.0 * x * (y - x * x); // 2(x-1) - 400x(y - x^2)
    double dy = 200.0 * (y - x * x);                       // 200(y - x^2)
    return {dx, dy};
}

int main() {
    // Initial point far from the valley
    vector<double> w = {-1.2, 1.0};
    Lion opt(2, /*lr=*/1e-3, /*beta1=*/0.9, /*beta2=*/0.99, /*wd=*/0.0);

    for (int t = 1; t <= 200000; ++t) {
        vector<double> g = rosenbrock_grad(w);
        opt.step(w, g);
        if (t % 20000 == 0) {
            cout << "iter " << t << ": f=" << fixed << setprecision(6) << rosenbrock(w)
                 << ", w=[" << w[0] << ", " << w[1] << "]\n";
        }
    }
    cout << "Final: f=" << rosenbrock(w) << ", w=[" << w[0] << ", " << w[1] << "]\n";
    return 0;
}
This example defines a simple Lion optimizer and uses it to minimize the non-convex Rosenbrock function in 2D. Each step uses both momentum coefficients (beta1 for the interpolated step direction, beta2 for refreshing the stored buffer), decoupled weight decay (disabled here), and a sign-based parameter update. It demonstrates stability on a challenging landscape without using second-moment adaptation.
#include <bits/stdc++.h>
using namespace std;

struct Lion {
    double lr, beta1, beta2, wd; // lr=learning rate, wd=weight decay
    vector<double> m;
    Lion(size_t dim, double lr_ = 3e-3, double beta1_ = 0.9, double beta2_ = 0.99, double wd_ = 1e-4)
        : lr(lr_), beta1(beta1_), beta2(beta2_), wd(wd_), m(dim, 0.0) {}
    static inline double sgn(double x) { return (x > 0.0) - (x < 0.0); }
    void step(vector<double>& w, const vector<double>& g) {
        double factor = max(0.0, 1.0 - lr * wd);
        for (size_t i = 0; i < w.size(); ++i) {
            double c = beta1 * m[i] + (1.0 - beta1) * g[i]; // direction only, not stored
            if (wd != 0.0) w[i] *= factor;                  // decoupled weight decay
            w[i] -= lr * sgn(c);                            // sign-based update
            m[i] = beta2 * m[i] + (1.0 - beta2) * g[i];     // refresh from previous m
        }
    }
};

int main() {
    // Generate synthetic data: y = X w* + noise
    mt19937 rng(42);
    int n = 400, d = 5;
    vector<vector<double>> X(n, vector<double>(d));
    vector<double> w_true(d), y(n);

    normal_distribution<double> nd(0.0, 1.0);
    for (int j = 0; j < d; ++j) w_true[j] = nd(rng) * 2.0; // ground-truth weights
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < d; ++j) X[i][j] = nd(rng);
        double yi = 0.0;
        for (int j = 0; j < d; ++j) yi += X[i][j] * w_true[j];
        yi += normal_distribution<double>(0.0, 0.1)(rng); // small noise
        y[i] = yi;
    }

    // Initialize model weights
    vector<double> w(d, 0.0);
    Lion opt(d, /*lr=*/3e-3, /*beta1=*/0.9, /*beta2=*/0.99, /*wd=*/1e-4);

    auto mse_and_grad = [&](const vector<double>& w) {
        vector<double> g(d, 0.0);
        double mse = 0.0;
        for (int i = 0; i < n; ++i) {
            double pred = 0.0;
            for (int j = 0; j < d; ++j) pred += X[i][j] * w[j];
            double e = pred - y[i];
            mse += e * e;
            for (int j = 0; j < d; ++j) g[j] += (2.0 / n) * e * X[i][j];
        }
        mse /= n;
        return pair<double, vector<double>>(mse, g);
    };

    for (int epoch = 1; epoch <= 200; ++epoch) {
        auto [loss, grad] = mse_and_grad(w);
        opt.step(w, grad);
        if (epoch % 20 == 0) {
            // Report loss and squared distance to the ground truth
            double dist = 0.0;
            for (int j = 0; j < d; ++j) dist += (w[j] - w_true[j]) * (w[j] - w_true[j]);
            cout << "epoch " << epoch << ": MSE=" << fixed << setprecision(6) << loss
                 << ", ||w - w_true||^2=" << dist << "\n";
        }
    }

    cout << "Learned weights:\n";
    for (int j = 0; j < d; ++j) cout << w[j] << (j + 1 == d ? '\n' : ' ');
    cout << "True weights:\n";
    for (int j = 0; j < d; ++j) cout << w_true[j] << (j + 1 == d ? '\n' : ' ');
    return 0;
}
We fit a linear regression model with mean squared error using Lion. The example shows how decoupled weight decay is applied multiplicatively to parameters, independent of the gradient. The optimizer performs O(d) work per epoch in addition to the O(nd) forward/backward costs.
#include <bits/stdc++.h>
using namespace std;

struct LionOneStep {
    double lr = 1e-3, b1 = 0.9, b2 = 0.99;
    vector<double> m;
    LionOneStep(size_t d) : m(d, 0.0) {}
    static inline double sgn(double x) { return (x > 0.0) - (x < 0.0); }
    void step(vector<double>& w, const vector<double>& g) {
        for (size_t i = 0; i < w.size(); ++i) {
            double c = b1 * m[i] + (1.0 - b1) * g[i]; // direction only, not stored
            w[i] -= lr * sgn(c);
            m[i] = b2 * m[i] + (1.0 - b2) * g[i];     // refresh from previous m
        }
    }
};

int main() {
    vector<double> w1 = {1.0, -2.0, 0.5};
    vector<double> w2 = w1;               // same start
    vector<double> g = {0.3, -4.0, 1.0};
    double k = 100.0;                     // positive scale factor
    vector<double> g_scaled = g;
    for (double& x : g_scaled) x *= k;

    LionOneStep opt1(w1.size()), opt2(w2.size());
    opt1.step(w1, g);
    opt2.step(w2, g_scaled);

    cout << fixed << setprecision(6);
    cout << "After one step with g:   w = [" << w1[0] << ", " << w1[1] << ", " << w1[2] << "]\n";
    cout << "After one step with k*g: w = [" << w2[0] << ", " << w2[1] << ", " << w2[2] << "]\n";
    cout << "(They are identical because sign(k*g) = sign(g) for k > 0.)\n";
    return 0;
}
This small program shows that scaling the gradient by a positive constant does not change Lion’s update: both optimizers start from zero momentum, and sign(k·g) = sign(g) for k > 0, so the parameter vectors after one step match exactly.