⚙️ Algorithm · Intermediate

Lion Optimizer

Key Points

  • Lion (Evolved Sign Momentum) is a first-order, sign-based optimizer discovered through automated program search.
  • It updates parameters using only the sign of an exponential moving average of gradients instead of their magnitudes.
  • Lion keeps just one momentum buffer, so it uses substantially less memory than Adam while retaining strong empirical performance.
  • Each step blends momentum and gradient with one coefficient to get the step direction, updates the stored momentum with a second coefficient, and applies decoupled weight decay like AdamW.
  • Because it steps by sign directions, Lion is invariant to positive rescaling of gradients and can be robust to noisy training.
  • Getting the momentum updates out of order, or mixing coupled L2 regularization with decoupled weight decay, will break the algorithm's behavior.
  • Per iteration, Lion takes O(d) time and O(d) memory, where d is the number of parameters, similar to SGD with momentum.
  • Lion is a good default for large models when memory is tight but you still want adaptive-like stability without the second moment.

Prerequisites

  • Gradient Descent — Lion is a variant of gradient-based optimization and builds on the idea of using gradients to update parameters.
  • Momentum (Exponential Moving Average) — Understanding EMA and how it stabilizes noisy gradients is essential to Lion's sign-of-momentum update.
  • Adam/AdamW Basics — Comparing Lion to Adam/AdamW clarifies the differences in state, adaptation, and decoupled weight decay.
  • Vectorized Elementwise Operations — Lion applies sign and scaling elementwise to parameter vectors.
  • Regularization and Weight Decay — Lion employs decoupled weight decay; knowing how this differs from L2 regularization prevents implementation errors.

Detailed Explanation


01 Overview

Lion is a modern optimization algorithm for training machine learning models that was discovered through program search, not hand-designed. Unlike Adam, which rescales steps by an estimate of gradient variance, Lion largely ignores gradient magnitudes and instead steps in the direction of the sign of a smoothed gradient (a momentum term). This makes the algorithm simpler and more memory-efficient because it keeps only a single first-moment buffer rather than both first and second moments. A typical Lion step blends the stored momentum with the current gradient to form a direction, applies decoupled weight decay, moves parameters by the learning rate times the elementwise sign of that blend, and then updates the stored momentum with a second coefficient. In practice, Lion has shown strong empirical results on large-scale deep learning tasks (e.g., vision transformers and language models), while having a smaller memory footprint than Adam or AdamW. Conceptually, it sits between SGD with momentum (which uses raw momentum magnitudes) and signSGD (which uses only gradient signs): its sign-of-momentum direction tends to be more stable than raw gradient signs.

02 Intuition & Analogies

Imagine hiking to the bottom of a valley on a foggy day. You can feel which way is downhill (the direction), but your altimeter is noisy, so you don’t trust exactly how steep it is. If you only follow the direction of the slope—without overreacting to how intense the slope seems—you’ll still keep moving toward the valley, but in a more stable, less jittery way. That’s Lion: it trusts the direction (the sign) more than the magnitude. However, to avoid zig-zagging because of noise, Lion first smooths the perceived slope over time using momentum. Think of momentum as averaging your last few compass readings to get a more reliable direction. By stepping in the sign of this averaged direction, you avoid being fooled by sudden gusts (noisy gradients). The two momentum coefficients in Lion act like two dials: one determines the direction you step right now (short-term smoothing), and the other sets up your memory for the next step (slightly longer-term smoothing). Decoupled weight decay is like a gentle elastic band pulling you toward the origin independently of the gradient signal—this keeps your parameters from drifting too far, improving generalization. Because Lion doesn’t rely on second-moment estimates (variance of gradients), it saves memory and avoids some of the over-smoothing that can happen with Adam’s adaptive denominator. The trade-off is that you lose per-parameter scaling, but the sign-based step can be surprisingly robust across layers and parameter scales.

03 Formal Definition

Consider parameters w_t ∈ R^d and stochastic gradients g_t = ∇ℓ(w_t; ξ_t). Lion maintains a single momentum buffer m_t and uses two coefficients β₁, β₂ ∈ (0, 1), a learning rate η > 0, and a (decoupled) weight decay λ ≥ 0. A canonical Lion step proceeds as: 1) Update direction: c_t = β₁ m_t + (1 − β₁) g_t. 2) Decoupled weight decay: w_{t+1/2} = (1 − ηλ) w_t. 3) Parameter update by sign of the blend: w_{t+1} = w_{t+1/2} − η sign(c_t), where sign(·) is applied elementwise. 4) Momentum update: m_{t+1} = β₂ m_t + (1 − β₂) g_t. Note that both the direction blend and the momentum update start from the previous momentum m_t; the blend c_t is used only for the step and is never stored. No second-moment (variance) buffer is used, and there is typically no bias correction. The sign operator is defined elementwise as sign(x_i) = 1 if x_i > 0, −1 if x_i < 0, and 0 otherwise. The algorithm can be viewed as signSGD with a smoothed direction vector (momentum) and with decoupled weight decay akin to AdamW. The two coefficients give the immediate step direction and the carried memory different time scales.

04 When to Use

Use Lion when memory is constrained but you still want an optimizer that is more robust than vanilla SGD with momentum. It works well for large-scale deep networks (e.g., vision transformers, CNNs, and language models) where Adam/AdamW are commonly used, but you want to reduce optimizer state from two buffers (m and v) to one buffer (m). Lion is particularly helpful when gradient magnitudes are noisy or vary widely across layers—its sign-based update dampens sensitivity to such scale differences. If you rely on mixed precision (FP16/BF16) training, Lion’s reduced state can also improve throughput and memory headroom for larger batch sizes or models. Prefer Lion over Adam when you suspect second-moment adaptation is over-smoothing updates or causing sluggish convergence late in training. Prefer Adam/AdamW when per-parameter adaptive scaling is crucial (e.g., very sparse gradients with highly uneven curvature) or when you need established convergence guarantees under certain conditions. In convex or well-conditioned problems, SGD with momentum might suffice; Lion often shines in complex, high-dimensional, non-convex settings where robust directional updates and lower memory are attractive.

⚠️Common Mistakes

• Getting the order of operations wrong. The step uses sign(β₁ m_t + (1 − β₁) g_t), computed from the previous momentum, and the buffer is then updated as m_{t+1} = β₂ m_t + (1 − β₂) g_t, also from the previous momentum. Overwriting the buffer with the β₁ blend before the β₂ update, or updating the buffer before taking the step, changes the algorithm.
• Using coupled L2 regularization instead of decoupled weight decay. Adding λw to the gradient (coupled) is not the same as multiplying parameters by (1 − ηλ); the coupled term is largely erased by the sign. Lion's standard form assumes decoupled decay.
• Forgetting that the update is sign-based. Normalizing or clipping gradients after applying sign has no effect; clip before the momentum blend if needed. Coordinates where the blended momentum is exactly zero take zero steps, so excessive gradient smoothing can stall learning.
• Copying Adam's hyperparameters by default. Lion typically uses β₁ ≈ 0.9 and β₂ ≈ 0.99, but because every coordinate moves by the full η each step, its best learning rate is usually several times smaller than AdamW's, with a correspondingly larger weight decay. Always tune LR and weight decay.
• Not zero-initializing momentum, or reusing stale buffers across parameter shape changes. Momentum must match parameter shapes and be reset appropriately.
• Letting the product ηλ grow too large. If ηλ ≥ 1, the decay factor (1 − ηλ) zeroes out parameters or flips their signs; keep ηλ ≪ 1.

Key Formulas

Update Direction (Pre-step Blend)

c_t = β₁ m_t + (1 − β₁) g_t

Explanation: Blend the previous momentum toward the current gradient using coefficient β₁. This smoothed vector defines the step direction via its sign; it is used for the step only and is not stored.

Decoupled Weight Decay

w_{t+1/2} = (1 − ηλ) w_t

Explanation: Shrink parameters independently of the gradient update by multiplying with (1 − lr · weight_decay). This improves generalization and mirrors AdamW-style decay.

Sign-based Parameter Update

w_{t+1} = w_{t+1/2} − η sign(c_t)

Explanation: Move parameters by the learning rate in the elementwise direction of the blended momentum. Its magnitude does not affect the step size; only the sign does.

Momentum Update

m_{t+1} = β₂ m_t + (1 − β₂) g_t

Explanation: After stepping, update the stored momentum from the previous momentum m_t (not from the blend c_t) using the second coefficient β₂, setting up memory for the next iteration.

Elementwise Sign

sign(x_i) = 1 if x_i > 0, 0 if x_i = 0, −1 if x_i < 0

Explanation: Defines the direction-only map used in Lion. Zero entries lead to zero movement in that coordinate for that step.

EMA Template

m_t = β m_{t−1} + (1 − β) g_t

Explanation: Generic exponential moving average used for momentum. Recent gradients are weighted more heavily than older ones.
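To make the "recent gradients weigh more" claim concrete, the recursion can be unrolled and checked against its closed form: the gradient j steps back carries weight (1 − β)β^j (a small illustrative sketch):

```cpp
#include <cmath>
#include <vector>

// Recursive EMA: m_t = beta * m_{t-1} + (1 - beta) * g_t, with m_0 = 0.
double ema(const std::vector<double>& g, double beta) {
    double m = 0.0;
    for (double gt : g) m = beta * m + (1.0 - beta) * gt;
    return m;
}

// Unrolled form of the same recursion: the gradient at step t contributes
// with weight (1 - beta) * beta^(T - 1 - t), decaying geometrically with age.
double ema_unrolled(const std::vector<double>& g, double beta) {
    double m = 0.0;
    int T = (int)g.size();
    for (int t = 0; t < T; ++t) m += (1.0 - beta) * std::pow(beta, T - 1 - t) * g[t];
    return m;
}
```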

Mean Squared Error

MSE(w) = (1/n) Σ_{i=1}^{n} (y_i − x_iᵀ w)²

Explanation: A standard supervised learning loss used in linear regression examples. Its gradient is used to train the weights.

Per-Step Complexity

O(d) time, O(d) memory per step

Explanation: Each Lion step touches every parameter once and keeps one momentum buffer of the same size. This is similar to SGD with momentum and half the optimizer state of Adam.

Complexity Analysis

Let d be the number of parameters. Every Lion step updates the momentum buffer elementwise and then performs an elementwise weight-decay multiplication and a sign-based parameter update. These are all linear passes over the parameter vector, so the per-step time complexity is O(d). The algorithm maintains a single momentum buffer m of size d, plus the parameters themselves, yielding O(d) extra memory. In contrast, Adam/AdamW maintain two buffers (m and v), resulting in O(2d) optimizer state, so Lion saves roughly 50% of the optimizer-state memory. When training on a dataset with n samples and mini-batch size b, the number of steps per epoch is n/b. Ignoring model forward/backward costs (which dominate in deep learning), the optimizer overhead across one epoch is O((n/b) · d). In practice, the computational cost of the sign operation is negligible compared to the matrix multiplications and convolutions in gradient computation. Lion also does not require the elementwise square roots or divisions of Adam, which can be advantageous on some hardware. Space-wise, Lion's decoupled weight decay requires no additional buffers; it is a simple multiplicative factor applied to parameters each step. If gradients are computed in mixed precision, m and the parameters can be stored in FP32 for numerical stability, with gradients cast as needed; this does not change the asymptotic complexity but slightly increases constant factors. Overall, Lion offers a favorable time/space trade-off, particularly in memory-constrained settings, while preserving strong empirical convergence on large-scale problems.

Code Examples

Minimal Lion optimizer on the 2D Rosenbrock function
#include <bits/stdc++.h>
using namespace std;

struct Lion {
    double lr;          // learning rate
    double beta1;       // coefficient for the step-direction blend
    double beta2;       // coefficient for the stored-momentum update
    double wd;          // decoupled weight decay
    vector<double> m;   // momentum buffer

    Lion(size_t dim, double lr_ = 1e-3, double beta1_ = 0.9,
         double beta2_ = 0.99, double wd_ = 0.0)
        : lr(lr_), beta1(beta1_), beta2(beta2_), wd(wd_), m(dim, 0.0) {}

    static inline double sgn(double x) {
        return (x > 0.0) - (x < 0.0); // returns +1, 0, or -1 as double
    }

    void step(vector<double>& w, const vector<double>& g) {
        double factor = max(0.0, 1.0 - lr * wd); // guard against extreme settings
        for (size_t i = 0; i < w.size(); ++i) {
            // 1) step direction: blend previous momentum and gradient with beta1
            double c = beta1 * m[i] + (1.0 - beta1) * g[i];
            // 2) decoupled weight decay (AdamW-style)
            w[i] *= factor;
            // 3) parameter update by the sign of the blend
            w[i] -= lr * sgn(c);
            // 4) stored momentum update from the PREVIOUS momentum, with beta2
            m[i] = beta2 * m[i] + (1.0 - beta2) * g[i];
        }
    }
};

// Rosenbrock function and its gradient in 2D: f(x,y) = (1-x)^2 + 100(y-x^2)^2
static inline double rosenbrock(const vector<double>& w) {
    double x = w[0], y = w[1];
    return (1 - x) * (1 - x) + 100.0 * (y - x * x) * (y - x * x);
}

static inline vector<double> rosenbrock_grad(const vector<double>& w) {
    double x = w[0], y = w[1];
    double dx = 2.0 * (x - 1.0) - 400.0 * x * (y - x * x); // 2(x-1) - 400x(y - x^2)
    double dy = 200.0 * (y - x * x);                       // 200(y - x^2)
    return {dx, dy};
}

int main() {
    // Initial point far from the valley
    vector<double> w = {-1.2, 1.0};
    Lion opt(2, /*lr=*/1e-3, /*beta1=*/0.9, /*beta2=*/0.99, /*wd=*/0.0);

    for (int t = 1; t <= 200000; ++t) {
        vector<double> g = rosenbrock_grad(w);
        opt.step(w, g);
        if (t % 20000 == 0) {
            cout << "iter " << t << ": f=" << fixed << setprecision(6) << rosenbrock(w)
                 << ", w=[" << w[0] << ", " << w[1] << "]\n";
        }
    }
    cout << "Final: f=" << rosenbrock(w) << ", w=[" << w[0] << ", " << w[1] << "]\n";
    return 0;
}

This example defines a simple Lion optimizer and uses it to minimize the non-convex Rosenbrock function in 2D. Each step blends the previous momentum with the gradient (beta1) to get a sign-based step direction, applies decoupled weight decay (disabled here), and then updates the stored momentum from the previous momentum (beta2). It demonstrates stability on a challenging landscape without second-moment adaptation.

Time: O(T d), where T is the number of iterations and d = 2 here
Space: O(d) for the momentum buffer
Linear regression with Lion and decoupled weight decay
#include <bits/stdc++.h>
using namespace std;

struct Lion {
    double lr, beta1, beta2, wd; // lr = learning rate, wd = weight decay
    vector<double> m;
    Lion(size_t dim, double lr_ = 3e-3, double beta1_ = 0.9,
         double beta2_ = 0.99, double wd_ = 1e-4)
        : lr(lr_), beta1(beta1_), beta2(beta2_), wd(wd_), m(dim, 0.0) {}
    static inline double sgn(double x) { return (x > 0.0) - (x < 0.0); }
    void step(vector<double>& w, const vector<double>& g) {
        double factor = max(0.0, 1.0 - lr * wd);
        for (size_t i = 0; i < w.size(); ++i) {
            double c = beta1 * m[i] + (1.0 - beta1) * g[i]; // step-direction blend
            w[i] *= factor;                                 // decoupled weight decay
            w[i] -= lr * sgn(c);                            // sign-based update
            m[i] = beta2 * m[i] + (1.0 - beta2) * g[i];     // momentum from previous m
        }
    }
};

int main() {
    // Generate synthetic data: y = X w* + noise
    mt19937 rng(42);
    int n = 400, d = 5;
    vector<vector<double>> X(n, vector<double>(d));
    vector<double> w_true(d), y(n);

    normal_distribution<double> nd(0.0, 1.0);
    for (int j = 0; j < d; ++j) w_true[j] = nd(rng) * 2.0; // ground-truth weights
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < d; ++j) X[i][j] = nd(rng);
        double yi = 0.0;
        for (int j = 0; j < d; ++j) yi += X[i][j] * w_true[j];
        yi += normal_distribution<double>(0.0, 0.1)(rng); // small noise
        y[i] = yi;
    }

    // Initialize model weights
    vector<double> w(d, 0.0);
    Lion opt(d, /*lr=*/3e-3, /*beta1=*/0.9, /*beta2=*/0.99, /*wd=*/1e-4);

    auto mse_and_grad = [&](const vector<double>& w) {
        vector<double> g(d, 0.0);
        double mse = 0.0;
        for (int i = 0; i < n; ++i) {
            double pred = 0.0;
            for (int j = 0; j < d; ++j) pred += X[i][j] * w[j];
            double e = pred - y[i];
            mse += e * e;
            for (int j = 0; j < d; ++j) g[j] += (2.0 / n) * e * X[i][j];
        }
        mse /= n;
        return pair<double, vector<double>>(mse, g);
    };

    for (int epoch = 1; epoch <= 200; ++epoch) {
        auto [loss, grad] = mse_and_grad(w);
        opt.step(w, grad);
        if (epoch % 20 == 0) {
            // Report loss and squared distance to the ground truth
            double dist = 0.0;
            for (int j = 0; j < d; ++j) dist += (w[j] - w_true[j]) * (w[j] - w_true[j]);
            cout << "epoch " << epoch << ": MSE=" << fixed << setprecision(6) << loss
                 << ", ||w - w_true||^2=" << dist << "\n";
        }
    }

    cout << "Learned weights:\n";
    for (int j = 0; j < d; ++j) cout << w[j] << (j + 1 == d ? '\n' : ' ');
    cout << "True weights:\n";
    for (int j = 0; j < d; ++j) cout << w_true[j] << (j + 1 == d ? '\n' : ' ');
    return 0;
}

We fit a linear regression model with mean squared error using Lion. The example shows how decoupled weight decay is applied multiplicatively to parameters, independent of the gradient. The optimizer performs O(d) work per epoch in addition to the O(nd) forward/backward costs.

Time: O(E (nd + d)) for E epochs; optimizer overhead is O(E d)
Space: O(d) for optimizer state plus O(nd) for data storage in this example
Demonstrating Lion's scale invariance to gradient magnitudes
#include <bits/stdc++.h>
using namespace std;

struct LionOneStep {
    double lr = 1e-3, b1 = 0.9, b2 = 0.99;
    vector<double> m;
    LionOneStep(size_t d) : m(d, 0.0) {}
    static inline double sgn(double x) { return (x > 0.0) - (x < 0.0); }
    void step(vector<double>& w, const vector<double>& g) {
        for (size_t i = 0; i < w.size(); ++i) {
            double c = b1 * m[i] + (1.0 - b1) * g[i]; // step-direction blend
            w[i] -= lr * sgn(c);
            m[i] = b2 * m[i] + (1.0 - b2) * g[i];     // momentum from previous m
        }
    }
};

int main() {
    vector<double> w1 = {1.0, -2.0, 0.5};
    vector<double> w2 = w1;                // same start
    vector<double> g = {0.3, -4.0, 1.0};
    double k = 100.0;                      // positive scale factor
    vector<double> g_scaled = g;
    for (double& x : g_scaled) x *= k;

    LionOneStep opt1(w1.size()), opt2(w2.size());
    opt1.step(w1, g);
    opt2.step(w2, g_scaled);

    cout << fixed << setprecision(6);
    cout << "After one step with g:   w = [" << w1[0] << ", " << w1[1] << ", " << w1[2] << "]\n";
    cout << "After one step with k*g: w = [" << w2[0] << ", " << w2[1] << ", " << w2[2] << "]\n";
    cout << "(They should be identical because sign(k*g) = sign(g) for k > 0.)\n";
    return 0;
}

This small program shows that scaling the gradient by a positive constant does not change Lion's update, because sign(k·c) = sign(c) for k > 0. Here the momentum starts at zero, so both runs blend identically signed directions and the parameter vectors after one step match exactly; more generally, scaling the whole gradient history by the same positive constant leaves every sign step unchanged.

Time: O(d) for one step
Space: O(d) for the momentum buffer
Tags: lion optimizer · sign-based optimization · momentum · decoupled weight decay · adamw alternative · first-order optimizer · program search · scale invariance · stochastic gradient descent · exponential moving average · deep learning optimizer · memory efficient optimizer · non-convex optimization · sign momentum · c++ implementation