📚 Theory · Intermediate

Lottery Ticket Hypothesis

Key Points

  • The Lottery Ticket Hypothesis (LTH) says that inside a large dense neural network there exist small sparse subnetworks that, when trained in isolation from their original initialization, can reach accuracy comparable to the full model.
  • These subnetworks are found by training the dense model, pruning the smallest-magnitude weights, and resetting the remaining weights to their initial values before retraining.
  • A binary mask m selects which parameters stay; pruned weights are fixed to zero and never updated during training.
  • Unstructured pruning (individual weights) commonly reveals winning tickets at high sparsity, while structured pruning (entire neurons/filters) is more hardware-friendly but usually needs lower sparsity to work well.
  • Winning tickets tend to train faster in early epochs and can generalize as well as the dense model when found properly.
  • Rewinding to the exact initial weights is the classic LTH; variants rewind to an early training iteration to stabilize deeper networks.
  • Magnitude pruning can be implemented efficiently using selection (nth_element), making the search computationally practical.
  • Although LTH reduces inference cost through sparsity, the search still requires at least one dense training run (and often several pruning rounds).

Prerequisites

  • Gradient-based optimization (SGD) — Understanding how weights are updated and how loss gradients flow is essential to implement pruning and rewinding correctly.
  • Neural network fundamentals — Knowing layers, activations, and parameter counts helps reason about sparsity and its impact.
  • Overparameterization and regularization — Motivates why large models can contain effective sparse subnetworks and how pruning acts as implicit regularization.
  • Probability and random initialization — The hypothesis depends on the luck of the draw at initialization; reproducibility requires seed control.
  • Algorithmic selection and sorting — Efficient pruning uses quantile selection (nth_element) rather than full sorting to compute thresholds.

Detailed Explanation


01 Overview

The Lottery Ticket Hypothesis (LTH) is an idea about why large neural networks work so well. It claims that within an overparameterized dense network there exist sparse subnetworks—called winning tickets—that, when trained alone from their original initialization, can reach test accuracy comparable to the full network. This means many parameters are not strictly necessary for learning; they act like a safety net that makes optimization easier, while a much smaller subset is actually sufficient to learn the task.

Practically, a winning ticket is discovered by three steps: train the dense network, prune a fraction of small-magnitude weights to create a binary mask, and reset (rewind) the surviving weights to their original initial values before retraining. If the resulting sparse network matches the dense model’s accuracy with far fewer parameters, we have identified a winning ticket.

LTH provides a lens to understand overparameterization and optimization: large networks are easy to train because they hide many good sparse solutions within them. It also suggests a path to computational savings, especially at inference time, by deploying only the sparse subnetwork. However, it does not eliminate the upfront cost of the initial training needed to discover the mask, and the benefits depend on pruning strategy, sparsity level, and the problem domain.

02 Intuition & Analogies

Imagine you buy a huge roll of scratch-off lottery tickets. You don’t need all of them to win; you just need the right few. A large dense network is like that roll—full of many potential subnetworks with different initial weights. Training the large model is like scratching all the tickets just enough to see which ones look promising. Pruning removes the duds, leaving the tickets that can actually win. Resetting to the original numbers on those winning tickets and playing them again is like training the subnetwork alone from its original initialization.

Another analogy is sculpting from a block of marble. The full dense network is the block; pruning is the chisel removing unneeded pieces. The sculpture (the winning subnetwork) was always inside the block—you just had to remove the extra material to reveal it. Importantly, you don’t rebuild the sculpture from fresh stone; you return to the block’s original structure for the parts you keep (rewinding to initial weights), because those original shapes help the final piece form well during training.

Finally, think of a mixing console with hundreds of knobs (weights). The exact combination to produce a clean sound may only require a smaller set of knobs. Tuning all knobs first helps you learn which ones matter (training dense); then you tape over the irrelevant knobs (mask), reset the important ones to their starting positions (rewind), and re-tune just those to quickly achieve the same sound quality (retraining the sparse net).

03 Formal Definition

Consider a parameter vector w ∈ ℝ^P, initialized as w₀ by a random initializer and trained by SGD (or similar) with learning-rate schedule η(t) on dataset D. A binary mask m ∈ {0,1}^P defines a subnetwork by the elementwise product m ⊙ w. The Lottery Ticket Hypothesis states: there exists a mask m such that, when training only the parameters indexed by m from their initial values m ⊙ w₀ using the same optimizer and schedule, the resulting test accuracy and convergence speed are comparable to those of training the full dense network from w₀.

Operationally, a common way to find m is magnitude pruning: after training for some epochs (or to completion), compute a threshold τ so that a target sparsity is reached, and set m_i = 1(|w_i| > τ). The rewinding step replaces current weights with the original initialization on the kept indices: w ← m ⊙ w₀. Variants called weight rewinding reset to an early training checkpoint w_k (k small) rather than w₀, which can improve stability in deeper nets.

A subnetwork that matches accuracy is called a winning ticket; its sparsity is s = 1 − ∥m∥₀/P. Empirically, winning tickets often exhibit faster early training (lower loss in initial epochs) and generalization comparable to the dense model across image and NLP tasks at moderate to high sparsities, especially with unstructured pruning.
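The definition above can be condensed into two displayed equations (same symbols as the paragraph):

```latex
% Existence claim: some sparse mask, trained from initialization,
% matches the dense network's accuracy.
\exists\, m \in \{0,1\}^{P},\ \|m\|_{0} \ll P :\quad
\mathrm{acc}\!\left(\mathrm{Train}(m \odot w_{0})\right) \approx
\mathrm{acc}\!\left(\mathrm{Train}(w_{0})\right)

% Mask construction (global magnitude pruning) and resulting sparsity.
m_{i} = \mathbf{1}\!\left(|w_{i}| > \tau\right), \qquad
\tau = Q_{p}\!\left(|w|\right), \qquad
s = 1 - \frac{\|m\|_{0}}{P}
```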

04 When to Use

Use LTH-inspired pruning when you want to deploy lighter models without significantly sacrificing accuracy, especially when inference latency, memory, or energy are constrained (e.g., mobile apps, embedded devices). It is also useful for model exploration: by revealing which connections matter, pruning can improve interpretability and guide architecture design. In research settings, LTH provides hypotheses to test about overparameterization and optimization landscapes.

Practically, choose unstructured magnitude pruning to maximize sparsity and accuracy with minimal code changes, particularly when your hardware or library supports sparse tensors. Use structured pruning (whole neurons, channels, or attention heads) when you need speed on dense hardware like standard GPUs/CPUs without specialized sparse kernels. Apply iterative pruning (small percentages over multiple rounds) if one-shot pruning hurts accuracy, or when seeking very high sparsity. Consider rewinding to a small early-epoch checkpoint when training very deep networks, as initialization rewinding may be unstable there.
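As a sanity check on iterative schedules: pruning a fraction p of the surviving weights each round leaves (1 − p)^R of the weights after R rounds, so the overall sparsity is 1 − (1 − p)^R. A minimal sketch of this arithmetic (the 20%-per-round rate is only an example):

```cpp
#include <cmath>

// Overall sparsity after R rounds of pruning a fraction p of the
// currently surviving weights each round: s(R) = 1 - (1 - p)^R.
double iterative_sparsity(double p, int rounds) {
    return 1.0 - std::pow(1.0 - p, rounds);
}
```

For 20% per round, four rounds give 1 − 0.8⁴ ≈ 0.59, so reaching very high sparsity (say 95%+) requires many more rounds than a one-shot prune, which is exactly why iterative schedules are gentler on accuracy.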

Avoid LTH for tiny datasets or underparameterized models where pruning removes critical capacity. Also avoid when training cost is extremely high and you cannot afford the initial dense run, unless you can reuse a pre-trained model or perform pruning during fine-tuning.

⚠️ Common Mistakes

  • Pruning too aggressively in one shot: Removing a large fraction of weights at once can destroy signal paths. Prefer iterative pruning (e.g., 10–20% per round) to maintain performance.
  • Forgetting to rewind: Training the pruned network from its post-training weights is a different setting (fine-tuning a pruned model) and does not test the LTH claim. Proper LTH requires resetting kept weights to their original (or early-iteration) values before retraining.
  • Pruning biases and normalization parameters indiscriminately: Many implementations prune only kernel weights, not biases or BatchNorm scales, to avoid destabilizing training. Be explicit about what can be pruned.
  • Ignoring optimizer state: If you rewind weights but not optimizer state (e.g., momentum), you may inadvertently change the training dynamics. For strict LTH tests, also reset optimizer state.
  • Misinterpreting sparsity: Reporting fraction pruned without clarifying whether it’s per-layer, global, or structured can be misleading. Global magnitude pruning across all layers typically yields better accuracy at a given overall sparsity.
  • Expecting training-time savings automatically: LTH mainly saves inference cost. Unless your framework exploits sparsity during training, the training-time may not decrease proportionally.
  • Not controlling randomness: Different seeds can produce different tickets. For fair comparisons, fix seeds, learning rates, and data orders, and repeat over multiple seeds.

Key Formulas

Masked Network

f(x; m ⊙ w)

Explanation: The model’s output uses only parameters selected by the binary mask m. Pruned weights (where m_i = 0) have no effect on the forward pass.

Sparsity

s = 1 − ∥m∥₀ / P

Explanation: Sparsity s is the fraction of parameters that are zero. Here P is the total number of parameters and ∥m∥₀ counts how many are kept (nonzero).

Magnitude Pruning Rule

m_i = 1(|w_i| > τ),  τ = Q_p(|w|)

Explanation: Keep weights whose magnitude exceeds a threshold τ chosen as the p-th quantile of |w| to achieve a target sparsity. This is global unstructured pruning.
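A standalone sketch of this rule using selection instead of a full sort (the function names here are illustrative; the complete examples below embed the same logic inside the network struct):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Global unstructured magnitude threshold: tau is the k-th smallest |w|,
// with k = floor(p * P); weights with |w| <= tau are then pruned.
double magnitude_threshold(const std::vector<double>& w, double p) {
    std::vector<double> mags(w.size());
    for (size_t i = 0; i < w.size(); ++i) mags[i] = std::abs(w[i]);
    size_t k = static_cast<size_t>(std::floor(p * mags.size()));
    if (k == 0) return -1.0; // nothing to prune: every |w| exceeds a negative tau
    // nth_element partially partitions in expected O(P) -- no O(P log P) sort
    std::nth_element(mags.begin(), mags.begin() + (k - 1), mags.end());
    return mags[k - 1];
}

// Apply the rule m_i = 1(|w_i| > tau) as a 0/1 mask.
std::vector<double> magnitude_mask(const std::vector<double>& w, double tau) {
    std::vector<double> m(w.size());
    for (size_t i = 0; i < w.size(); ++i) m[i] = (std::abs(w[i]) > tau) ? 1.0 : 0.0;
    return m;
}
```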

Rewinding Step

w ← m ⊙ w₀  (rewind to init)

Explanation: After determining the mask using trained weights, set the surviving parameters back to their initial values before retraining the sparse network.

Training Cost Model

T_train ≈ E · N · C_fb(P)

Explanation: Total training time is approximately epochs E times number of samples N times cost per forward-backward pass, which depends on parameter count P.

Binary Cross-Entropy

BCE(y, ŷ) = −[y·log(ŷ) + (1−y)·log(1−ŷ)]

Explanation: Common loss for binary classification with a sigmoid output. Its gradient leads to the simple error term (ŷ − y) at the output layer.

Accuracy

acc = (1/N) · Σᵢ₌₁ᴺ 1(ŷᵢ = yᵢ)

Explanation: Measures the fraction of correct predictions on a dataset of size N. For binary problems, predictions are often thresholded at 0.5.

Quantile via Selection

τ = select_k(|w|),  k = ⌊p·P⌋

Explanation: The pruning threshold τ can be found by selecting the k-th smallest magnitude using an expected O(P) selection algorithm like nth_element, avoiding a full sort.

Complexity Analysis

Let P be the number of parameters, N the number of training samples, and E the number of epochs. A dense forward-backward pass for a fully connected network costs O(P) arithmetic up to constant factors that depend on layer shapes; thus total dense training is O(E N P). Discovering a lottery ticket typically runs at least one dense training phase to obtain weight magnitudes, so the search has this baseline cost.

Magnitude pruning requires computing the absolute values of all prunable parameters, which is O(P). If you sort to get the pruning threshold, it is O(P log P), but using selection (e.g., nth_element) reduces it to expected O(P); quickselect-style implementations have a super-linear worst case, though it is rarely hit in practice. Applying the mask and zeroing pruned weights is linear, O(P). Retraining the sparse subnetwork ideally reduces FLOPs proportionally to the number of active parameters, yielding O(E' N (1−s) P), where s is sparsity and E' is the retraining epochs (often similar to E). However, actual speedups depend on hardware and libraries: unstructured sparsity may not accelerate dense kernels, while structured sparsity (pruning neurons/channels) can yield wall-clock gains on standard hardware.

Memory usage during dense training is O(P) for parameters plus O(A) for activations (A scales with batch size and layer widths). Storing the mask adds O(P) memory. If you keep the initial weights for rewinding, you either store a second copy, O(P), or store the random seed and regeneration procedure (O(1) metadata). Overall, the computational bottleneck is the initial dense training; pruning and masking are linear-time overheads amortized over training.
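The seed-storage idea can be sketched in a few lines: a deterministic generator with a fixed seed reproduces the initial weights on demand, so only the (seed, size, distribution) metadata needs to be kept (the distribution parameters below are illustrative and must match whatever the original initializer used):

```cpp
#include <random>
#include <vector>

// Regenerate the initial weights from O(1) metadata instead of storing an
// O(P) copy: rerunning the same seeded generator yields identical draws.
std::vector<double> regenerate_init(unsigned seed, size_t n) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> nd(0.0, 0.5); // must match the original init
    std::vector<double> w0(n);
    for (size_t i = 0; i < n; ++i) w0[i] = nd(rng);
    return w0;
}
```

Rewinding then becomes w_i ← m_i · w0_i over a freshly regenerated w0, trading an O(P) stored copy for an O(P) regeneration pass at rewind time.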

Code Examples

One-shot Lottery Ticket on XOR: Train, Prune by Magnitude, Rewind, Retrain
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <cmath>
#include <numeric>

// Minimal 2-layer MLP for binary classification on XOR
// Demonstrates: dense training -> global unstructured magnitude pruning -> rewind to init -> retrain

struct MLP {
  int in, hid, out;
  // Parameters
  std::vector<double> W1, b1, W2, b2;
  // Initial parameters for rewinding
  std::vector<double> W1_init, b1_init, W2_init, b2_init;
  // Masks (1=keep, 0=prune). We only prune weights (W1, W2); biases kept by default
  std::vector<double> M1, M2; // same shapes as W1, W2
  std::mt19937 rng;

  MLP(int in_, int hid_, int out_, unsigned seed=42): in(in_), hid(hid_), out(out_),
      W1(in_*hid_), b1(hid_), W2(hid_*out_), b2(out_),
      W1_init(in_*hid_), b1_init(hid_), W2_init(hid_*out_), b2_init(out_),
      M1(in_*hid_, 1.0), M2(hid_*out_, 1.0), rng(seed) {
    std::normal_distribution<double> nd(0.0, 0.5); // Xavier-ish for tiny net
    for (size_t i=0;i<W1.size();++i) W1[i]=W1_init[i]=nd(rng);
    for (size_t i=0;i<b1.size();++i) b1[i]=b1_init[i]=0.0;
    for (size_t i=0;i<W2.size();++i) W2[i]=W2_init[i]=nd(rng);
    for (size_t i=0;i<b2.size();++i) b2[i]=b2_init[i]=0.0;
  }

  static double sigmoid(double x){ return 1.0/(1.0+std::exp(-x)); }
  static double d_relu(double x){ return x>0?1.0:0.0; }

  // Forward for a single sample
  double forward(const std::vector<double>& x,
                 std::vector<double>& z1,
                 std::vector<double>& a1,
                 double &z2) {
    z1.assign(hid,0.0);
    for(int j=0;j<hid;++j){
      double s=b1[j];
      for(int i=0;i<in;++i){ s += W1[j*in+i]*x[i]; }
      z1[j]=s;
    }
    a1.resize(hid);
    for(int j=0;j<hid;++j) a1[j] = std::max(0.0, z1[j]); // ReLU
    z2 = b2[0];
    for(int j=0;j<hid;++j) z2 += W2[j]*a1[j];
    double yhat = sigmoid(z2);
    return yhat;
  }

  // SGD step on one sample with BCE loss and masking of pruned weights
  void sgd_step(const std::vector<double>& x, double y, double lr){
    std::vector<double> z1, a1; double z2;
    double yhat = forward(x, z1, a1, z2);
    // BCE gradient for sigmoid output: dz2 = yhat - y
    double dz2 = yhat - y;
    // Update b2 (we do not prune b2)
    b2[0] -= lr * dz2;
    // Update W2 with mask M2
    for(int j=0;j<hid;++j){
      double grad = dz2 * a1[j];
      grad *= M2[j]; // zero gradient for pruned weights
      W2[j] -= lr * grad;
      // Keep pruned weights exactly zero
      if(M2[j]==0.0) W2[j]=0.0;
    }
    // Backprop to hidden
    std::vector<double> da1(hid), dz1(hid);
    for(int j=0;j<hid;++j) da1[j] = W2[j]*dz2;
    for(int j=0;j<hid;++j) dz1[j] = da1[j] * d_relu(z1[j]);
    // Update biases b1 (not pruned)
    for(int j=0;j<hid;++j) b1[j] -= lr * dz1[j];
    // Update W1 with mask M1
    for(int j=0;j<hid;++j){
      for(int i=0;i<in;++i){
        double grad = dz1[j] * x[i];
        size_t idx = j*in + i;
        grad *= M1[idx];
        W1[idx] -= lr * grad;
        if(M1[idx]==0.0) W1[idx]=0.0;
      }
    }
  }

  // Accuracy on dataset with 0.5 threshold
  double accuracy(const std::vector<std::vector<double>>& X, const std::vector<int>& Y){
    int correct=0; std::vector<double> z1,a1; double z2;
    for(size_t n=0;n<X.size();++n){
      double p = forward(X[n], z1, a1, z2);
      int pred = p>=0.5 ? 1:0;
      if(pred==Y[n]) ++correct;
    }
    return double(correct)/X.size();
  }

  // Train for epochs (SGD over repeated small dataset)
  void train(const std::vector<std::vector<double>>& X,
             const std::vector<int>& Y,
             int epochs, double lr){
    for(int e=0;e<epochs;++e){
      for(size_t n=0;n<X.size();++n){ sgd_step(X[n], (double)Y[n], lr); }
    }
  }

  // Global magnitude pruning to reach target sparsity p in (0,1). Only W1, W2 are pruned.
  void prune_global(double p){
    std::vector<double> mags; mags.reserve(W1.size()+W2.size());
    for(size_t i=0;i<W1.size();++i) if(M1[i]>0.0) mags.push_back(std::abs(W1[i]));
    for(size_t i=0;i<W2.size();++i) if(M2[i]>0.0) mags.push_back(std::abs(W2[i]));
    if(mags.empty()) return;
    size_t k = (size_t)std::floor(p * mags.size());
    if(k==0) return;
    std::nth_element(mags.begin(), mags.begin()+k-1, mags.end());
    double tau = mags[k-1];
    // Apply pruning
    for(size_t i=0;i<W1.size();++i){
      if(M1[i]>0.0 && std::abs(W1[i])<=tau){ M1[i]=0.0; W1[i]=0.0; }
    }
    for(size_t i=0;i<W2.size();++i){
      if(M2[i]>0.0 && std::abs(W2[i])<=tau){ M2[i]=0.0; W2[i]=0.0; }
    }
  }

  // Rewind surviving parameters to initial values; keep zeros for pruned entries
  void rewind_to_init(){
    for(size_t i=0;i<W1.size();++i) W1[i] = M1[i]*W1_init[i];
    for(size_t i=0;i<b1.size();++i) b1[i] = b1_init[i];
    for(size_t i=0;i<W2.size();++i) W2[i] = M2[i]*W2_init[i];
    for(size_t i=0;i<b2.size();++i) b2[i] = b2_init[i];
  }

  double current_sparsity() const{
    size_t kept=0,total=0;
    for(size_t i=0;i<W1.size();++i){ if(M1[i]>0.0) ++kept; ++total; }
    for(size_t i=0;i<W2.size();++i){ if(M2[i]>0.0) ++kept; ++total; }
    return 1.0 - double(kept)/double(total);
  }
};

int main(){
  // XOR dataset
  std::vector<std::vector<double>> X = {{0,0},{0,1},{1,0},{1,1}};
  std::vector<int> Y = {0,1,1,0};
  // Repeat dataset 50 times to have more steps per epoch
  std::vector<std::vector<double>> Xrep; std::vector<int> Yrep;
  for(int r=0;r<50;++r){ for(size_t i=0;i<X.size();++i){ Xrep.push_back(X[i]); Yrep.push_back(Y[i]); } }

  MLP net(2, 8, 1, 123);

  // 1) Train dense
  net.train(Xrep, Yrep, 100, 0.1);
  double acc_dense = net.accuracy(X, Y);
  std::cout << "Dense accuracy: " << acc_dense << "\n";

  // 2) Prune globally to 80% sparsity among weights
  net.prune_global(0.8);
  std::cout << "Sparsity after pruning: " << net.current_sparsity() << "\n";

  // 3) Rewind surviving weights to initial values
  net.rewind_to_init();

  // 4) Retrain only the sparse subnetwork
  net.train(Xrep, Yrep, 100, 0.1);
  double acc_ticket = net.accuracy(X, Y);
  std::cout << "Sparse (ticket) accuracy: " << acc_ticket << "\n";

  return 0;
}

This program builds a tiny 2-layer MLP, trains it densely on XOR, then applies global unstructured magnitude pruning to reach a target sparsity. It rewinds the surviving weights to their initial values and retrains, as required by the Lottery Ticket protocol. Masks prevent pruned weights from changing. The final accuracy of the sparse ticket is compared to the dense baseline to see if a winning ticket emerges.

Time: Training is O(E N P), where P is the number of prunable weights (here roughly in*hid + hid*out). Finding the pruning threshold via nth_element is O(P), and applying masks is O(P).
Space: O(P) for parameters, plus O(P) for masks and optionally another O(P) for storing initial weights for rewinding.

Iterative Magnitude Pruning with Rewinding and Sparsity Schedule
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <cmath>
#include <numeric>

// Same MLP as before, with iterative pruning rounds and reporting
struct MLP {
  int in, hid, out;
  std::vector<double> W1, b1, W2, b2;
  std::vector<double> W1_init, b1_init, W2_init, b2_init;
  std::vector<double> M1, M2;
  std::mt19937 rng;

  MLP(int in_, int hid_, int out_, unsigned seed=7): in(in_), hid(hid_), out(out_),
      W1(in_*hid_), b1(hid_), W2(hid_*out_), b2(out_),
      W1_init(in_*hid_), b1_init(hid_), W2_init(hid_*out_), b2_init(out_),
      M1(in_*hid_, 1.0), M2(hid_*out_, 1.0), rng(seed) {
    std::normal_distribution<double> nd(0.0, 0.6);
    for(size_t i=0;i<W1.size();++i) W1[i]=W1_init[i]=nd(rng);
    for(size_t i=0;i<b1.size();++i) b1[i]=b1_init[i]=0.0;
    for(size_t i=0;i<W2.size();++i) W2[i]=W2_init[i]=nd(rng);
    for(size_t i=0;i<b2.size();++i) b2[i]=b2_init[i]=0.0;
  }

  static double sigmoid(double x){ return 1.0/(1.0+std::exp(-x)); }
  static double d_relu(double x){ return x>0?1.0:0.0; }

  double forward(const std::vector<double>& x,
                 std::vector<double>& z1,
                 std::vector<double>& a1,
                 double &z2) {
    z1.assign(hid,0.0);
    for(int j=0;j<hid;++j){
      double s=b1[j];
      for(int i=0;i<in;++i) s += W1[j*in+i]*x[i];
      z1[j]=s;
    }
    a1.resize(hid);
    for(int j=0;j<hid;++j) a1[j] = std::max(0.0, z1[j]);
    z2 = b2[0];
    for(int j=0;j<hid;++j) z2 += W2[j]*a1[j];
    return sigmoid(z2);
  }

  void sgd_step(const std::vector<double>& x, double y, double lr){
    std::vector<double> z1, a1; double z2;
    double yhat = forward(x, z1, a1, z2);
    double dz2 = yhat - y;
    b2[0] -= lr * dz2; // biases not pruned
    for(int j=0;j<hid;++j){
      size_t idx=j;
      double grad=dz2*a1[j];
      grad*=M2[idx];
      W2[idx]-=lr*grad;
      if(M2[idx]==0.0) W2[idx]=0.0;
    }
    std::vector<double> da1(hid), dz1(hid);
    for(int j=0;j<hid;++j) da1[j]=W2[j]*dz2;
    for(int j=0;j<hid;++j) dz1[j]=da1[j]*d_relu(z1[j]);
    for(int j=0;j<hid;++j) b1[j]-=lr*dz1[j];
    for(int j=0;j<hid;++j){
      for(int i=0;i<in;++i){
        size_t idx=j*in+i;
        double grad=dz1[j]*x[i];
        grad*=M1[idx];
        W1[idx]-=lr*grad;
        if(M1[idx]==0.0) W1[idx]=0.0;
      }
    }
  }

  void train(const std::vector<std::vector<double>>& X,
             const std::vector<int>& Y,
             int epochs, double lr){
    for(int e=0;e<epochs;++e){
      for(size_t n=0;n<X.size();++n) sgd_step(X[n], (double)Y[n], lr);
    }
  }

  double accuracy(const std::vector<std::vector<double>>& X, const std::vector<int>& Y){
    int c=0; std::vector<double> z1,a1; double z2;
    for(size_t n=0;n<X.size();++n){
      double p=forward(X[n], z1, a1, z2);
      c += (p>=0.5) == (Y[n]==1);
    }
    return double(c)/X.size();
  }

  // Prune an additional fraction p_add of currently kept weights (unstructured global)
  void prune_iterative(double p_add){
    std::vector<double> mags; mags.reserve(W1.size()+W2.size());
    for(size_t i=0;i<W1.size();++i) if(M1[i]>0.0) mags.push_back(std::abs(W1[i]));
    for(size_t i=0;i<W2.size();++i) if(M2[i]>0.0) mags.push_back(std::abs(W2[i]));
    size_t alive = mags.size();
    if(alive==0) return;
    size_t k = (size_t)std::floor(p_add * alive);
    if(k==0) return;
    std::nth_element(mags.begin(), mags.begin()+k-1, mags.end());
    double tau = mags[k-1];
    for(size_t i=0;i<W1.size();++i) if(M1[i]>0.0 && std::abs(W1[i])<=tau){ M1[i]=0.0; W1[i]=0.0; }
    for(size_t i=0;i<W2.size();++i) if(M2[i]>0.0 && std::abs(W2[i])<=tau){ M2[i]=0.0; W2[i]=0.0; }
  }

  void rewind_to_init(){
    for(size_t i=0;i<W1.size();++i) W1[i]=M1[i]*W1_init[i];
    for(size_t i=0;i<b1.size();++i) b1[i]=b1_init[i];
    for(size_t i=0;i<W2.size();++i) W2[i]=M2[i]*W2_init[i];
    for(size_t i=0;i<b2.size();++i) b2[i]=b2_init[i];
  }

  double sparsity() const{
    size_t kept=0,total=0;
    for(size_t i=0;i<W1.size();++i){ kept += (M1[i]>0.0); ++total; }
    for(size_t i=0;i<W2.size();++i){ kept += (M2[i]>0.0); ++total; }
    return 1.0 - double(kept)/double(total);
  }
};

int main(){
  // XOR data repeated
  std::vector<std::vector<double>> X = {{0,0},{0,1},{1,0},{1,1}};
  std::vector<int> Y = {0,1,1,0};
  std::vector<std::vector<double>> Xrep; std::vector<int> Yrep;
  for(int r=0;r<100;++r){ for(size_t i=0;i<X.size();++i){ Xrep.push_back(X[i]); Yrep.push_back(Y[i]); } }

  MLP net(2, 16, 1, 2024);

  // Baseline dense training
  net.train(Xrep, Yrep, 60, 0.1);
  std::cout << "Dense acc: " << net.accuracy(X, Y) << "\n";

  // Iterative pruning schedule: prune 20% of remaining weights each round
  double p_add = 0.2; int rounds = 4; // final sparsity ~ 1 - (0.8)^4 ~ 59%
  for(int r=0;r<rounds;++r){
    // Prune based on magnitudes after training
    net.prune_iterative(p_add);
    std::cout << "After prune round " << r+1 << ", sparsity = " << net.sparsity() << "\n";
    // Rewind and retrain the ticket
    net.rewind_to_init();
    net.train(Xrep, Yrep, 60, 0.1);
    std::cout << "  Ticket acc: " << net.accuracy(X, Y) << "\n";
  }

  return 0;
}

This program performs iterative magnitude pruning: after training, it prunes an additional fraction of currently alive weights each round, rewinds to the initial values for the survivors, and retrains. It reports sparsity and accuracy per round, illustrating how gradual pruning often preserves accuracy better than a single large prune and can surface a stable winning ticket.

Time: Each round adds O(E N P_alive) training, where P_alive shrinks with sparsity, plus O(P_alive) selection for the threshold. Over R rounds, the total is roughly O(E N P) for the first round plus decreasing costs thereafter.
Space: O(P) for parameters, O(P) for masks, and O(P) for storing the initial parameters used by rewinding.
Tags: lottery ticket hypothesis · magnitude pruning · sparsity · rewinding · winning ticket · unstructured pruning · structured pruning · binary mask · quantile selection · nth_element · overparameterization · SGD · binary cross-entropy · model compression · generalization