📚 Theory · Intermediate

Lottery Ticket Hypothesis

Key Points

  • The Lottery Ticket Hypothesis (LTH) says that inside a large dense neural network there exist small sparse subnetworks that, when trained in isolation from their original initialization, can reach accuracy comparable to the full model.
  • These subnetworks are found by training the dense model, pruning the smallest-magnitude weights, and resetting the remaining weights to their initial values before retraining.
  • A binary mask m selects which parameters stay; pruned weights are fixed to zero and never updated during training.
  • Unstructured pruning (individual weights) commonly reveals winning tickets at high sparsity, while structured pruning (entire neurons/filters) is more hardware-friendly but usually needs lower sparsity to work well.
  • Winning tickets tend to train faster in early epochs and can generalize as well as the dense model when found properly.
  • Rewinding to the exact initial weights is the classic LTH; variants rewind to an early training iteration to stabilize deeper networks.
  • Magnitude pruning can be implemented efficiently using selection (nth_element), making the search computationally practical.
  • Although LTH reduces inference cost through sparsity, the search still requires at least one dense training run (and often several pruning rounds).

Prerequisites

  • Gradient-based optimization (SGD) — Understanding how weights are updated and how loss gradients flow is essential to implement pruning and rewinding correctly.
  • Neural network fundamentals — Knowing layers, activations, and parameter counts helps reason about sparsity and its impact.
  • Overparameterization and regularization — Motivates why large models can contain effective sparse subnetworks and how pruning acts as implicit regularization.
  • Probability and random initialization — The hypothesis depends on the luck of the draw at initialization; reproducibility requires seed control.
  • Algorithmic selection and sorting — Efficient pruning uses quantile selection (nth_element) rather than full sorting to compute thresholds.

Detailed Explanation


01 Overview

The Lottery Ticket Hypothesis (LTH) is an idea about why large neural networks work so well. It claims that within an overparameterized dense network there exist sparse subnetworks—called winning tickets—that, when trained alone from their original initialization, can reach test accuracy comparable to the full network. This means many parameters are not strictly necessary for learning; they act like a safety net that makes optimization easier, while a much smaller subset is actually sufficient to learn the task.

Practically, a winning ticket is discovered by three steps: train the dense network, prune a fraction of small-magnitude weights to create a binary mask, and reset (rewind) the surviving weights to their original initial values before retraining. If the resulting sparse network matches the dense model’s accuracy with far fewer parameters, we have identified a winning ticket.

LTH provides a lens to understand overparameterization and optimization: large networks are easy to train because they hide many good sparse solutions within them. It also suggests a path to computational savings, especially at inference time, by deploying only the sparse subnetwork. However, it does not eliminate the upfront cost of the initial training needed to discover the mask, and the benefits depend on pruning strategy, sparsity level, and the problem domain.

02 Intuition & Analogies

Imagine you buy a huge roll of scratch-off lottery tickets. You don’t need all of them to win; you just need the right few. A large dense network is like that roll—full of many potential subnetworks with different initial weights. Training the large model is like scratching all the tickets just enough to see which ones look promising. Pruning removes the duds, leaving the tickets that can actually win. Resetting to the original numbers on those winning tickets and playing them again is like training the subnetwork alone from its original initialization.

Another analogy is sculpting from a block of marble. The full dense network is the block; pruning is the chisel removing unneeded pieces. The sculpture (the winning subnetwork) was always inside the block—you just had to remove the extra material to reveal it. Importantly, you don’t rebuild the sculpture from fresh stone; you return to the block’s original structure for the parts you keep (rewinding to initial weights), because those original shapes help the final piece form well during training.

Finally, think of a mixing console with hundreds of knobs (weights). The exact combination to produce a clean sound may only require a smaller set of knobs. Tuning all knobs first helps you learn which ones matter (training dense); then you tape over the irrelevant knobs (mask), reset the important ones to their starting positions (rewind), and re-tune just those to quickly achieve the same sound quality (retraining the sparse net).

03 Formal Definition

Consider a parameter vector w ∈ ℝ^P, initialized as w₀ by a random initializer and trained by SGD (or similar) with learning-rate schedule η(t) on dataset D. A binary mask m ∈ {0,1}^P defines a subnetwork by the elementwise product m ⊙ w. The Lottery Ticket Hypothesis states: there exists a mask m such that, when training only the parameters indexed by m from their initial values m ⊙ w₀ using the same optimizer and schedule, the resulting test accuracy and convergence speed are comparable to those of training the full dense network from w₀.

Operationally, a common way to find m is magnitude pruning: after training for some epochs (or to completion), compute a threshold τ so that a target sparsity is reached, and set m_i = 1(|w_i| > τ). The rewinding step replaces current weights with the original initialization on the kept indices: w ← m ⊙ w₀. Variants called weight rewinding reset to an early training checkpoint w_k (k small) rather than w₀, which can improve stability in deeper nets.

A subnetwork that matches accuracy is called a winning ticket; its sparsity is s = 1 − ∥m∥₀/P. Empirically, winning tickets often exhibit faster early training (lower loss in initial epochs) and generalization comparable to the dense model across image and NLP tasks at moderate to high sparsities, especially with unstructured pruning.
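The definition above can be condensed into two displayed equations (same symbols as the paragraph):

```latex
% Existence claim: some sparse mask, trained from initialization,
% matches the dense network's accuracy.
\exists\, m \in \{0,1\}^{P},\ \|m\|_{0} \ll P :\quad
\mathrm{acc}\!\left(\mathrm{Train}(m \odot w_{0})\right) \approx
\mathrm{acc}\!\left(\mathrm{Train}(w_{0})\right)

% Mask construction (global magnitude pruning) and resulting sparsity.
m_{i} = \mathbf{1}\!\left(|w_{i}| > \tau\right), \qquad
\tau = Q_{p}\!\left(|w|\right), \qquad
s = 1 - \frac{\|m\|_{0}}{P}
```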

04 When to Use

Use LTH-inspired pruning when you want to deploy lighter models without significantly sacrificing accuracy, especially when inference latency, memory, or energy are constrained (e.g., mobile apps, embedded devices). It is also useful for model exploration: by revealing which connections matter, pruning can improve interpretability and guide architecture design. In research settings, LTH provides hypotheses to test about overparameterization and optimization landscapes.

Practically, choose unstructured magnitude pruning to maximize sparsity and accuracy with minimal code changes, particularly when your hardware or library supports sparse tensors. Use structured pruning (whole neurons, channels, or attention heads) when you need speed on dense hardware like standard GPUs/CPUs without specialized sparse kernels. Apply iterative pruning (small percentages over multiple rounds) if one-shot pruning hurts accuracy, or when seeking very high sparsity. Consider rewinding to a small early-epoch checkpoint when training very deep networks, as initialization rewinding may be unstable there.
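As a sanity check on iterative schedules: pruning a fraction p of the surviving weights each round leaves (1 − p)^R of the weights after R rounds, so the overall sparsity is 1 − (1 − p)^R. A minimal sketch of this arithmetic (the 20%-per-round rate is only an example):

```cpp
#include <cmath>

// Overall sparsity after R rounds of pruning a fraction p of the
// currently surviving weights each round: s(R) = 1 - (1 - p)^R.
double iterative_sparsity(double p, int rounds) {
    return 1.0 - std::pow(1.0 - p, rounds);
}
```

For 20% per round, four rounds give 1 − 0.8⁴ ≈ 0.59, so reaching very high sparsity (say 95%+) requires many more rounds than a one-shot prune, which is exactly why iterative schedules are gentler on accuracy.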

Avoid LTH for tiny datasets or underparameterized models where pruning removes critical capacity. Also avoid when training cost is extremely high and you cannot afford the initial dense run, unless you can reuse a pre-trained model or perform pruning during fine-tuning.

⚠️ Common Mistakes

  • Pruning too aggressively in one shot: Removing a large fraction of weights at once can destroy signal paths. Prefer iterative pruning (e.g., 10–20% per round) to maintain performance.
  • Forgetting to rewind: Training the pruned network from its post-training weights is a different setting (fine-tuning a pruned model) and does not test the LTH claim. Proper LTH requires resetting kept weights to their original (or early-iteration) values before retraining.
  • Pruning biases and normalization parameters indiscriminately: Many implementations prune only kernel weights, not biases or BatchNorm scales, to avoid destabilizing training. Be explicit about what can be pruned.
  • Ignoring optimizer state: If you rewind weights but not optimizer state (e.g., momentum), you may inadvertently change the training dynamics. For strict LTH tests, also reset optimizer state.
  • Misinterpreting sparsity: Reporting fraction pruned without clarifying whether it’s per-layer, global, or structured can be misleading. Global magnitude pruning across all layers typically yields better accuracy at a given overall sparsity.
  • Expecting training-time savings automatically: LTH mainly saves inference cost. Unless your framework exploits sparsity during training, the training-time may not decrease proportionally.
  • Not controlling randomness: Different seeds can produce different tickets. For fair comparisons, fix seeds, learning rates, and data orders, and repeat over multiple seeds.

Key Formulas

Masked Network

f(x; m ⊙ w)

Explanation: The model’s output uses only parameters selected by the binary mask m. Pruned weights (where m_i = 0) have no effect on the forward pass.

Sparsity

s = 1 − ∥m∥₀ / P

Explanation: Sparsity s is the fraction of parameters that are zero. Here P is the total number of parameters and ∥m∥₀ counts how many are kept (nonzero).

Magnitude Pruning Rule

m_i = 1(|w_i| > τ),  τ = Q_p(|w|)

Explanation: Keep weights whose magnitude exceeds a threshold τ chosen as the p-th quantile of |w| to achieve a target sparsity. This is global unstructured pruning.
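A standalone sketch of this rule using selection instead of a full sort (the function names here are illustrative; the complete examples below embed the same logic inside the network struct):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Global unstructured magnitude threshold: tau is the k-th smallest |w|,
// with k = floor(p * P); weights with |w| <= tau are then pruned.
double magnitude_threshold(const std::vector<double>& w, double p) {
    std::vector<double> mags(w.size());
    for (size_t i = 0; i < w.size(); ++i) mags[i] = std::abs(w[i]);
    size_t k = static_cast<size_t>(std::floor(p * mags.size()));
    if (k == 0) return -1.0; // nothing to prune: every |w| exceeds a negative tau
    // nth_element partially partitions in expected O(P) -- no O(P log P) sort
    std::nth_element(mags.begin(), mags.begin() + (k - 1), mags.end());
    return mags[k - 1];
}

// Apply the rule m_i = 1(|w_i| > tau) as a 0/1 mask.
std::vector<double> magnitude_mask(const std::vector<double>& w, double tau) {
    std::vector<double> m(w.size());
    for (size_t i = 0; i < w.size(); ++i) m[i] = (std::abs(w[i]) > tau) ? 1.0 : 0.0;
    return m;
}
```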

Rewinding Step

w ← m ⊙ w₀  (rewind to init)

Explanation: After determining the mask using trained weights, set the surviving parameters back to their initial values before retraining the sparse network.

Training Cost Model

T_train ≈ E · N · C_fb(P)

Explanation: Total training time is approximately epochs E times number of samples N times cost per forward-backward pass, which depends on parameter count P.

Binary Cross-Entropy

BCE(y, ŷ) = −[y·log(ŷ) + (1−y)·log(1−ŷ)]

Explanation: Common loss for binary classification with a sigmoid output. Its gradient leads to the simple error term (ŷ − y) at the output layer.

Accuracy

acc = (1/N) · Σᵢ₌₁ᴺ 1(ŷᵢ = yᵢ)

Explanation: Measures the fraction of correct predictions on a dataset of size N. For binary problems, predictions are often thresholded at 0.5.

Quantile via Selection

τ = select_k(|w|),  k = ⌊p·P⌋

Explanation: The pruning threshold τ can be found by selecting the k-th smallest magnitude using an expected O(P) selection algorithm like nth_element, avoiding a full sort.

Complexity Analysis

Let P be the number of parameters, N the number of training samples, and E the number of epochs. A dense forward-backward pass for a fully connected network costs O(P) arithmetic up to constant factors that depend on layer shapes; thus total dense training is O(E N P). Discovering a lottery ticket typically runs at least one dense training phase to obtain weight magnitudes, so the search has this baseline cost.

Magnitude pruning requires computing the absolute values of all prunable parameters, which is O(P). If you sort to get the pruning threshold, it is O(P log P), but using selection (e.g., nth_element) reduces it to expected O(P); quickselect-style implementations have a super-linear worst case, though it is rarely hit in practice. Applying the mask and zeroing pruned weights is linear, O(P). Retraining the sparse subnetwork ideally reduces FLOPs proportionally to the number of active parameters, yielding O(E' N (1−s) P), where s is sparsity and E' is the retraining epochs (often similar to E). However, actual speedups depend on hardware and libraries: unstructured sparsity may not accelerate dense kernels, while structured sparsity (pruning neurons/channels) can yield wall-clock gains on standard hardware.

Memory usage during dense training is O(P) for parameters plus O(A) for activations (A scales with batch size and layer widths). Storing the mask adds O(P) memory. If you keep the initial weights for rewinding, you either store a second copy, O(P), or store the random seed and regeneration procedure (O(1) metadata). Overall, the computational bottleneck is the initial dense training; pruning and masking are linear-time overheads amortized over training.
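The seed-storage idea can be sketched in a few lines: a deterministic generator with a fixed seed reproduces the initial weights on demand, so only the (seed, size, distribution) metadata needs to be kept (the distribution parameters below are illustrative and must match whatever the original initializer used):

```cpp
#include <random>
#include <vector>

// Regenerate the initial weights from O(1) metadata instead of storing an
// O(P) copy: rerunning the same seeded generator yields identical draws.
std::vector<double> regenerate_init(unsigned seed, size_t n) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> nd(0.0, 0.5); // must match the original init
    std::vector<double> w0(n);
    for (size_t i = 0; i < n; ++i) w0[i] = nd(rng);
    return w0;
}
```

Rewinding then becomes w_i ← m_i · w0_i over a freshly regenerated w0, trading an O(P) stored copy for an O(P) regeneration pass at rewind time.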

Code Examples

One-shot Lottery Ticket on XOR: Train, Prune by Magnitude, Rewind, Retrain
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <cmath>
#include <numeric>

// Minimal 2-layer MLP for binary classification on XOR
// Demonstrates: dense training -> global unstructured magnitude pruning -> rewind to init -> retrain

struct MLP {
  int in, hid, out;
  // Parameters
  std::vector<double> W1, b1, W2, b2;
  // Initial parameters for rewinding
  std::vector<double> W1_init, b1_init, W2_init, b2_init;
  // Masks (1=keep, 0=prune). We only prune weights (W1, W2); biases kept by default
  std::vector<double> M1, M2; // same shapes as W1, W2
  std::mt19937 rng;

  MLP(int in_, int hid_, int out_, unsigned seed=42): in(in_), hid(hid_), out(out_),
      W1(in_*hid_), b1(hid_), W2(hid_*out_), b2(out_),
      W1_init(in_*hid_), b1_init(hid_), W2_init(hid_*out_), b2_init(out_),
      M1(in_*hid_, 1.0), M2(hid_*out_, 1.0), rng(seed) {
    std::normal_distribution<double> nd(0.0, 0.5); // Xavier-ish for tiny net
    for (size_t i=0;i<W1.size();++i) W1[i]=W1_init[i]=nd(rng);
    for (size_t i=0;i<b1.size();++i) b1[i]=b1_init[i]=0.0;
    for (size_t i=0;i<W2.size();++i) W2[i]=W2_init[i]=nd(rng);
    for (size_t i=0;i<b2.size();++i) b2[i]=b2_init[i]=0.0;
  }

  static double sigmoid(double x){ return 1.0/(1.0+std::exp(-x)); }
  static double d_relu(double x){ return x>0?1.0:0.0; }

  // Forward for a single sample
  double forward(const std::vector<double>& x,
                 std::vector<double>& z1,
                 std::vector<double>& a1,
                 double &z2) {
    z1.assign(hid,0.0);
    for(int j=0;j<hid;++j){
      double s=b1[j];
      for(int i=0;i<in;++i){ s += W1[j*in+i]*x[i]; }
      z1[j]=s;
    }
    a1.resize(hid);
    for(int j=0;j<hid;++j) a1[j] = std::max(0.0, z1[j]); // ReLU
    z2 = b2[0];
    for(int j=0;j<hid;++j) z2 += W2[j]*a1[j];
    double yhat = sigmoid(z2);
    return yhat;
  }

  // SGD step on one sample with BCE loss and masking of pruned weights
  void sgd_step(const std::vector<double>& x, double y, double lr){
    std::vector<double> z1, a1; double z2;
    double yhat = forward(x, z1, a1, z2);
    // BCE gradient for sigmoid output: dz2 = yhat - y
    double dz2 = yhat - y;
    // Update b2 (we do not prune b2)
    b2[0] -= lr * dz2;
    // Update W2 with mask M2
    for(int j=0;j<hid;++j){
      double grad = dz2 * a1[j];
      grad *= M2[j]; // zero gradient for pruned weights
      W2[j] -= lr * grad;
      // Keep pruned weights exactly zero
      if(M2[j]==0.0) W2[j]=0.0;
    }
    // Backprop to hidden
    std::vector<double> da1(hid), dz1(hid);
    for(int j=0;j<hid;++j) da1[j] = W2[j]*dz2;
    for(int j=0;j<hid;++j) dz1[j] = da1[j] * d_relu(z1[j]);
    // Update biases b1 (not pruned)
    for(int j=0;j<hid;++j) b1[j] -= lr * dz1[j];
    // Update W1 with mask M1
    for(int j=0;j<hid;++j){
      for(int i=0;i<in;++i){
        double grad = dz1[j] * x[i];
        size_t idx = j*in + i;
        grad *= M1[idx];
        W1[idx] -= lr * grad;
        if(M1[idx]==0.0) W1[idx]=0.0;
      }
    }
  }

  // Accuracy on dataset with 0.5 threshold
  double accuracy(const std::vector<std::vector<double>>& X, const std::vector<int>& Y){
    int correct=0; std::vector<double> z1,a1; double z2;
    for(size_t n=0;n<X.size();++n){
      double p = forward(X[n], z1, a1, z2);
      int pred = p>=0.5 ? 1:0;
      if(pred==Y[n]) ++correct;
    }
    return double(correct)/X.size();
  }

  // Train for epochs (SGD over repeated small dataset)
  void train(const std::vector<std::vector<double>>& X,
             const std::vector<int>& Y,
             int epochs, double lr){
    for(int e=0;e<epochs;++e){
      for(size_t n=0;n<X.size();++n){ sgd_step(X[n], (double)Y[n], lr); }
    }
  }

  // Global magnitude pruning to reach target sparsity p in (0,1). Only W1, W2 are pruned.
  void prune_global(double p){
    std::vector<double> mags; mags.reserve(W1.size()+W2.size());
    for(size_t i=0;i<W1.size();++i) if(M1[i]>0.0) mags.push_back(std::abs(W1[i]));
    for(size_t i=0;i<W2.size();++i) if(M2[i]>0.0) mags.push_back(std::abs(W2[i]));
    if(mags.empty()) return;
    size_t k = (size_t)std::floor(p * mags.size());
    if(k==0) return;
    std::nth_element(mags.begin(), mags.begin()+k-1, mags.end());
    double tau = mags[k-1];
    // Apply pruning
    for(size_t i=0;i<W1.size();++i){
      if(M1[i]>0.0 && std::abs(W1[i])<=tau){ M1[i]=0.0; W1[i]=0.0; }
    }
    for(size_t i=0;i<W2.size();++i){
      if(M2[i]>0.0 && std::abs(W2[i])<=tau){ M2[i]=0.0; W2[i]=0.0; }
    }
  }

  // Rewind surviving parameters to initial values; keep zeros for pruned entries
  void rewind_to_init(){
    for(size_t i=0;i<W1.size();++i) W1[i] = M1[i]*W1_init[i];
    for(size_t i=0;i<b1.size();++i) b1[i] = b1_init[i];
    for(size_t i=0;i<W2.size();++i) W2[i] = M2[i]*W2_init[i];
    for(size_t i=0;i<b2.size();++i) b2[i] = b2_init[i];
  }

  double current_sparsity() const{
    size_t kept=0,total=0;
    for(size_t i=0;i<W1.size();++i){ if(M1[i]>0.0) ++kept; ++total; }
    for(size_t i=0;i<W2.size();++i){ if(M2[i]>0.0) ++kept; ++total; }
    return 1.0 - double(kept)/double(total);
  }
};

int main(){
  // XOR dataset
  std::vector<std::vector<double>> X = {{0,0},{0,1},{1,0},{1,1}};
  std::vector<int> Y = {0,1,1,0};
  // Repeat dataset 50 times to have more steps per epoch
  std::vector<std::vector<double>> Xrep; std::vector<int> Yrep;
  for(int r=0;r<50;++r){ for(size_t i=0;i<X.size();++i){ Xrep.push_back(X[i]); Yrep.push_back(Y[i]); } }

  MLP net(2, 8, 1, 123);

  // 1) Train dense
  net.train(Xrep, Yrep, 100, 0.1);
  double acc_dense = net.accuracy(X, Y);
  std::cout << "Dense accuracy: " << acc_dense << "\n";

  // 2) Prune globally to 80% sparsity among weights
  net.prune_global(0.8);
  std::cout << "Sparsity after pruning: " << net.current_sparsity() << "\n";

  // 3) Rewind surviving weights to initial values
  net.rewind_to_init();

  // 4) Retrain only the sparse subnetwork
  net.train(Xrep, Yrep, 100, 0.1);
  double acc_ticket = net.accuracy(X, Y);
  std::cout << "Sparse (ticket) accuracy: " << acc_ticket << "\n";

  return 0;
}

This program builds a tiny 2-layer MLP, trains it densely on XOR, then applies global unstructured magnitude pruning to reach a target sparsity. It rewinds the surviving weights to their initial values and retrains, as required by the Lottery Ticket protocol. Masks prevent pruned weights from changing. The final accuracy of the sparse ticket is compared to the dense baseline to see if a winning ticket emerges.

Time: Training is O(E N P), where P is the number of prunable weights (here roughly in*hid + hid*out). Finding the pruning threshold via nth_element is O(P), and applying masks is O(P).
Space: O(P) for parameters, plus O(P) for masks and optionally another O(P) for storing initial weights for rewinding.

Iterative Magnitude Pruning with Rewinding and Sparsity Schedule
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>
#include <cmath>
#include <numeric>

// Same MLP as before, with iterative pruning rounds and reporting
struct MLP {
  int in, hid, out;
  std::vector<double> W1, b1, W2, b2;
  std::vector<double> W1_init, b1_init, W2_init, b2_init;
  std::vector<double> M1, M2;
  std::mt19937 rng;

  MLP(int in_, int hid_, int out_, unsigned seed=7): in(in_), hid(hid_), out(out_),
      W1(in_*hid_), b1(hid_), W2(hid_*out_), b2(out_),
      W1_init(in_*hid_), b1_init(hid_), W2_init(hid_*out_), b2_init(out_),
      M1(in_*hid_, 1.0), M2(hid_*out_, 1.0), rng(seed) {
    std::normal_distribution<double> nd(0.0, 0.6);
    for(size_t i=0;i<W1.size();++i) W1[i]=W1_init[i]=nd(rng);
    for(size_t i=0;i<b1.size();++i) b1[i]=b1_init[i]=0.0;
    for(size_t i=0;i<W2.size();++i) W2[i]=W2_init[i]=nd(rng);
    for(size_t i=0;i<b2.size();++i) b2[i]=b2_init[i]=0.0;
  }

  static double sigmoid(double x){ return 1.0/(1.0+std::exp(-x)); }
  static double d_relu(double x){ return x>0?1.0:0.0; }

  double forward(const std::vector<double>& x,
                 std::vector<double>& z1,
                 std::vector<double>& a1,
                 double &z2) {
    z1.assign(hid,0.0);
    for(int j=0;j<hid;++j){
      double s=b1[j];
      for(int i=0;i<in;++i) s += W1[j*in+i]*x[i];
      z1[j]=s;
    }
    a1.resize(hid);
    for(int j=0;j<hid;++j) a1[j] = std::max(0.0, z1[j]);
    z2 = b2[0];
    for(int j=0;j<hid;++j) z2 += W2[j]*a1[j];
    return sigmoid(z2);
  }

  void sgd_step(const std::vector<double>& x, double y, double lr){
    std::vector<double> z1, a1; double z2;
    double yhat = forward(x, z1, a1, z2);
    double dz2 = yhat - y;
    b2[0] -= lr * dz2; // biases not pruned
    for(int j=0;j<hid;++j){
      size_t idx=j;
      double grad=dz2*a1[j];
      grad*=M2[idx];
      W2[idx]-=lr*grad;
      if(M2[idx]==0.0) W2[idx]=0.0;
    }
    std::vector<double> da1(hid), dz1(hid);
    for(int j=0;j<hid;++j) da1[j]=W2[j]*dz2;
    for(int j=0;j<hid;++j) dz1[j]=da1[j]*d_relu(z1[j]);
    for(int j=0;j<hid;++j) b1[j]-=lr*dz1[j];
    for(int j=0;j<hid;++j){
      for(int i=0;i<in;++i){
        size_t idx=j*in+i;
        double grad=dz1[j]*x[i];
        grad*=M1[idx];
        W1[idx]-=lr*grad;
        if(M1[idx]==0.0) W1[idx]=0.0;
      }
    }
  }

  void train(const std::vector<std::vector<double>>& X,
             const std::vector<int>& Y,
             int epochs, double lr){
    for(int e=0;e<epochs;++e){
      for(size_t n=0;n<X.size();++n) sgd_step(X[n], (double)Y[n], lr);
    }
  }

  double accuracy(const std::vector<std::vector<double>>& X, const std::vector<int>& Y){
    int c=0; std::vector<double> z1,a1; double z2;
    for(size_t n=0;n<X.size();++n){
      double p=forward(X[n], z1, a1, z2);
      c += (p>=0.5) == (Y[n]==1);
    }
    return double(c)/X.size();
  }

  // Prune an additional fraction p_add of currently kept weights (unstructured global)
  void prune_iterative(double p_add){
    std::vector<double> mags; mags.reserve(W1.size()+W2.size());
    for(size_t i=0;i<W1.size();++i) if(M1[i]>0.0) mags.push_back(std::abs(W1[i]));
    for(size_t i=0;i<W2.size();++i) if(M2[i]>0.0) mags.push_back(std::abs(W2[i]));
    size_t alive = mags.size();
    if(alive==0) return;
    size_t k = (size_t)std::floor(p_add * alive);
    if(k==0) return;
    std::nth_element(mags.begin(), mags.begin()+k-1, mags.end());
    double tau = mags[k-1];
    for(size_t i=0;i<W1.size();++i) if(M1[i]>0.0 && std::abs(W1[i])<=tau){ M1[i]=0.0; W1[i]=0.0; }
    for(size_t i=0;i<W2.size();++i) if(M2[i]>0.0 && std::abs(W2[i])<=tau){ M2[i]=0.0; W2[i]=0.0; }
  }

  void rewind_to_init(){
    for(size_t i=0;i<W1.size();++i) W1[i]=M1[i]*W1_init[i];
    for(size_t i=0;i<b1.size();++i) b1[i]=b1_init[i];
    for(size_t i=0;i<W2.size();++i) W2[i]=M2[i]*W2_init[i];
    for(size_t i=0;i<b2.size();++i) b2[i]=b2_init[i];
  }

  double sparsity() const{
    size_t kept=0,total=0;
    for(size_t i=0;i<W1.size();++i){ kept += (M1[i]>0.0); ++total; }
    for(size_t i=0;i<W2.size();++i){ kept += (M2[i]>0.0); ++total; }
    return 1.0 - double(kept)/double(total);
  }
};

int main(){
  // XOR data repeated
  std::vector<std::vector<double>> X = {{0,0},{0,1},{1,0},{1,1}};
  std::vector<int> Y = {0,1,1,0};
  std::vector<std::vector<double>> Xrep; std::vector<int> Yrep;
  for(int r=0;r<100;++r){ for(size_t i=0;i<X.size();++i){ Xrep.push_back(X[i]); Yrep.push_back(Y[i]); } }

  MLP net(2, 16, 1, 2024);

  // Baseline dense training
  net.train(Xrep, Yrep, 60, 0.1);
  std::cout << "Dense acc: " << net.accuracy(X, Y) << "\n";

  // Iterative pruning schedule: prune 20% of remaining weights each round
  double p_add = 0.2; int rounds = 4; // final sparsity ~ 1 - (0.8)^4 ~ 59%
  for(int r=0;r<rounds;++r){
    // Prune based on magnitudes after training
    net.prune_iterative(p_add);
    std::cout << "After prune round " << r+1 << ", sparsity = " << net.sparsity() << "\n";
    // Rewind and retrain the ticket
    net.rewind_to_init();
    net.train(Xrep, Yrep, 60, 0.1);
    std::cout << "  Ticket acc: " << net.accuracy(X, Y) << "\n";
  }

  return 0;
}

This program performs iterative magnitude pruning: after training, it prunes an additional fraction of currently alive weights each round, rewinds to the initial values for the survivors, and retrains. It reports sparsity and accuracy per round, illustrating how gradual pruning often preserves accuracy better than a single large prune and can surface a stable winning ticket.

Time: Each round adds O(E N P_alive) training, where P_alive shrinks with sparsity, plus O(P_alive) selection for the threshold. Over R rounds, the total is roughly O(E N P) for the first round plus decreasing costs thereafter.
Space: O(P) for parameters, O(P) for masks, and O(P) for storing the initial parameters used by rewinding.
Tags: lottery ticket hypothesis · magnitude pruning · sparsity · rewinding · winning ticket · unstructured pruning · structured pruning · binary mask · quantile selection · nth_element · overparameterization · SGD · binary cross-entropy · model compression · generalization