Information Bottleneck
Key Points
- The Information Bottleneck (IB) principle formalizes the tradeoff between compressing an input X and preserving information about a target Y via the objective I(X;T) - \beta I(T;Y).
- Mutual information I(A;B) measures how much knowing A reduces uncertainty about B; the two mutual information terms in the objective quantify how aggressively we compress X and how much predictive power we keep for Y.
- For discrete variables, the Blahut–Arimoto (BA) algorithm efficiently finds soft clusters T that balance compression and prediction, using KL divergence as the distortion.
- IB is closely related to rate–distortion theory with a task-aware distortion d(x,t) = D_KL(p(y|x) || p(y|t)).
- In practice, variational methods (VIB) optimize a stochastic encoder q(z|x) with a KL regularizer that upper-bounds I(X;Z).
- The information plane plots I(X;T) versus I(T;Y) to visualize the tradeoff; optimal solutions trace out a concave frontier (the IB curve).
- Common pitfalls include misestimating mutual information, using too few clusters/latent units, and ignoring numerical stability in iterative updates.
- C++ implementations can tackle the discrete IB (BA iterations) and small variational models (VIB with reparameterization) for pedagogical experiments.
Prerequisites
- Probability distributions and expectations — IB and VIB manipulate joint, marginal, and conditional probabilities and require expectations over random variables.
- Entropy, mutual information, and KL divergence — Core quantities in the IB objective and BA updates are defined using entropy and KL.
- Optimization and Lagrange multipliers — IB’s tradeoff arises from a constrained optimization that connects to rate–distortion via Lagrangian duality.
- Markov chains and the Data Processing Inequality — The IB assumes the Markov chain Y ↔ X ↔ T; reasoning about information flow relies on the DPI.
- Numerical stability in probabilistic computation — Handling zeros, underflow, and normalization is essential in BA iterations and MI calculations.
- Gradient-based learning (for VIB) — VIB uses stochastic gradients and reparameterization to optimize the variational objective.
- Logistic regression and cross-entropy — The VIB classifier commonly optimizes cross-entropy using logistic or softmax outputs.
Detailed Explanation
01 Overview
The Information Bottleneck (IB) principle is a framework for learning compact yet predictive representations. Given an input random variable X and a task variable Y (such as labels), we seek an intermediate representation T that discards irrelevant details in X while preserving the aspects of X that are useful for predicting Y. IB turns this idea into an optimization problem: minimize the mutual information between X and T (which quantifies compression) while maximizing the mutual information between T and Y (which quantifies predictive content), i.e., minimize I(X;T) - \beta I(T;Y). A single tradeoff parameter \beta balances these two aims. At \beta = 0 the objective cares only about compression; as \beta increases, it increasingly insists on preserving information about Y. (Note that the VIB literature typically uses the opposite convention, placing \beta on the compression term.)
For discrete variables, IB yields a soft clustering of X into a limited number of codes T, where each cluster retains as much information about Y as possible. For continuous variables and high-dimensional data, exact mutual information is intractable, so variational approximations such as the Variational Information Bottleneck (VIB) are used. VIB introduces a stochastic encoder that maps X into a latent Z with an information-regularizing KL penalty. IB connects to classical rate–distortion theory by treating task-relevant distortion as the KL divergence between conditional label distributions. The principle underlies modern ideas in representation learning, generalization, and robust compression.
02 Intuition & Analogies
Imagine you are explaining a complex movie (X) to a friend who only cares about the plot twist (Y). You must summarize the movie (create T) without telling every scene. If you include too many details, your summary is long (low compression); if you omit too much, your friend won’t understand the twist (low prediction). The Information Bottleneck asks: what is the shortest, most focused summary that still preserves what’s needed to predict the twist?
Another analogy: packing a suitcase. You can’t bring everything (compression), but you must include what you’ll actually need (prediction). The tradeoff parameter \beta sets the strictness of the airline baggage limit: in the classic objective I(X;T) - \beta I(T;Y), a small \beta forces you to pack light, while a large \beta lets you keep more of what predicts Y. If you pack too light, you miss essentials; if you overpack, you carry unnecessary weight. IB formalizes this balance by measuring how much your summary T still tells you about Y compared to the original X.
In clustering terms, IB forms groups of inputs that are indistinguishable from Y’s perspective. Two inputs belong in the same cluster if they imply similar predictions about Y. The “distance” between inputs and clusters is not geometric but informational: how different their label distributions are. This is why the KL divergence between p(y|x) and p(y|t) plays the role of distortion. The IB algorithm iteratively refines clusters so that clusters become prototypical for predicting Y while using as few distinct clusters (or as little information about X) as possible.
03 Formal Definition
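The formal statement itself did not survive extraction; in the notation used throughout this page, the standard formulation is:

```latex
\begin{aligned}
&\min_{p(t|x)} \; I(X;T) - \beta\, I(T;Y), \qquad \beta \ge 0,\\[2pt]
&\text{subject to the Markov chain } Y \leftrightarrow X \leftrightarrow T,
\end{aligned}
```

and the optimal encoder satisfies the self-consistent equation

```latex
p(t|x) \;=\; \frac{p(t)}{Z(x,\beta)} \, \exp\!\Big(-\beta\, D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|t)\big)\Big),
```

where Z(x, \beta) normalizes over t. These are exactly the quantities iterated by the Blahut–Arimoto updates in the Key Formulas section.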
04 When to Use
- Task-aware compression: When you want to compress inputs while maintaining performance on a specific predictive task, IB provides a principled target.
- Clustering for prediction: If you need soft clusters that are maximally informative about labels, the discrete IB (via BA iterations) is appropriate.
- Representation learning: IB/VIB yields representations that discard nuisance factors, improving robustness and generalization under distribution shift and noise.
- Model selection and regularization: The \beta parameter acts like an information-based regularizer; tuning \beta trades model complexity for predictive sufficiency.
- Interpretable summaries: IB can extract small, human-readable codebooks T that summarize how inputs relate to outcomes.
- Communication and coding: In settings akin to source coding with side information, IB connects to rate–distortion and can guide system design when distortion is task-relevance rather than Euclidean error.
- Low-data regimes: By enforcing parsimony (small I(X;T)), IB can reduce overfitting when data is scarce, especially with variational bounds that are computationally tractable.
⚠️Common Mistakes
- Confusing compression with dimensionality: Lower-dimensional T does not guarantee lower I(X;T); stochastic encoders and noise often do the real compressing.
- Misestimating mutual information: Naive plug-in estimates are biased; prefer exact sums for discrete distributions or variational/nearest-neighbor estimators for continuous data.
- Ignoring the Markov constraint: IB theory assumes the Markov chain Y ↔ X ↔ T (T depends on Y only through X); learning pipelines that leak Y directly into T violate this assumption and distort the tradeoff.
- Numerical instability in BA updates: Failing to normalize p(t|x) properly or allowing zeros in probabilities leads to NaNs; always add small epsilons and renormalize.
- Over-regularization: excessive compression (a large \beta on the VIB KL term, or a small \beta in the classic objective) can destroy predictive content; monitor I(T;Y), accuracy, or validation loss when sweeping \beta.
- Using hard clusters prematurely: IB solutions are soft assignments; forcing hard clusters early can trap the algorithm in poor local optima.
- Comparing nats and bits: Mutual information depends on the log base; be consistent (natural log vs. log base 2) when reporting and interpreting values.
Key Formulas
Information Bottleneck Objective
L[p(t|x)] = I(X;T) - \beta I(T;Y), \beta >= 0
Explanation: This is the core tradeoff between compression (I(X;T)) and prediction (I(T;Y)), balanced by the nonnegative parameter \beta. Larger \beta favors preserving more about Y at the cost of less compression.
Mutual Information (Discrete)
I(A;B) = \sum_{a,b} p(a,b) \log [ p(a,b) / (p(a) p(b)) ]
Explanation: Mutual information sums over all pairs of outcomes and measures how much the joint deviates from the product of marginals. Zero MI means independence.
KL Divergence (Discrete)
D_KL(p || q) = \sum_i p(i) \log [ p(i) / q(i) ]
Explanation: KL divergence quantifies how well q approximates p, with 0 meaning identical distributions. In IB it acts as a task-informed distortion between conditional label distributions.
Task-Aware Distortion
d(x,t) = D_KL( p(y|x) || p(y|t) )
Explanation: Two inputs x that induce similar label distributions should map to the same t. The KL divergence captures this notion of predictive similarity.
BA Marginal and Decoder Updates
p(t) = \sum_x p(x) p(t|x),    p(y|t) = (1 / p(t)) \sum_x p(x) p(t|x) p(y|x)
Explanation: Given an encoder p(t|x), we update the cluster prior p(t) and the decoder p(y|t) by averaging conditional label distributions over inputs assigned to t.
BA Encoder Update
p(t|x) = [ p(t) / Z(x, \beta) ] \exp( -\beta D_KL( p(y|x) || p(y|t) ) )
Explanation: Inputs x prefer clusters t whose predicted label distribution p(y|t) is close (in KL) to p(y|x), weighted by the cluster size p(t). Z(x, \beta) normalizes across t.
MI via Encoder
I(X;T) = \sum_x p(x) \sum_t p(t|x) \log [ p(t|x) / p(t) ]
Explanation: When T is generated by p(t|x), mutual information simplifies to an expectation of the log-likelihood ratio p(t|x)/p(t). This is convenient in discrete IB.
VIB Objective
L_VIB = E_{x,y} [ E_{q(z|x)}[ -\log q(y|z) ] + \beta D_KL( q(z|x) || p(z) ) ]
Explanation: The variational relaxation replaces the intractable MI with a KL regularizer to a prior p(z). The first term is prediction loss; the second controls compression.
Gaussian KL
D_KL( N(\mu, diag(\sigma^2)) || N(0, I) ) = (1/2) \sum_{i=1}^{k} ( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 )
Explanation: For a k-dimensional Gaussian encoder with diagonal covariance, the KL term has a closed form, enabling efficient optimization in VIB.
Data Processing Inequality
I(T;Y) <= I(X;Y)
Explanation: Because Y ↔ X ↔ T is a Markov chain, processing X into T cannot create information about Y. IB solutions lie on or below this bound in the information plane.
Complexity Analysis
Each Blahut–Arimoto iteration visits every (x, t, y) combination, so one sweep costs O(|X| · |T| · |Y|) time, and k iterations cost O(k · |X| · |T| · |Y|); memory is O(|X| · |T| + |T| · |Y|) for the encoder and decoder tables. For the VIB example, each epoch is a single pass over the N samples with constant work per sample, i.e., O(N) per epoch.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Small epsilon to avoid log(0)
const double EPS = 1e-12;

// Normalize a vector to sum to 1
void normalize(vector<double>& v) {
    double s = 0.0; for (double x : v) s += x;
    if (s <= 0) { double u = 1.0 / v.size(); for (double &x : v) x = u; return; }
    for (double &x : v) x = max(x, EPS);
    s = accumulate(v.begin(), v.end(), 0.0);
    for (double &x : v) x /= s;
}

// KL divergence D_KL(p||q) for discrete distributions (both must be normalized)
double kl_div(const vector<double>& p, const vector<double>& q) {
    double d = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        double pi = max(p[i], EPS);
        double qi = max(q[i], EPS);
        d += pi * (log(pi) - log(qi)); // natural log => nats
    }
    return d;
}

// Compute mutual information I(X;T) given p(x) and p(t|x)
double mutual_info_XT(const vector<double>& pX, const vector<vector<double>>& pTgivenX) {
    size_t X = pX.size();
    size_t T = pTgivenX[0].size();
    vector<double> pT(T, 0.0);
    for (size_t x = 0; x < X; ++x)
        for (size_t t = 0; t < T; ++t) pT[t] += pX[x] * pTgivenX[x][t];
    double I = 0.0;
    for (size_t x = 0; x < X; ++x) {
        for (size_t t = 0; t < T; ++t) {
            double joint = pX[x] * pTgivenX[x][t];
            if (joint < EPS) continue;
            I += joint * (log(pTgivenX[x][t]) - log(max(pT[t], EPS)));
        }
    }
    return I; // in nats
}

// Compute mutual information I(T;Y) given p(t) and p(y|t)
double mutual_info_TY(const vector<double>& pT, const vector<vector<double>>& pYgivenT) {
    size_t T = pT.size();
    size_t Y = pYgivenT[0].size();
    vector<double> pY(Y, 0.0);
    for (size_t t = 0; t < T; ++t)
        for (size_t y = 0; y < Y; ++y) pY[y] += pT[t] * pYgivenT[t][y];
    double I = 0.0;
    for (size_t t = 0; t < T; ++t) {
        for (size_t y = 0; y < Y; ++y) {
            double joint = pT[t] * pYgivenT[t][y];
            if (joint < EPS) continue;
            I += joint * (log(pYgivenT[t][y]) - log(max(pY[y], EPS)));
        }
    }
    return I; // in nats
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: small discrete IB problem
    // X has 4 values, Y has 2 values (binary labels), T has K clusters
    size_t X = 4, Y = 2, T = 2;  // try T=2 clusters
    double beta = 5.0;           // tradeoff parameter (higher => more predictive)
    int maxIters = 200;

    // Define p(x)
    vector<double> pX = {0.25, 0.25, 0.25, 0.25};

    // Define p(y|x): rows sum to 1
    // Suppose x=0,1 imply mostly y=0; x=2,3 imply mostly y=1
    vector<vector<double>> pYgivenX = {
        {0.9, 0.1}, // x=0
        {0.8, 0.2}, // x=1
        {0.2, 0.8}, // x=2
        {0.1, 0.9}  // x=3
    };

    // Initialize p(t|x) randomly and normalize rows
    mt19937 rng(42);
    uniform_real_distribution<double> uni(0.0, 1.0);
    vector<vector<double>> pTgivenX(X, vector<double>(T, 0.0));
    for (size_t x = 0; x < X; ++x) {
        for (size_t t = 0; t < T; ++t) pTgivenX[x][t] = uni(rng) + EPS;
        normalize(pTgivenX[x]);
    }

    // Allocate p(t) and p(y|t)
    vector<double> pT(T, 0.0);
    vector<vector<double>> pYgivenT(T, vector<double>(Y, 0.0));

    for (int it = 0; it < maxIters; ++it) {
        // Update p(t)
        fill(pT.begin(), pT.end(), 0.0);
        for (size_t x = 0; x < X; ++x)
            for (size_t t = 0; t < T; ++t)
                pT[t] += pX[x] * pTgivenX[x][t];
        for (size_t t = 0; t < T; ++t) pT[t] = max(pT[t], EPS);

        // Update p(y|t)
        for (size_t t = 0; t < T; ++t) fill(pYgivenT[t].begin(), pYgivenT[t].end(), 0.0);
        for (size_t t = 0; t < T; ++t) {
            for (size_t x = 0; x < X; ++x) {
                double w = pX[x] * pTgivenX[x][t];
                for (size_t y = 0; y < Y; ++y) pYgivenT[t][y] += w * pYgivenX[x][y];
            }
            // Normalize by p(t)
            for (size_t y = 0; y < Y; ++y) pYgivenT[t][y] /= pT[t];
            normalize(pYgivenT[t]);
        }

        // Update p(t|x)
        double maxChange = 0.0;
        for (size_t x = 0; x < X; ++x) {
            vector<double> score(T, 0.0);
            for (size_t t = 0; t < T; ++t) {
                double d = kl_div(pYgivenX[x], pYgivenT[t]);
                score[t] = log(max(pT[t], EPS)) - beta * d; // work in log-space
            }
            // Subtract the max score before exponentiating to avoid underflow at large beta
            double m = *max_element(score.begin(), score.end());
            vector<double> newRow(T, 0.0);
            for (size_t t = 0; t < T; ++t) newRow[t] = exp(score[t] - m);
            normalize(newRow);
            for (size_t t = 0; t < T; ++t) maxChange = max(maxChange, fabs(newRow[t] - pTgivenX[x][t]));
            pTgivenX[x] = newRow;
        }

        if (it % 20 == 0 || it == maxIters - 1) {
            double IXT = mutual_info_XT(pX, pTgivenX);
            double ITY = mutual_info_TY(pT, pYgivenT);
            cout << "Iter " << it << ": I(X;T)=" << IXT << " nats, I(T;Y)=" << ITY << " nats\n";
        }

        if (maxChange < 1e-8) break; // converged
    }

    // Print resulting clusters p(y|t)
    cout << fixed << setprecision(4);
    for (size_t t = 0; t < T; ++t) {
        cout << "Cluster t=" << t << ", p(t)=" << pT[t]
             << ", p(y|t)=[" << pYgivenT[t][0] << ", " << pYgivenT[t][1] << "]\n";
    }

    // Print soft assignments p(t|x)
    for (size_t x = 0; x < X; ++x) {
        cout << "x=" << x << " -> ";
        for (size_t t = 0; t < T; ++t)
            cout << "p(t=" << t << "|x)=" << pTgivenX[x][t] << (t + 1 == T ? "\n" : " ");
    }

    return 0;
}
This program solves the discrete Information Bottleneck using Blahut–Arimoto iterations. It initializes a soft encoder p(t|x), alternates updates for the cluster prior p(t), decoder p(y|t), and encoder p(t|x) with the KL-based distortion, and reports I(X;T) and I(T;Y) during optimization. The example clusters four inputs into two task-aware clusters based on their binary label distributions. All computations use natural logs (nats).
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937 gen; normal_distribution<double> norm;
    RNG(unsigned seed=42): gen(seed), norm(0.0, 1.0) {}
    double gauss() { return norm(gen); }
};

// Sigmoid helper
inline double sigmoid(double x){ return 1.0 / (1.0 + exp(-x)); }

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic data: x ~ N(0,1), y = 1[x + noise > 0], noise ~ N(0, 0.5^2)
    RNG rng(123);
    const int N = 2000;
    vector<double> X(N); vector<int> Y(N);
    for(int i = 0; i < N; ++i){
        double x = rng.gauss();
        double noise = 0.5 * rng.gauss();
        int y = (x + noise > 0.0) ? 1 : 0;
        X[i] = x; Y[i] = y;
    }

    // Model parameters: 1D encoder q(z|x)=N(mu(x), sigma^2(x)), with mu=wmu*x+bmu, logvar=l=wlv*x+blv
    // Classifier: p(y=1|z)=sigmoid(wc*z+bc)
    double wmu=0.1, bmu=0.0, wlv=-1.0, blv=-1.0; // start with small variance (logvar ~ -1)
    double wc=0.1, bc=0.0;

    const double beta = 1.0;  // bottleneck strength
    const double lr = 1e-2;   // learning rate
    const int epochs = 200;
    const int B = 100;        // mini-batch size

    vector<int> idx(N); iota(idx.begin(), idx.end(), 0);

    for(int ep = 0; ep < epochs; ++ep){
        // Shuffle data
        shuffle(idx.begin(), idx.end(), rng.gen);

        double epoch_loss=0.0, epoch_acc=0.0, epoch_kl=0.0;

        for(int s = 0; s < N; s += B){
            int bEnd = min(N, s + B);
            // Accumulate gradients
            double gwmu=0, gbmu=0, gwlv=0, gblv=0, gwc=0, gbc=0;
            double batch_loss=0.0, batch_acc=0.0, batch_kl=0.0;

            for(int k = s; k < bEnd; ++k){
                int i = idx[k];
                double x = X[i]; int y = Y[i];

                // Encoder parameters
                double mu = wmu*x + bmu;
                double l = wlv*x + blv; // log-variance
                double sigma2 = exp(l);
                double sigma = sqrt(max(sigma2, 1e-12));

                // Reparameterization
                double eps = rng.gauss();
                double z = mu + sigma * eps;

                // Classifier
                double logit = wc*z + bc;
                double p1 = sigmoid(logit);

                // Binary cross-entropy: -[ y*log p1 + (1-y)*log(1-p1) ]
                double ce = -( y*log(max(p1, 1e-12)) + (1-y)*log(max(1.0-p1, 1e-12)) );

                // KL(q(z|x)||N(0,1)) for 1D Gaussian: 0.5*(mu^2 + sigma^2 - log sigma^2 - 1)
                double kl = 0.5*(mu*mu + sigma2 - l - 1.0);

                double loss = ce + beta * kl;
                batch_loss += loss;
                batch_kl += kl;
                batch_acc += ( (p1 > 0.5) == (y == 1) ) ? 1.0 : 0.0;

                // Gradients
                double dL_dlogit = p1 - y; // derivative of CE wrt logit
                // Classifier grads
                gwc += dL_dlogit * z;
                gbc += dL_dlogit;
                // Through z
                double dL_dz = dL_dlogit * wc;
                // z = mu + sigma*eps
                double dL_dmu = dL_dz + beta * mu; // CE part via z + KL part
                double dL_dsigma = dL_dz * eps;    // CE part via z (dz/dsigma=eps)
                // For KL, using l=log sigma^2 => kl = 0.5*(mu^2 + exp(l) - l - 1)
                double dKL_dl = 0.5*(exp(l) - 1.0);
                // sigma = exp(0.5*l) => dsigma/dl = 0.5*sigma
                double dCE_dl = dL_dsigma * 0.5 * sigma; // only CE flows via sigma
                double dL_dl = dCE_dl + beta * dKL_dl;

                // Back to affine params
                gwmu += dL_dmu * x;
                gbmu += dL_dmu;
                gwlv += dL_dl * x;
                gblv += dL_dl;
            }

            int bs = bEnd - s;
            // Average gradients
            gwmu /= bs; gbmu /= bs; gwlv /= bs; gblv /= bs; gwc /= bs; gbc /= bs;
            batch_loss /= bs; batch_acc /= bs; batch_kl /= bs;

            // SGD updates
            wmu -= lr * gwmu; bmu -= lr * gbmu;
            wlv -= lr * gwlv; blv -= lr * gblv;
            wc  -= lr * gwc;  bc  -= lr * gbc;

            epoch_loss += batch_loss * bs; epoch_acc += batch_acc * bs; epoch_kl += batch_kl * bs;
        }

        epoch_loss /= N; epoch_acc /= N; epoch_kl /= N;
        if (ep % 20 == 0 || ep == epochs - 1) {
            // epoch_kl approximates the I(X;Z) upper bound (in nats)
            cout << "Epoch " << ep
                 << ": loss=" << epoch_loss
                 << ", acc=" << epoch_acc
                 << ", E[KL]~I(X;Z)=" << epoch_kl << " nats\n";
        }
    }

    // Report learned parameters
    cout << fixed << setprecision(4);
    cout << "wmu=" << wmu << ", bmu=" << bmu << ", wlv=" << wlv << ", blv=" << blv << "\n";
    cout << "wc=" << wc << ", bc=" << bc << "\n";

    return 0;
}
This program implements a minimal Variational Information Bottleneck for a 1D input. The encoder produces a Gaussian latent z with mean and log-variance linear in x; the classifier predicts y from z via logistic regression. The loss is cross-entropy plus β times the Gaussian KL to a standard normal prior, which upper-bounds I(X;Z). Using the reparameterization trick enables unbiased, low-variance gradients in plain C++. The printed average KL provides a proxy for compression, and accuracy reflects predictive power—tuning β traces a tradeoff.