Information Bottleneck
Key Points
- The Information Bottleneck (IB) principle formalizes the tradeoff between compressing an input X and preserving information about a target Y via the objective I(X;T) - \beta I(T;Y).
- Mutual information I(A;B) measures how much knowing A reduces uncertainty about B; the two mutual information terms in the objective quantify how aggressively we compress X and how much predictive power we keep for Y.
- For discrete variables, the Blahut–Arimoto (BA) algorithm efficiently finds soft clusters T that balance compression and prediction, using KL divergence as the distortion.
- IB is closely related to rate–distortion theory with a task-aware distortion d(x,t) = D_KL(p(y|x) || p(y|t)).
- In practice, variational methods (VIB) optimize a stochastic encoder q(z|x) with a KL regularizer that upper-bounds I(X;Z).
- The information plane plots I(X;T) versus I(T;Y) to visualize the tradeoff; optimal solutions trace out a concave frontier (the IB curve).
- Common pitfalls include misestimating mutual information, using too few clusters/latent units, and ignoring numerical stability in iterative updates.
- C++ implementations can tackle the discrete IB (BA iterations) and small variational models (VIB with reparameterization) for pedagogical experiments.
Prerequisites
- Probability distributions and expectations — IB and VIB manipulate joint, marginal, and conditional probabilities and require expectations over random variables.
- Entropy, mutual information, and KL divergence — Core quantities in the IB objective and BA updates are defined using entropy and KL.
- Optimization and Lagrange multipliers — IB’s tradeoff arises from a constrained optimization that connects to rate–distortion via Lagrangian duality.
- Markov chains and the Data Processing Inequality — The IB assumes the Markov chain Y ↔ X ↔ T; reasoning about information flow relies on the DPI.
- Numerical stability in probabilistic computation — Handling zeros, underflow, and normalization is essential in BA iterations and MI calculations.
- Gradient-based learning (for VIB) — VIB uses stochastic gradients and reparameterization to optimize the variational objective.
- Logistic regression and cross-entropy — The VIB classifier commonly optimizes cross-entropy using logistic or softmax outputs.
Detailed Explanation
01 Overview
The Information Bottleneck (IB) principle is a framework for learning compact yet predictive representations. Given an input random variable X and a task variable Y (such as labels), we seek an intermediate representation T that discards irrelevant details in X while preserving the aspects of X that are useful for predicting Y. IB turns this idea into an optimization problem: minimize the mutual information between X and T (which quantifies compression) while maximizing the mutual information between T and Y (which quantifies predictive content), i.e., minimize I(X;T) - \beta I(T;Y). A single tradeoff parameter \beta balances these two aims. At \beta = 0 the objective cares only about compression; as \beta increases, it increasingly insists on preserving information about Y. (Note that the VIB literature typically uses the opposite convention, placing \beta on the compression term.)
For discrete variables, IB yields a soft clustering of X into a limited number of codes T, where each cluster retains as much information about Y as possible. For continuous variables and high-dimensional data, exact mutual information is intractable, so variational approximations such as the Variational Information Bottleneck (VIB) are used. VIB introduces a stochastic encoder that maps X into a latent Z with an information-regularizing KL penalty. IB connects to classical rate–distortion theory by treating task-relevant distortion as the KL divergence between conditional label distributions. The principle underlies modern ideas in representation learning, generalization, and robust compression.
02 Intuition & Analogies
Imagine you are explaining a complex movie (X) to a friend who only cares about the plot twist (Y). You must summarize the movie (create T) without telling every scene. If you include too many details, your summary is long (low compression); if you omit too much, your friend won’t understand the twist (low prediction). The Information Bottleneck asks: what is the shortest, most focused summary that still preserves what’s needed to predict the twist?
Another analogy: packing a suitcase. You can’t bring everything (compression), but you must include what you’ll actually need (prediction). The tradeoff parameter \beta sets the strictness of the airline baggage limit: in the classic objective I(X;T) - \beta I(T;Y), a small \beta forces you to pack light, while a large \beta lets you keep more of what predicts Y. If you pack too light, you miss essentials; if you overpack, you carry unnecessary weight. IB formalizes this balance by measuring how much your summary T still tells you about Y compared to the original X.
In clustering terms, IB forms groups of inputs that are indistinguishable from Y’s perspective. Two inputs belong in the same cluster if they imply similar predictions about Y. The “distance” between inputs and clusters is not geometric but informational: how different their label distributions are. This is why the KL divergence between p(y|x) and p(y|t) plays the role of distortion. The IB algorithm iteratively refines clusters so that clusters become prototypical for predicting Y while using as few distinct clusters (or as little information about X) as possible.
03 Formal Definition
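The formal statement itself did not survive extraction; in the notation used throughout this page, the standard formulation is:

```latex
\begin{aligned}
&\min_{p(t|x)} \; I(X;T) - \beta\, I(T;Y), \qquad \beta \ge 0,\\[2pt]
&\text{subject to the Markov chain } Y \leftrightarrow X \leftrightarrow T,
\end{aligned}
```

and the optimal encoder satisfies the self-consistent equation

```latex
p(t|x) \;=\; \frac{p(t)}{Z(x,\beta)} \, \exp\!\Big(-\beta\, D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|t)\big)\Big),
```

where Z(x, \beta) normalizes over t. These are exactly the quantities iterated by the Blahut–Arimoto updates in the Key Formulas section.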
04 When to Use
- Task-aware compression: When you want to compress inputs while maintaining performance on a specific predictive task, IB provides a principled target.
- Clustering for prediction: If you need soft clusters that are maximally informative about labels, the discrete IB (via BA iterations) is appropriate.
- Representation learning: IB/VIB yields representations that discard nuisance factors, improving robustness and generalization under distribution shift and noise.
- Model selection and regularization: The \beta parameter acts like an information-based regularizer; tuning \beta trades model complexity for predictive sufficiency.
- Interpretable summaries: IB can extract small, human-readable codebooks T that summarize how inputs relate to outcomes.
- Communication and coding: In settings akin to source coding with side information, IB connects to rate–distortion and can guide system design when distortion is task-relevance rather than Euclidean error.
- Low-data regimes: By enforcing parsimony (small I(X;T)), IB can reduce overfitting when data is scarce, especially with variational bounds that are computationally tractable.
⚠️Common Mistakes
- Confusing compression with dimensionality: Lower-dimensional T does not guarantee lower I(X;T); stochastic encoders and noise often do the real compressing.
- Misestimating mutual information: Naive plug-in estimates are biased; prefer exact sums for discrete distributions or variational/nearest-neighbor estimators for continuous data.
- Ignoring the Markov constraint: IB theory assumes the Markov chain Y ↔ X ↔ T (T depends on Y only through X); learning pipelines that leak Y directly into T violate this assumption and distort the tradeoff.
- Numerical instability in BA updates: Failing to normalize p(t|x) properly or allowing zeros in probabilities leads to NaNs; always add small epsilons and renormalize.
- Over-regularization: excessive compression (a large \beta on the VIB KL term, or a small \beta in the classic objective) can destroy predictive content; monitor I(T;Y), accuracy, or validation loss when sweeping \beta.
- Using hard clusters prematurely: IB solutions are soft assignments; forcing hard clusters early can trap the algorithm in poor local optima.
- Comparing nats and bits: Mutual information depends on the log base; be consistent (natural log vs. log base 2) when reporting and interpreting values.
Key Formulas
Information Bottleneck Objective
L[p(t|x)] = I(X;T) - \beta I(T;Y), \beta >= 0
Explanation: This is the core tradeoff between compression (I(X;T)) and prediction (I(T;Y)), balanced by the nonnegative parameter \beta. Larger \beta favors preserving more about Y at the cost of less compression.
Mutual Information (Discrete)
I(A;B) = \sum_{a,b} p(a,b) \log [ p(a,b) / (p(a) p(b)) ]
Explanation: Mutual information sums over all pairs of outcomes and measures how much the joint deviates from the product of marginals. Zero MI means independence.
KL Divergence (Discrete)
D_KL(p || q) = \sum_i p(i) \log [ p(i) / q(i) ]
Explanation: KL divergence quantifies how well q approximates p, with 0 meaning identical distributions. In IB it acts as a task-informed distortion between conditional label distributions.
Task-Aware Distortion
d(x,t) = D_KL( p(y|x) || p(y|t) )
Explanation: Two inputs x that induce similar label distributions should map to the same t. The KL divergence captures this notion of predictive similarity.
BA Marginal and Decoder Updates
p(t) = \sum_x p(x) p(t|x),    p(y|t) = (1 / p(t)) \sum_x p(x) p(t|x) p(y|x)
Explanation: Given an encoder p(t|x), we update the cluster prior p(t) and the decoder p(y|t) by averaging conditional label distributions over inputs assigned to t.
BA Encoder Update
p(t|x) = [ p(t) / Z(x, \beta) ] \exp( -\beta D_KL( p(y|x) || p(y|t) ) )
Explanation: Inputs x prefer clusters t whose predicted label distribution p(y|t) is close (in KL) to p(y|x), weighted by the cluster size p(t). Z(x, \beta) normalizes across t.
MI via Encoder
I(X;T) = \sum_x p(x) \sum_t p(t|x) \log [ p(t|x) / p(t) ]
Explanation: When T is generated by p(t|x), mutual information simplifies to an expectation of the log-likelihood ratio p(t|x)/p(t). This is convenient in discrete IB.
VIB Objective
L_VIB = E_{x,y} [ E_{q(z|x)}[ -\log q(y|z) ] + \beta D_KL( q(z|x) || p(z) ) ]
Explanation: The variational relaxation replaces the intractable MI with a KL regularizer to a prior p(z). The first term is prediction loss; the second controls compression.
Gaussian KL
D_KL( N(\mu, diag(\sigma^2)) || N(0, I) ) = (1/2) \sum_{i=1}^{k} ( \mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1 )
Explanation: For a k-dimensional Gaussian encoder with diagonal covariance, the KL term has a closed form, enabling efficient optimization in VIB.
Data Processing Inequality
I(T;Y) <= I(X;Y)
Explanation: Because Y ↔ X ↔ T is a Markov chain, processing X into T cannot create information about Y. IB solutions lie on or below this bound in the information plane.
Complexity Analysis
Each Blahut–Arimoto iteration visits every (x, t, y) combination, so one sweep costs O(|X| · |T| · |Y|) time, and k iterations cost O(k · |X| · |T| · |Y|); memory is O(|X| · |T| + |T| · |Y|) for the encoder and decoder tables. For the VIB example, each epoch is a single pass over the N samples with constant work per sample, i.e., O(N) per epoch.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Small epsilon to avoid log(0)
const double EPS = 1e-12;

// Normalize a vector to sum to 1
void normalize(vector<double>& v) {
    double s = 0.0; for (double x : v) s += x;
    if (s <= 0) { double u = 1.0 / v.size(); for (double &x : v) x = u; return; }
    for (double &x : v) x = max(x, EPS);
    s = accumulate(v.begin(), v.end(), 0.0);
    for (double &x : v) x /= s;
}

// KL divergence D_KL(p||q) for discrete distributions (both must be normalized)
double kl_div(const vector<double>& p, const vector<double>& q) {
    double d = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        double pi = max(p[i], EPS);
        double qi = max(q[i], EPS);
        d += pi * (log(pi) - log(qi)); // natural log => nats
    }
    return d;
}

// Compute mutual information I(X;T) given p(x) and p(t|x)
double mutual_info_XT(const vector<double>& pX, const vector<vector<double>>& pTgivenX) {
    size_t X = pX.size();
    size_t T = pTgivenX[0].size();
    vector<double> pT(T, 0.0);
    for (size_t x = 0; x < X; ++x)
        for (size_t t = 0; t < T; ++t) pT[t] += pX[x] * pTgivenX[x][t];
    double I = 0.0;
    for (size_t x = 0; x < X; ++x) {
        for (size_t t = 0; t < T; ++t) {
            double joint = pX[x] * pTgivenX[x][t];
            if (joint < EPS) continue;
            I += joint * (log(pTgivenX[x][t]) - log(max(pT[t], EPS)));
        }
    }
    return I; // in nats
}

// Compute mutual information I(T;Y) given p(t) and p(y|t)
double mutual_info_TY(const vector<double>& pT, const vector<vector<double>>& pYgivenT) {
    size_t T = pT.size();
    size_t Y = pYgivenT[0].size();
    vector<double> pY(Y, 0.0);
    for (size_t t = 0; t < T; ++t)
        for (size_t y = 0; y < Y; ++y) pY[y] += pT[t] * pYgivenT[t][y];
    double I = 0.0;
    for (size_t t = 0; t < T; ++t) {
        for (size_t y = 0; y < Y; ++y) {
            double joint = pT[t] * pYgivenT[t][y];
            if (joint < EPS) continue;
            I += joint * (log(pYgivenT[t][y]) - log(max(pY[y], EPS)));
        }
    }
    return I; // in nats
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example: small discrete IB problem
    // X has 4 values, Y has 2 values (binary labels), T has K clusters
    size_t X = 4, Y = 2, T = 2;  // try T=2 clusters
    double beta = 5.0;           // tradeoff parameter (higher => more predictive)
    int maxIters = 200;

    // Define p(x)
    vector<double> pX = {0.25, 0.25, 0.25, 0.25};

    // Define p(y|x): rows sum to 1
    // Suppose x=0,1 imply mostly y=0; x=2,3 imply mostly y=1
    vector<vector<double>> pYgivenX = {
        {0.9, 0.1}, // x=0
        {0.8, 0.2}, // x=1
        {0.2, 0.8}, // x=2
        {0.1, 0.9}  // x=3
    };

    // Initialize p(t|x) randomly and normalize rows
    mt19937 rng(42);
    uniform_real_distribution<double> uni(0.0, 1.0);
    vector<vector<double>> pTgivenX(X, vector<double>(T, 0.0));
    for (size_t x = 0; x < X; ++x) {
        for (size_t t = 0; t < T; ++t) pTgivenX[x][t] = uni(rng) + EPS;
        normalize(pTgivenX[x]);
    }

    // Allocate p(t) and p(y|t)
    vector<double> pT(T, 0.0);
    vector<vector<double>> pYgivenT(T, vector<double>(Y, 0.0));

    for (int it = 0; it < maxIters; ++it) {
        // Update p(t)
        fill(pT.begin(), pT.end(), 0.0);
        for (size_t x = 0; x < X; ++x)
            for (size_t t = 0; t < T; ++t)
                pT[t] += pX[x] * pTgivenX[x][t];
        for (size_t t = 0; t < T; ++t) pT[t] = max(pT[t], EPS);

        // Update p(y|t)
        for (size_t t = 0; t < T; ++t) fill(pYgivenT[t].begin(), pYgivenT[t].end(), 0.0);
        for (size_t t = 0; t < T; ++t) {
            for (size_t x = 0; x < X; ++x) {
                double w = pX[x] * pTgivenX[x][t];
                for (size_t y = 0; y < Y; ++y) pYgivenT[t][y] += w * pYgivenX[x][y];
            }
            // Normalize by p(t)
            for (size_t y = 0; y < Y; ++y) pYgivenT[t][y] /= pT[t];
            normalize(pYgivenT[t]);
        }

        // Update p(t|x)
        double maxChange = 0.0;
        for (size_t x = 0; x < X; ++x) {
            vector<double> score(T, 0.0);
            for (size_t t = 0; t < T; ++t) {
                double d = kl_div(pYgivenX[x], pYgivenT[t]);
                score[t] = log(max(pT[t], EPS)) - beta * d; // work in log-space
            }
            // Subtract the max score before exponentiating to avoid underflow at large beta
            double m = *max_element(score.begin(), score.end());
            vector<double> newRow(T, 0.0);
            for (size_t t = 0; t < T; ++t) newRow[t] = exp(score[t] - m);
            normalize(newRow);
            for (size_t t = 0; t < T; ++t) maxChange = max(maxChange, fabs(newRow[t] - pTgivenX[x][t]));
            pTgivenX[x] = newRow;
        }

        if (it % 20 == 0 || it == maxIters - 1) {
            double IXT = mutual_info_XT(pX, pTgivenX);
            double ITY = mutual_info_TY(pT, pYgivenT);
            cout << "Iter " << it << ": I(X;T)=" << IXT << " nats, I(T;Y)=" << ITY << " nats\n";
        }

        if (maxChange < 1e-8) break; // converged
    }

    // Print resulting clusters p(y|t)
    cout << fixed << setprecision(4);
    for (size_t t = 0; t < T; ++t) {
        cout << "Cluster t=" << t << ", p(t)=" << pT[t]
             << ", p(y|t)=[" << pYgivenT[t][0] << ", " << pYgivenT[t][1] << "]\n";
    }

    // Print soft assignments p(t|x)
    for (size_t x = 0; x < X; ++x) {
        cout << "x=" << x << " -> ";
        for (size_t t = 0; t < T; ++t)
            cout << "p(t=" << t << "|x)=" << pTgivenX[x][t] << (t + 1 == T ? "\n" : " ");
    }

    return 0;
}
This program solves the discrete Information Bottleneck using Blahut–Arimoto iterations. It initializes a soft encoder p(t|x), alternates updates for the cluster prior p(t), decoder p(y|t), and encoder p(t|x) with the KL-based distortion, and reports I(X;T) and I(T;Y) during optimization. The example clusters four inputs into two task-aware clusters based on their binary label distributions. All computations use natural logs (nats).
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937 gen; normal_distribution<double> norm;
    RNG(unsigned seed=42): gen(seed), norm(0.0, 1.0) {}
    double gauss() { return norm(gen); }
};

// Sigmoid helper
inline double sigmoid(double x){ return 1.0 / (1.0 + exp(-x)); }

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic data: x ~ N(0,1), y = 1[x + noise > 0], noise ~ N(0, 0.5^2)
    RNG rng(123);
    const int N = 2000;
    vector<double> X(N); vector<int> Y(N);
    for(int i = 0; i < N; ++i){
        double x = rng.gauss();
        double noise = 0.5 * rng.gauss();
        int y = (x + noise > 0.0) ? 1 : 0;
        X[i] = x; Y[i] = y;
    }

    // Model parameters: 1D encoder q(z|x)=N(mu(x), sigma^2(x)), with mu=wmu*x+bmu, logvar=l=wlv*x+blv
    // Classifier: p(y=1|z)=sigmoid(wc*z+bc)
    double wmu=0.1, bmu=0.0, wlv=-1.0, blv=-1.0; // start with small variance (logvar ~ -1)
    double wc=0.1, bc=0.0;

    const double beta = 1.0;  // bottleneck strength
    const double lr = 1e-2;   // learning rate
    const int epochs = 200;
    const int B = 100;        // mini-batch size

    vector<int> idx(N); iota(idx.begin(), idx.end(), 0);

    for(int ep = 0; ep < epochs; ++ep){
        // Shuffle data
        shuffle(idx.begin(), idx.end(), rng.gen);

        double epoch_loss=0.0, epoch_acc=0.0, epoch_kl=0.0;

        for(int s = 0; s < N; s += B){
            int bEnd = min(N, s + B);
            // Accumulate gradients
            double gwmu=0, gbmu=0, gwlv=0, gblv=0, gwc=0, gbc=0;
            double batch_loss=0.0, batch_acc=0.0, batch_kl=0.0;

            for(int k = s; k < bEnd; ++k){
                int i = idx[k];
                double x = X[i]; int y = Y[i];

                // Encoder parameters
                double mu = wmu*x + bmu;
                double l = wlv*x + blv; // log-variance
                double sigma2 = exp(l);
                double sigma = sqrt(max(sigma2, 1e-12));

                // Reparameterization
                double eps = rng.gauss();
                double z = mu + sigma * eps;

                // Classifier
                double logit = wc*z + bc;
                double p1 = sigmoid(logit);

                // Binary cross-entropy: -[ y*log p1 + (1-y)*log(1-p1) ]
                double ce = -( y*log(max(p1, 1e-12)) + (1-y)*log(max(1.0-p1, 1e-12)) );

                // KL(q(z|x)||N(0,1)) for 1D Gaussian: 0.5*(mu^2 + sigma^2 - log sigma^2 - 1)
                double kl = 0.5*(mu*mu + sigma2 - l - 1.0);

                double loss = ce + beta * kl;
                batch_loss += loss;
                batch_kl += kl;
                batch_acc += ( (p1 > 0.5) == (y == 1) ) ? 1.0 : 0.0;

                // Gradients
                double dL_dlogit = p1 - y; // derivative of CE wrt logit
                // Classifier grads
                gwc += dL_dlogit * z;
                gbc += dL_dlogit;
                // Through z
                double dL_dz = dL_dlogit * wc;
                // z = mu + sigma*eps
                double dL_dmu = dL_dz + beta * mu; // CE part via z + KL part
                double dL_dsigma = dL_dz * eps;    // CE part via z (dz/dsigma=eps)
                // For KL, using l=log sigma^2 => kl = 0.5*(mu^2 + exp(l) - l - 1)
                double dKL_dl = 0.5*(exp(l) - 1.0);
                // sigma = exp(0.5*l) => dsigma/dl = 0.5*sigma
                double dCE_dl = dL_dsigma * 0.5 * sigma; // only CE flows via sigma
                double dL_dl = dCE_dl + beta * dKL_dl;

                // Back to affine params
                gwmu += dL_dmu * x;
                gbmu += dL_dmu;
                gwlv += dL_dl * x;
                gblv += dL_dl;
            }

            int bs = bEnd - s;
            // Average gradients
            gwmu /= bs; gbmu /= bs; gwlv /= bs; gblv /= bs; gwc /= bs; gbc /= bs;
            batch_loss /= bs; batch_acc /= bs; batch_kl /= bs;

            // SGD updates
            wmu -= lr * gwmu; bmu -= lr * gbmu;
            wlv -= lr * gwlv; blv -= lr * gblv;
            wc  -= lr * gwc;  bc  -= lr * gbc;

            epoch_loss += batch_loss * bs; epoch_acc += batch_acc * bs; epoch_kl += batch_kl * bs;
        }

        epoch_loss /= N; epoch_acc /= N; epoch_kl /= N;
        if (ep % 20 == 0 || ep == epochs - 1) {
            // epoch_kl approximates the I(X;Z) upper bound (in nats)
            cout << "Epoch " << ep
                 << ": loss=" << epoch_loss
                 << ", acc=" << epoch_acc
                 << ", E[KL]~I(X;Z)=" << epoch_kl << " nats\n";
        }
    }

    // Report learned parameters
    cout << fixed << setprecision(4);
    cout << "wmu=" << wmu << ", bmu=" << bmu << ", wlv=" << wlv << ", blv=" << blv << "\n";
    cout << "wc=" << wc << ", bc=" << bc << "\n";

    return 0;
}
This program implements a minimal Variational Information Bottleneck for a 1D input. The encoder produces a Gaussian latent z with mean and log-variance linear in x; the classifier predicts y from z via logistic regression. The loss is cross-entropy plus β times the Gaussian KL to a standard normal prior, which upper-bounds I(X;Z). Using the reparameterization trick enables unbiased, low-variance gradients in plain C++. The printed average KL provides a proxy for compression, and accuracy reflects predictive power—tuning β traces a tradeoff.