
Information Bottleneck

Key Points

  • The Information Bottleneck (IB) principle formalizes the tradeoff between compressing an input X and preserving information about a target Y using the objective min_{p(t|x)} I(X;T) − β I(T;Y).
  • Mutual information I(A;B) measures how much knowing A reduces uncertainty about B, and β sets the balance between compressing X and retaining predictive power for Y.
  • For discrete variables, the Blahut–Arimoto algorithm efficiently finds soft clusters T that balance compression and prediction, using KL divergence as the distortion.
  • IB is closely related to rate–distortion theory with a task-aware distortion d(x,t) = D_KL(p(y|x) ‖ p(y|t)).
  • In practice, variational methods (VIB) optimize a stochastic encoder q_φ(z|x) and a predictor p_θ(y|z) with a KL regularizer that upper-bounds I(X;Z).
  • The information plane plots I(X;T) versus I(T;Y) to visualize the tradeoff; optimal solutions lie on a concave frontier traced out as β varies.
  • Common pitfalls include misestimating mutual information, using too few clusters/latent units, and ignoring numerical stability in iterative updates.
  • C++ implementations can tackle the discrete IB (BA iterations) and small variational models (VIB with reparameterization) for pedagogical experiments.

Prerequisites

  • →Probability distributions and expectations — IB and VIB manipulate joint, marginal, and conditional probabilities and require expectations over random variables.
  • →Entropy, mutual information, and KL divergence — Core quantities in the IB objective and BA updates are defined using entropy and KL.
  • →Optimization and Lagrange multipliers — IB’s tradeoff arises from a constrained optimization that connects to rate–distortion via Lagrangian duality.
  • →Markov chains and Data Processing Inequality — The IB assumes Y - X - T; reasoning about information flow relies on DPI.
  • →Numerical stability in probabilistic computation — Handling zeros, underflow, and normalization is essential in BA iterations and MI calculations.
  • →Gradient-based learning (for VIB) — VIB uses stochastic gradients and reparameterization to optimize the variational objective.
  • →Logistic regression and cross-entropy — The VIB classifier commonly optimizes cross-entropy using logistic or softmax outputs.

Detailed Explanation


01 Overview

The Information Bottleneck (IB) principle is a framework for learning compact yet predictive representations. Given an input random variable X and a task variable Y (such as labels), we seek an intermediate representation T that discards irrelevant details in X while preserving the aspects of X that are useful for predicting Y. IB turns this idea into an optimization problem: minimize the mutual information between X and T (which quantifies compression) while maximizing the mutual information between T and Y (which quantifies predictive content). A single tradeoff parameter, beta (β), balances these two aims. In the objective min_{p(t|x)} I(X;T) − β I(T;Y), setting β = 0 means the model cares only about compression, and as β increases it increasingly favors representations that preserve information about Y.

For discrete variables, IB yields a soft clustering of X into a limited number of codes T, where each cluster retains as much information about Y as possible. For continuous variables and high-dimensional data, exact mutual information is intractable, so variational approximations such as the Variational Information Bottleneck (VIB) are used. VIB introduces a stochastic encoder that maps X into a latent Z with an information-regularizing KL penalty. IB connects to classical rate–distortion theory by treating task-relevant distortion as the KL divergence between conditional label distributions. The principle underlies modern ideas in representation learning, generalization, and robust compression.

02 Intuition & Analogies

Imagine you are explaining a complex movie (X) to a friend who only cares about the plot twist (Y). You must summarize the movie (create T) without recounting every scene. If you include too many details, your summary is long (low compression); if you omit too much, your friend won’t understand the twist (low prediction). The Information Bottleneck asks: what is the shortest, most focused summary that still preserves what’s needed to predict the twist?

Another analogy: packing a suitcase. You can’t bring everything (compression), but you must include what you’ll actually need (prediction). The parameter β sets how much you value being prepared for the trip: with a small β the baggage limit dominates and you pack very little, while a larger β makes it worthwhile to pack more of what actually predicts Y. If you pack too light, you miss essentials; if you overpack, you carry unnecessary weight. IB formalizes this balance by measuring how much your summary T still tells you about Y compared to the original X.

In clustering terms, IB forms groups of inputs that are indistinguishable from Y’s perspective. Two inputs belong in the same cluster if they imply similar predictions about Y. The “distance” between inputs and clusters is not geometric but informational: how different their label distributions are. This is why the KL divergence between p(y|x) and p(y|t) plays the role of distortion. The IB algorithm iteratively refines clusters so that clusters become prototypical for predicting Y while using as few distinct clusters (or as little information about X) as possible.
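To make this concrete with made-up numbers: suppose an input x has label distribution p(y|x) = (0.9, 0.1). A cluster whose decoder currently predicts p(y|t) = (0.5, 0.5) is a poor informational match,

\[ d(x,t) = 0.9 \ln\frac{0.9}{0.5} + 0.1 \ln\frac{0.1}{0.5} \approx 0.37 \ \text{nats}, \]

while a cluster predicting p(y|t) = (0.8, 0.2) is much closer, with d(x,t) ≈ 0.04 nats, so the encoder update described below will assign x to the second cluster with higher probability.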

03 Formal Definition

Let X, Y be random variables and T a stochastic representation produced by an encoder p(t|x). The Information Bottleneck objective is
\[ \min_{p(t|x)} \; I(X;T) - \beta \, I(T;Y), \]
subject to the Markov chain Y - X - T (i.e., T depends on Y only through X). Here, I(A;B) denotes mutual information. For discrete variables,
\[ I(A;B) = \sum_{a,b} p(a,b) \log \frac{p(a,b)}{p(a)\,p(b)}. \]
IB is equivalent (via Lagrangian duality) to a constrained problem: minimize I(X;T) subject to I(T;Y) ≥ I_0, or to a rate–distortion form
\[ \min_{p(t|x)} \; I(X;T) + \beta \, \mathbb{E}[d(X,T)], \qquad d(x,t) = D_{\mathrm{KL}}\big(p(y|x) \,\|\, p(y|t)\big), \]
where D_KL is the Kullback–Leibler divergence. For discrete alphabets, stationary points satisfy the Blahut–Arimoto (BA) style updates:
\[ p(t) = \sum_x p(x)\, p(t|x), \qquad p(y|t) = \frac{1}{p(t)} \sum_x p(y|x)\, p(x)\, p(t|x), \]
\[ p(t|x) = \frac{p(t)\, \exp\!\big(-\beta\, D_{\mathrm{KL}}(p(y|x)\,\|\,p(y|t))\big)}{Z(x)}, \]
with the normalizer Z(x) ensuring ∑_t p(t|x) = 1. As β varies, the solutions trace a concave curve in the information plane relating I(X;T) and I(T;Y).
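For small alphabets, the mutual-information sum above can be evaluated directly. Below is a minimal, self-contained sketch (separate from the full BA solver later on this page) that computes I(A;B) from a made-up 2×2 joint table; the function name is illustrative.

#include <bits/stdc++.h>
using namespace std;

// I(A;B) = sum_{a,b} p(a,b) * log( p(a,b) / (p(a) p(b)) ), in nats.
double mutual_info_joint(const vector<vector<double>>& pAB) {
  size_t A = pAB.size(), B = pAB[0].size();
  vector<double> pA(A, 0.0), pB(B, 0.0);
  for (size_t a = 0; a < A; ++a)
    for (size_t b = 0; b < B; ++b) { pA[a] += pAB[a][b]; pB[b] += pAB[a][b]; }
  double I = 0.0;
  for (size_t a = 0; a < A; ++a)
    for (size_t b = 0; b < B; ++b)
      if (pAB[a][b] > 0.0)
        I += pAB[a][b] * log(pAB[a][b] / (pA[a] * pB[b]));
  return I;
}

int main() {
  // Correlated binary pair: the variables mostly agree, sometimes disagree.
  vector<vector<double>> pAB = {{0.4, 0.1},
                                {0.1, 0.4}};
  cout << "I(A;B) = " << mutual_info_joint(pAB) << " nats\n"; // roughly 0.19 nats
  return 0;
}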

04 When to Use

  • Task-aware compression: When you want to compress inputs while maintaining performance on a specific predictive task, IB provides a principled target.
  • Clustering for prediction: If you need soft clusters that are maximally informative about labels, the discrete IB (via BA iterations) is appropriate.
  • Representation learning: IB/VIB yields representations that discard nuisance factors, improving robustness and generalization under distribution shift and noise.
  • Model selection and regularization: The β parameter acts like an information-based regularizer; tuning β trades model complexity for predictive sufficiency.
  • Interpretable summaries: IB can extract small, human-readable codebooks T that summarize how inputs relate to outcomes.
  • Communication and coding: In settings akin to source coding with side information, IB connects to rate–distortion and can guide system design when distortion is task-relevance rather than Euclidean error.
  • Low-data regimes: By enforcing parsimony (small I(X;T)), IB can reduce overfitting when data is scarce, especially with variational bounds that are computationally tractable.

⚠️Common Mistakes

  • Confusing compression with dimensionality: Lower-dimensional T does not guarantee lower I(X;T); stochastic encoders and noise often do the real compressing.
  • Misestimating mutual information: Naive plug-in estimates are biased; prefer exact sums for discrete distributions or variational/nearest-neighbor estimators for continuous data.
  • Ignoring the Markov constraint: The IB theory assumes Y - X - T; learning pipelines that leak Y directly into T violate assumptions and distort the tradeoff.
  • Numerical instability in BA updates: Failing to normalize p(t|x) properly or allowing zeros in probabilities leads to NaNs; always add small epsilons and renormalize.
  • Over-regularizing the bottleneck: Excessive compression (a large KL weight β in the VIB objective, or a very small β in the discrete IB objective) can destroy predictive content; monitor I(T;Y), accuracy, or validation loss when sweeping β.
  • Using hard clusters prematurely: IB solutions are soft assignments; forcing hard clusters early can trap the algorithm in poor local optima.
  • Comparing nats and bits: Mutual information depends on the log base; be consistent (natural log vs. log base 2) when reporting and interpreting values (see the snippet after this list).
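Two of these pitfalls, log-base mix-ups and zeros in probabilities, are cheap to guard against. A minimal sketch of the kind of helpers one might keep around (the function names are illustrative):

#include <bits/stdc++.h>
using namespace std;

// Convert an information quantity from nats (natural log) to bits (log base 2).
double nats_to_bits(double nats) { return nats / log(2.0); }

// Clamp and renormalize a distribution so later log() calls never see zero.
void clamp_and_renormalize(vector<double>& p, double eps = 1e-12) {
  for (double& v : p) v = max(v, eps);
  double s = accumulate(p.begin(), p.end(), 0.0);
  for (double& v : p) v /= s;
}

int main() {
  cout << nats_to_bits(0.693147) << " bits\n";  // ln 2 nats is ~1 bit
  vector<double> p = {0.7, 0.3, 0.0};           // a zero that would break log()
  clamp_and_renormalize(p);
  cout << p[0] << " " << p[1] << " " << p[2] << "\n";
  return 0;
}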

Key Formulas

Information Bottleneck Objective

\[ \min_{p(t|x)} \; I(X;T) - \beta \, I(T;Y) \]

Explanation: This is the core tradeoff between compression (I(X;T)) and prediction (I(T;Y)), balanced by the nonnegative parameter β. Larger β favors preserving more about Y at the cost of less compression.

Mutual Information (Discrete)

\[ I(A;B) = \sum_{a,b} p(a,b) \log \frac{p(a,b)}{p(a)\,p(b)} \]

Explanation: Mutual information sums over all pairs of outcomes and measures how much the joint deviates from the product of marginals. Zero MI means independence.
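As a worked example with a made-up 2×2 joint table (the same numbers used in the earlier snippet): with p(0,0) = p(1,1) = 0.4 and p(0,1) = p(1,0) = 0.1, both marginals are (0.5, 0.5), so

\[ I(A;B) = 2 \cdot 0.4 \ln\frac{0.4}{0.25} + 2 \cdot 0.1 \ln\frac{0.1}{0.25} \approx 0.376 - 0.183 \approx 0.19 \ \text{nats} \approx 0.28 \ \text{bits}. \]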

KL Divergence (Discrete)

\[ D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i} \]

Explanation: KL divergence quantifies how well q approximates p, with 0 meaning identical distributions. In IB it acts as a task-informed distortion between conditional label distributions.

Task-Aware Distortion

\[ d(x,t) = D_{\mathrm{KL}}\big(p(y|x) \,\|\, p(y|t)\big) \]

Explanation: Two inputs x that induce similar label distributions should map to the same t. The KL divergence captures this notion of predictive similarity.

BA Marginal and Decoder Updates

\[ p(t) = \sum_x p(x)\, p(t|x), \qquad p(y|t) = \frac{1}{p(t)} \sum_x p(y|x)\, p(x)\, p(t|x) \]

Explanation: Given an encoder p(t|x), we update the cluster prior p(t) and the decoder p(y|t) by averaging conditional label distributions over inputs assigned to t.

BA Encoder Update

\[ p(t|x) = \frac{p(t)\, \exp\!\big(-\beta\, D_{\mathrm{KL}}(p(y|x)\,\|\,p(y|t))\big)}{Z(x)} \]

Explanation: Inputs x prefer clusters t whose predicted label distribution p(y|t) is close (in KL) to p(y|x), weighted by the cluster size p(t). Z(x) normalizes across t.
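Because the exponent −β·D_KL can be strongly negative when β is large, it is safer to compute this update in log-space and subtract the maximum before exponentiating. A minimal, self-contained sketch of one such update for a single x (the helper names encoder_update and kl_div are illustrative, not taken from the full program later on this page):

#include <bits/stdc++.h>
using namespace std;

// D_KL(p || q) in nats for normalized discrete distributions.
double kl_div(const vector<double>& p, const vector<double>& q) {
  double d = 0.0;
  for (size_t i = 0; i < p.size(); ++i)
    if (p[i] > 0.0) d += p[i] * log(p[i] / max(q[i], 1e-12));
  return d;
}

// One BA encoder update p(t|x) ∝ p(t) exp(-beta * D_KL(p(y|x)||p(y|t))),
// computed in log-space with max-subtraction for numerical stability.
vector<double> encoder_update(const vector<double>& pT,
                              const vector<double>& pY_given_x,
                              const vector<vector<double>>& pY_given_T,
                              double beta) {
  size_t T = pT.size();
  vector<double> logScore(T);
  for (size_t t = 0; t < T; ++t)
    logScore[t] = log(max(pT[t], 1e-12)) - beta * kl_div(pY_given_x, pY_given_T[t]);
  double m = *max_element(logScore.begin(), logScore.end());
  vector<double> pT_given_x(T);
  double Z = 0.0;                              // plays the role of Z(x)
  for (size_t t = 0; t < T; ++t) { pT_given_x[t] = exp(logScore[t] - m); Z += pT_given_x[t]; }
  for (double& v : pT_given_x) v /= Z;
  return pT_given_x;
}

int main() {
  vector<double> pT = {0.5, 0.5};
  vector<double> pY_given_x = {0.9, 0.1};
  vector<vector<double>> pY_given_T = {{0.85, 0.15}, {0.15, 0.85}};
  for (double v : encoder_update(pT, pY_given_x, pY_given_T, 5.0)) cout << v << " ";
  cout << "\n"; // x should strongly prefer the first cluster
  return 0;
}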

MI via Encoder

\[ I(X;T) = \sum_{x,t} p(x)\, p(t|x) \log \frac{p(t|x)}{p(t)} \]

Explanation: When T is generated by p(t|x), mutual information simplifies to an expectation of the log-likelihood ratio p(t|x)/p(t). This is convenient in discrete IB.

VIB Objective

\[ \mathcal{L}_{\mathrm{VIB}} = \mathbb{E}_{p(x,y)}\, \mathbb{E}_{q(z|x)}\big[-\log p(y|z)\big] + \beta\, \mathbb{E}_{p(x)}\, D_{\mathrm{KL}}\big(q(z|x) \,\|\, p(z)\big) \]

Explanation: The variational relaxation replaces the intractable MI with a KL regularizer to a prior p(z). The first term is prediction loss; the second controls compression. Note that in this formulation β multiplies the compression term, the reverse of the convention used in the discrete IB objective above.

Gaussian KL

\[ D_{\mathrm{KL}}\big(\mathcal{N}(\mu,\Sigma) \,\|\, \mathcal{N}(0,I)\big) = \tfrac{1}{2}\big(\mathrm{tr}(\Sigma) + \mu^{\top}\mu - k - \log\det\Sigma\big) \]

Explanation: For a k-dimensional Gaussian encoder with diagonal covariance, the KL term has a closed form, enabling efficient optimization in VIB.
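For the diagonal-covariance case this closed form reduces to a sum over coordinates. A small self-contained sketch (the function name is illustrative):

#include <bits/stdc++.h>
using namespace std;

// KL( N(mu, diag(sigma^2)) || N(0, I) ) for a k-dimensional diagonal Gaussian,
// = 0.5 * sum_i ( sigma_i^2 + mu_i^2 - 1 - log sigma_i^2 ), in nats.
double kl_diag_gaussian_to_standard(const vector<double>& mu, const vector<double>& logvar) {
  double kl = 0.0;
  for (size_t i = 0; i < mu.size(); ++i)
    kl += 0.5 * (exp(logvar[i]) + mu[i] * mu[i] - 1.0 - logvar[i]);
  return kl;
}

int main() {
  // A standard-normal encoder output has zero KL; shifting or scaling it increases the KL.
  cout << kl_diag_gaussian_to_standard({0.0, 0.0}, {0.0, 0.0}) << "\n";   // 0
  cout << kl_diag_gaussian_to_standard({1.0, -0.5}, {0.0, -1.0}) << "\n"; // > 0
  return 0;
}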

Data Processing Inequality

\[ I(T;Y) \le I(X;Y) \]

Explanation: Because Y - X - T is a Markov chain, processing X into T cannot create information about Y. IB solutions lie under this bound in the information plane.

Complexity Analysis

For the discrete Information Bottleneck solved via Blahut–Arimoto (BA) iterations, each update requires computing KL divergences between p(y|x) and p(y|t) for all pairs (x, t). If we denote |X|, |Y|, and |T| as the alphabet sizes, then computing all D_KL values costs O(|X||T||Y|) time per iteration. Updating the decoder p(y|t) aggregates over X and costs O(|X||T||Y|), while updating the encoder p(t|x) and cluster prior p(t) costs O(|X||T|). Thus, each full BA iteration is O(|X||T||Y|), with memory O(|X||T| + |T||Y| + |X||Y|) to store p(t|x), p(y|t), and p(y|x). Convergence typically occurs in tens to hundreds of iterations for small problems but can slow near phase transitions as β varies. Computing mutual information exactly from discrete tables costs O(|A||B|) for I(A;B) when the joint and marginals are available. Numerical stability necessitates careful handling of zeros with small epsilons, which adds constant-time checks but does not change the asymptotic complexity.

For a small Variational Information Bottleneck (VIB) model with a one-dimensional Gaussian latent, a linear encoder, and a logistic classifier, forward and backward passes are O(d) per sample, with d the input dimension (here d = 1). Using mini-batches of size B over N samples and E epochs yields O(N E d) total time, with constant-factor overhead for random sampling (reparameterization) and exponentials/logistics. Memory is O(N) for the dataset plus O(1) for parameters, since the model has only a few scalars. In higher dimensions, time scales linearly with the latent and input dimensions under diagonal-covariance encoders. Variational MI surrogates avoid combinatorial sums but can require more epochs to converge than discrete BA on small alphabets.

Code Examples

Discrete Information Bottleneck via Blahut–Arimoto (soft clustering)
#include <bits/stdc++.h>
using namespace std;

// Small epsilon to avoid log(0)
const double EPS = 1e-12;

// Normalize a vector to sum to 1
void normalize(vector<double>& v) {
  double s = 0.0;
  for (double x : v) s += x;
  if (s <= 0) { double u = 1.0 / v.size(); for (double &x : v) x = u; return; }
  for (double &x : v) x = max(x, EPS);
  s = accumulate(v.begin(), v.end(), 0.0);
  for (double &x : v) x /= s;
}

// KL divergence D_KL(p||q) for discrete distributions (both must be normalized)
double kl_div(const vector<double>& p, const vector<double>& q) {
  double d = 0.0;
  for (size_t i = 0; i < p.size(); ++i) {
    double pi = max(p[i], EPS);
    double qi = max(q[i], EPS);
    d += pi * (log(pi) - log(qi)); // natural log => nats
  }
  return d;
}

// Compute mutual information I(X;T) given p(x) and p(t|x)
double mutual_info_XT(const vector<double>& pX, const vector<vector<double>>& pTgivenX) {
  size_t X = pX.size();
  size_t T = pTgivenX[0].size();
  vector<double> pT(T, 0.0);
  for (size_t x = 0; x < X; ++x) {
    for (size_t t = 0; t < T; ++t) pT[t] += pX[x] * pTgivenX[x][t];
  }
  double I = 0.0;
  for (size_t x = 0; x < X; ++x) {
    for (size_t t = 0; t < T; ++t) {
      double joint = pX[x] * pTgivenX[x][t];
      if (joint < EPS) continue;
      I += joint * (log(pTgivenX[x][t]) - log(max(pT[t], EPS)));
    }
  }
  return I; // in nats
}

// Compute mutual information I(T;Y) given p(t) and p(y|t)
double mutual_info_TY(const vector<double>& pT, const vector<vector<double>>& pYgivenT) {
  size_t T = pT.size();
  size_t Y = pYgivenT[0].size();
  vector<double> pY(Y, 0.0);
  for (size_t t = 0; t < T; ++t) {
    for (size_t y = 0; y < Y; ++y) pY[y] += pT[t] * pYgivenT[t][y];
  }
  double I = 0.0;
  for (size_t t = 0; t < T; ++t) {
    for (size_t y = 0; y < Y; ++y) {
      double joint = pT[t] * pYgivenT[t][y];
      if (joint < EPS) continue;
      I += joint * (log(pYgivenT[t][y]) - log(max(pY[y], EPS)));
    }
  }
  return I; // in nats
}

int main() {
  ios::sync_with_stdio(false);
  cin.tie(nullptr);

  // Example: small discrete IB problem
  // X has 4 values, Y has 2 values (binary labels), T has K clusters
  size_t X = 4, Y = 2, T = 2; // try T=2 clusters
  double beta = 5.0;          // tradeoff parameter (higher => more predictive)
  int maxIters = 200;

  // Define p(x)
  vector<double> pX = {0.25, 0.25, 0.25, 0.25};

  // Define p(y|x): rows sum to 1
  // Suppose x=0,1 imply mostly y=0; x=2,3 imply mostly y=1
  vector<vector<double>> pYgivenX = {
    {0.9, 0.1}, // x=0
    {0.8, 0.2}, // x=1
    {0.2, 0.8}, // x=2
    {0.1, 0.9}  // x=3
  };

  // Initialize p(t|x) randomly and normalize rows
  mt19937 rng(42);
  uniform_real_distribution<double> uni(0.0, 1.0);
  vector<vector<double>> pTgivenX(X, vector<double>(T, 0.0));
  for (size_t x = 0; x < X; ++x) {
    for (size_t t = 0; t < T; ++t) pTgivenX[x][t] = uni(rng) + EPS;
    normalize(pTgivenX[x]);
  }

  // Allocate p(t) and p(y|t)
  vector<double> pT(T, 0.0);
  vector<vector<double>> pYgivenT(T, vector<double>(Y, 0.0));

  for (int it = 0; it < maxIters; ++it) {
    // Update p(t)
    fill(pT.begin(), pT.end(), 0.0);
    for (size_t x = 0; x < X; ++x)
      for (size_t t = 0; t < T; ++t)
        pT[t] += pX[x] * pTgivenX[x][t];
    for (size_t t = 0; t < T; ++t) pT[t] = max(pT[t], EPS);

    // Update p(y|t)
    for (size_t t = 0; t < T; ++t) fill(pYgivenT[t].begin(), pYgivenT[t].end(), 0.0);
    for (size_t t = 0; t < T; ++t) {
      for (size_t x = 0; x < X; ++x) {
        double w = pX[x] * pTgivenX[x][t];
        for (size_t y = 0; y < Y; ++y) pYgivenT[t][y] += w * pYgivenX[x][y];
      }
      // Normalize by p(t)
      for (size_t y = 0; y < Y; ++y) pYgivenT[t][y] /= pT[t];
      normalize(pYgivenT[t]);
    }

    // Update p(t|x)
    double maxChange = 0.0;
    for (size_t x = 0; x < X; ++x) {
      vector<double> newRow(T, 0.0);
      for (size_t t = 0; t < T; ++t) {
        double d = kl_div(pYgivenX[x], pYgivenT[t]);
        double score = log(max(pT[t], EPS)) - beta * d; // work in log-space
        newRow[t] = exp(score);
      }
      normalize(newRow);
      for (size_t t = 0; t < T; ++t) maxChange = max(maxChange, fabs(newRow[t] - pTgivenX[x][t]));
      pTgivenX[x] = newRow;
    }

    if (it % 20 == 0 || it == maxIters - 1) {
      double IXT = mutual_info_XT(pX, pTgivenX);
      double ITY = mutual_info_TY(pT, pYgivenT);
      cout << "Iter " << it << ": I(X;T)=" << IXT << " nats, I(T;Y)=" << ITY << " nats\n";
    }

    if (maxChange < 1e-8) break; // converged
  }

  // Print resulting clusters p(y|t)
  cout << fixed << setprecision(4);
  for (size_t t = 0; t < T; ++t) {
    cout << "Cluster t=" << t << ", p(t)=" << pT[t] << ", p(y|t)=[" << pYgivenT[t][0] << ", " << pYgivenT[t][1] << "]\n";
  }

  // Print soft assignments p(t|x)
  for (size_t x = 0; x < X; ++x) {
    cout << "x=" << x << " -> ";
    for (size_t t = 0; t < T; ++t) cout << "p(t=" << t << "|x)=" << pTgivenX[x][t] << (t+1==T ? "\n" : " ");
  }

  return 0;
}

This program solves the discrete Information Bottleneck using Blahut–Arimoto iterations. It initializes a soft encoder p(t|x), alternates updates for the cluster prior p(t), decoder p(y|t), and encoder p(t|x) with the KL-based distortion, and reports I(X;T) and I(T;Y) during optimization. The example clusters four inputs into two task-aware clusters based on their binary label distributions. All computations use natural logs (nats).

Time: O(|X||T||Y|) per iteration, typically tens to hundreds of iterations. Space: O(|X||T| + |T||Y| + |X||Y|).
Variational Information Bottleneck (1D Gaussian encoder + logistic classifier)
#include <bits/stdc++.h>
using namespace std;

struct RNG {
  mt19937 gen;
  normal_distribution<double> norm;
  RNG(unsigned seed = 42) : gen(seed), norm(0.0, 1.0) {}
  double gauss() { return norm(gen); }
};

// Sigmoid and stable log-sigmoid helpers
inline double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }
inline double log1pexp(double x) { if (x > 0) return x + log1p(exp(-x)); else return log1p(exp(x)); }

int main() {
  ios::sync_with_stdio(false);
  cin.tie(nullptr);

  // Synthetic data: x ~ N(0,1), y = 1[x + noise > 0], noise ~ N(0, 0.5^2)
  RNG rng(123);
  const int N = 2000;
  vector<double> X(N);
  vector<int> Y(N);
  for (int i = 0; i < N; ++i) {
    double x = rng.gauss();
    double noise = 0.5 * rng.gauss();
    int y = (x + noise > 0.0) ? 1 : 0;
    X[i] = x; Y[i] = y;
  }

  // Model parameters: 1D encoder q(z|x)=N(mu(x), sigma^2(x)), with mu=wmu*x+bmu, logvar=l=wlv*x+blv
  // Classifier: p(y=1|z)=sigmoid(wc*z+bc)
  double wmu = 0.1, bmu = 0.0, wlv = -1.0, blv = -1.0; // start with small variance (logvar ~ -1)
  double wc = 0.1, bc = 0.0;

  const double beta = 1.0; // bottleneck strength
  const double lr = 1e-2;  // learning rate
  const int epochs = 200;
  const int B = 100;       // mini-batch size

  vector<int> idx(N);
  iota(idx.begin(), idx.end(), 0);

  for (int ep = 0; ep < epochs; ++ep) {
    // Shuffle data
    shuffle(idx.begin(), idx.end(), rng.gen);

    double epoch_loss = 0.0, epoch_acc = 0.0, epoch_kl = 0.0;

    for (int s = 0; s < N; s += B) {
      int bEnd = min(N, s + B);
      // Accumulate gradients
      double gwmu = 0, gbmu = 0, gwlv = 0, gblv = 0, gwc = 0, gbc = 0;
      double batch_loss = 0.0, batch_acc = 0.0, batch_kl = 0.0;

      for (int k = s; k < bEnd; ++k) {
        int i = idx[k];
        double x = X[i];
        int y = Y[i];

        // Encoder parameters
        double mu = wmu * x + bmu;
        double l = wlv * x + blv; // log-variance
        double sigma2 = exp(l);
        double sigma = sqrt(max(sigma2, 1e-12));

        // Reparameterization
        double eps = rng.gauss();
        double z = mu + sigma * eps;

        // Classifier
        double logit = wc * z + bc;
        double p1 = sigmoid(logit);

        // Binary cross-entropy: -[ y*log p1 + (1-y)*log(1-p1) ]
        double ce = -(y * log(max(p1, 1e-12)) + (1 - y) * log(max(1.0 - p1, 1e-12)));

        // KL(q(z|x)||N(0,1)) for 1D Gaussian: 0.5*(mu^2 + sigma^2 - log sigma^2 - 1)
        double kl = 0.5 * (mu * mu + sigma2 - l - 1.0);

        double loss = ce + beta * kl;
        batch_loss += loss;
        batch_kl += kl;
        batch_acc += ((p1 > 0.5) == (y == 1)) ? 1.0 : 0.0;

        // Gradients
        double dL_dlogit = p1 - y; // derivative of CE wrt logit
        // Classifier grads
        gwc += dL_dlogit * z;
        gbc += dL_dlogit;
        // Through z
        double dL_dz = dL_dlogit * wc;
        // z = mu + sigma*eps
        double dL_dmu = dL_dz + beta * mu; // CE part via z + KL part
        double dL_dsigma = dL_dz * eps;    // CE part via z (dz/dsigma=eps)
        // For KL, using l=log sigma^2 => kl = 0.5*(mu^2 + exp(l) - l - 1)
        double dKL_dl = 0.5 * (exp(l) - 1.0);
        // sigma = exp(0.5*l) => dsigma/dl = 0.5*sigma
        double dCE_dl = dL_dsigma * 0.5 * sigma; // only CE flows via sigma
        double dL_dl = dCE_dl + beta * dKL_dl;

        // Back to affine params
        gwmu += dL_dmu * x;
        gbmu += dL_dmu;
        gwlv += dL_dl * x;
        gblv += dL_dl;
      }

      int bs = bEnd - s;
      // Average gradients
      gwmu /= bs; gbmu /= bs; gwlv /= bs; gblv /= bs; gwc /= bs; gbc /= bs;
      batch_loss /= bs; batch_acc /= bs; batch_kl /= bs;

      // SGD updates
      wmu -= lr * gwmu; bmu -= lr * gbmu;
      wlv -= lr * gwlv; blv -= lr * gblv;
      wc -= lr * gwc; bc -= lr * gbc;

      epoch_loss += batch_loss * bs; epoch_acc += batch_acc * bs; epoch_kl += batch_kl * bs;
    }

    epoch_loss /= N; epoch_acc /= N; epoch_kl /= N;
    if (ep % 20 == 0 || ep == epochs - 1) {
      // epoch_kl approximates the I(X;Z) upper bound (in nats)
      cout << "Epoch " << ep
           << ": loss=" << epoch_loss
           << ", acc=" << epoch_acc
           << ", E[KL]~I(X;Z)=" << epoch_kl << " nats\n";
    }
  }

  // Report learned parameters
  cout << fixed << setprecision(4);
  cout << "wmu=" << wmu << ", bmu=" << bmu << ", wlv=" << wlv << ", blv=" << blv << "\n";
  cout << "wc=" << wc << ", bc=" << bc << "\n";

  return 0;
}

This program implements a minimal Variational Information Bottleneck for a 1D input. The encoder produces a Gaussian latent z with mean and log-variance linear in x; the classifier predicts y from z via logistic regression. The loss is cross-entropy plus β times the Gaussian KL to a standard normal prior, which upper-bounds I(X;Z). Using the reparameterization trick enables unbiased, low-variance gradients in plain C++. The printed average KL provides a proxy for compression, and accuracy reflects predictive power—tuning β traces a tradeoff.

Time: O(N · epochs) for 1D with constant work per sample; scales linearly with input/latent dimension under diagonal covariance. Space: O(N) for data storage and O(1) for model parameters.
Tags: information bottleneck, mutual information, KL divergence, rate–distortion, Blahut–Arimoto, soft clustering, variational information bottleneck, representation learning, reparameterization trick, information plane, compression, predictive coding, regularization, Gaussian encoder, MI estimation