Normalizing Flows
Key Points
- Normalizing flows transform a simple base distribution (like a standard Gaussian) into a complex target distribution using a chain of invertible functions.
- The change-of-variables formula links densities through the Jacobian determinant of the inverse transform: p_X(x) = p_Z(f^{-1}(x)) · |det J_{f^{-1}}(x)|.
- Designs like affine coupling layers (e.g., RealNVP) ensure easy inversion and cheap log-determinant computation by making the Jacobian triangular.
- Training maximizes the log-likelihood, which becomes the base log-density plus a sum of per-layer log-determinants.
- Efficient implementations avoid O(d^3) Jacobian determinants by using structures with O(d) or O(d^2) cost per layer (triangular, LU-parameterized, or 1×1 convolutions).
- Sampling is done by drawing z from the base and pushing it forward through the flow; density evaluation pulls x back through the inverse.
- Numerical stability requires summing log-scales instead of multiplying scales, clamping scales, and avoiding direct determinants.
- C++ implementations can model flows with plain linear algebra: coupling layers for nonlinearity and triangular or LU layers for fast log-dets.
Prerequisites
- Multivariable calculus (Jacobian and chain rule) — Flows rely on Jacobians and the change-of-variables theorem to relate densities.
- Linear algebra (matrices, determinants, LU) — Efficient log-determinant computation and invertibility constraints depend on matrix properties.
- Probability densities and Gaussian distributions — Flows start from and evaluate against simple base densities like Gaussians.
- Numerical stability and log-sum-exp tricks — Stable accumulation of log-determinants and scaling factors is essential.
- Neural networks basics (MLPs, activations) — Scale/shift functions in coupling layers are typically small neural networks.
- C++ programming and basic linear algebra implementation — Implementing flow layers requires vector/matrix operations and careful memory handling.
Detailed Explanation
01 Overview
Normalizing flows are probabilistic models that build complex probability distributions by warping a simple, known distribution through a sequence of invertible transformations. Imagine starting with a standard multivariate Gaussian (easy to sample from and to evaluate) and then repeatedly bending and stretching space using functions that can be undone exactly. Because each step is invertible and differentiable, we can track how volumes (and therefore densities) change via the Jacobian determinant of each transformation. The change-of-variables theorem tells us how probability densities transform under such mappings, making exact likelihood computation possible. A typical flow composes many simple, carefully designed layers to achieve both expressivity and computational efficiency. Popular designs include affine coupling layers (RealNVP), invertible 1×1 convolutions (Glow), and triangular/LU-parameterized linear transforms. Crucially, these designs keep the Jacobian determinant easy to compute—ideally as a sum over diagonal entries—so the overall log-likelihood reduces to a base log-density plus the sum of per-layer log-determinants. Flows are attractive because they support both fast sampling (forward pass) and exact density evaluation (inverse pass), unlike many generative models that only do one efficiently.
02 Intuition & Analogies
Picture molding a lump of clay. Initially it’s a perfect sphere—simple, symmetric, and easy to describe. Now, with each gentle press and twist of your hands, you form ridges, cavities, and intricate shapes. If each press is carefully controlled so you can reverse it exactly, you could reconstruct the original sphere by applying the reverse presses in the reverse order. Normalizing flows use this idea for probability distributions. We start with a simple “sphere-like” distribution (a standard Gaussian). Each flow layer is like a reversible press that deforms space. Where you compress, probability density increases (points are packed tighter); where you stretch, density decreases. The amount of compression or stretching at a point is measured by the Jacobian determinant of the transformation—the multiplier that tells you how a tiny volume changes. By chaining many reversible presses (layers), you can sculpt extremely detailed, multimodal distributions from a simple base. Affine coupling layers are like pressing only half the clay while holding the other half fixed; you decide how to press the second half based on the first half. Because only part moves at a time, the math remains simple and you always know exactly how much volume changed (you just add up some log-scales). LU and triangular layers are like using tools that only push along certain directions, keeping the “volume change tracker” easy to compute. In the end, to say how probable a sculpture point x is, you just unpress it back to the sphere, account for all the presses’ volume changes, and read off the sphere’s density at the recovered point.
03 Formal Definition
Let z ∈ R^d have a known density p_Z (the base distribution), and let f = f_K ∘ … ∘ f_1 be a composition of invertible, differentiable maps with x = f(z). The change-of-variables theorem gives p_X(x) = p_Z(f^{-1}(x)) · |det J_{f^{-1}}(x)|, or in log form, log p_X(x) = log p_Z(z) + Σ_{k=1}^{K} log|det J_{f_k^{-1}}|, where the log-determinants are evaluated at the intermediate values along the inverse path. Training maximizes this exact log-likelihood over the data.
04 When to Use
Use normalizing flows when you need a generative model that supports both fast sampling and exact likelihoods. They excel in density estimation tasks, anomaly detection (low likelihood indicates outliers), and as flexible priors in Bayesian models where you must evaluate densities precisely. Flows are also effective for modeling continuous-valued data such as audio waveforms, tabular data, and image pixels (after dequantization). If you can exploit structure (e.g., triangular Jacobians, LU factorization, or channel-wise couplings), you can scale to high dimensions while keeping computation tractable. They are particularly appealing when you want interpretability of likelihoods (unlike some implicit models), or need likelihood-based training objectives (e.g., minimizing negative log-likelihood or bits-per-dimension in images). Consider them when re-parameterizable sampling is important (e.g., variational inference) and when invertibility can be guaranteed by design. However, if your data are discrete without a suitable continuous relaxation, or if you require extremely global dependencies that are difficult for coupling flows to capture without very deep stacks, you may prefer autoregressive models or diffusion models.
⚠️ Common Mistakes
- Forgetting the absolute value in |det J|. The determinant can be negative; omitting |·| gives wrong densities. Always work with log|det J| to maintain numerical stability.
- Computing a full Jacobian and its determinant naively. This is O(d^3) per layer and defeats the purpose. Use structures with triangular, block-triangular, or LU forms so log-dets are O(d) or O(d^2).
- Ignoring invertibility constraints. For planar or radial flows, parameters must satisfy conditions to ensure invertibility. For affine coupling, only the overall layer must be invertible; the s and t networks need not be.
- Exponentiating large scales directly. Using exp(s) can overflow. Clamp s (e.g., via tanh and a multiplier) and accumulate log-dets as sums, never by forming determinants explicitly.
- Mixing up forward and inverse. Sampling uses forward (z → x), while likelihood evaluation uses inverse (x → z). Make sure the sign of the log-determinant matches the chosen direction.
- Poor mixing between dimensions. Using the same mask repeatedly in coupling layers limits expressivity. Alternate masks and add permutation or invertible linear mixing between layers.
- Forgetting data preprocessing. For images, dequantize and rescale; for continuous data, standardize. Flows are sensitive to scale.
- Not handling batch shapes carefully in implementations. Mismatched dimensions or incorrectly broadcasted scale/shift vectors can silently corrupt results.
Key Formulas
Change of Variables
p_X(x) = p_Z(f^{-1}(x)) · |det J_{f^{-1}}(x)|
Explanation: Relates the density at x to the base density evaluated at the inverse image, scaled by how volumes change under the inverse map. This is the core identity enabling exact likelihoods in flows.
Composed Flow Log-Likelihood
log p_X(x) = log p_Z(z) + Σ_{k=1}^{K} log|det J_{f_k^{-1}}|, where z = f^{-1}(x)
Explanation: For a composition of K invertible layers, the total log-likelihood is the base log-density plus the sum of per-layer log-determinants along the inverse path.
Multivariate Gaussian Log-Density
log N(z; μ, Σ) = −(1/2) [ d·log(2π) + log|Σ| + (z − μ)^T Σ^{-1} (z − μ) ]
Explanation: This gives the log-probability of a Gaussian vector. With Σ = I and μ = 0, it simplifies to a constant minus half the squared norm of z.
Affine Coupling Layer
y_a = x_a,  y_b = x_b ⊙ exp(s(x_a)) + t(x_a);  log|det J| = Σ_i s(x_a)_i
Explanation: Keeping part of the vector fixed yields a triangular Jacobian. The log-determinant is the sum of the scale outputs, making it cheap to compute and perfectly invertible.
Matrix Determinant Lemma
det(I + u v^T) = 1 + v^T u
Explanation: Used by planar flows, where f(z) = z + u h(w^T z + b), so the Jacobian is the identity plus a rank-one term. The determinant reduces to a simple scalar expression, enabling fast log-determinant computation.
LU Log-Determinant
log|det(L U)| = Σ_i log|U_ii|  (L unit-diagonal lower-triangular, U upper-triangular)
Explanation: For an LU-parameterized invertible linear map, the log-determinant is the sum of logs of U’s diagonal, independent of x. This underpins efficient invertible 1×1 convolutions.
Negative Log-Likelihood Objective
L(θ) = −Σ_{n=1}^{N} log p_X(x_n; θ)
Explanation: Training flows by maximum likelihood minimizes the total negative log-likelihood over data. Gradients backpropagate through inverses and log-determinants.
Inverse Function Theorem (Local)
det J_f(x₀) ≠ 0  ⇒  f is invertible in a neighborhood of x₀
Explanation: A nonzero Jacobian determinant guarantees a local inverse. Flow layers must maintain this condition everywhere to ensure valid densities.
Complexity Analysis
Per affine coupling layer, the forward and inverse passes cost O(d) on top of evaluating the s and t networks, and the log-determinant is an O(d) sum of scale outputs. Triangular and LU-parameterized linear layers cost O(d^2) per matrix-vector product or triangular solve, with an O(d) log-determinant read off the diagonal. A naive full-Jacobian determinant would cost O(d^3) per layer, which these structured designs are built to avoid.
Code Examples
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <numeric>
#include <algorithm>

// Utility functions for simple vector/matrix ops
using Vec = std::vector<double>;
using Mat = std::vector<std::vector<double>>;

static double dot(const Vec &a, const Vec &b) {
    double s = 0.0; for (size_t i = 0; i < a.size(); ++i) s += a[i]*b[i]; return s;
}

static Vec matvec(const Mat &W, const Vec &x) {
    size_t m = W.size();
    size_t n = x.size();
    Vec y(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += W[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

static Vec add(const Vec &a, const Vec &b) {
    Vec y(a.size());
    for (size_t i = 0; i < a.size(); ++i) y[i] = a[i] + b[i];
    return y;
}

static Vec tanh_vec(const Vec &x) {
    Vec y(x.size());
    for (size_t i = 0; i < x.size(); ++i) y[i] = std::tanh(x[i]);
    return y;
}

static Vec exp_vec(const Vec &x) {
    Vec y(x.size());
    for (size_t i = 0; i < x.size(); ++i) y[i] = std::exp(x[i]);
    return y;
}

static Vec hadamard(const Vec &a, const Vec &b) {
    Vec y(a.size());
    for (size_t i = 0; i < a.size(); ++i) y[i] = a[i] * b[i];
    return y;
}

// A tiny linear layer: y = W x + b
struct Linear {
    Mat W; // (out x in)
    Vec b; // (out)
    Linear() {}
    Linear(size_t out_dim, size_t in_dim, std::mt19937 &rng) {
        std::normal_distribution<double> N(0.0, 0.1);
        W.assign(out_dim, Vec(in_dim));
        b.assign(out_dim, 0.0);
        for (size_t i = 0; i < out_dim; ++i) {
            for (size_t j = 0; j < in_dim; ++j) W[i][j] = N(rng);
            b[i] = N(rng);
        }
    }
    Vec forward(const Vec &x) const { return add(matvec(W, x), b); }
};

// Affine coupling layer with a binary mask: mask[i]=1 means x[i] is kept (a),
// 0 means transformed (b)
struct AffineCoupling {
    std::vector<int> mask; // size d
    // s(x_a) and t(x_a) parameterized as small linear layers for demo
    Linear s_lin;
    Linear t_lin;
    size_t d, da, db;

    AffineCoupling() : d(0), da(0), db(0) {}
    AffineCoupling(const std::vector<int> &mask_, std::mt19937 &rng) : mask(mask_) {
        d = mask.size();
        da = std::accumulate(mask.begin(), mask.end(), 0);
        db = d - da;
        s_lin = Linear(db, da, rng);
        t_lin = Linear(db, da, rng);
    }

    // Splits x into x_a (kept) and x_b (transformed) according to mask
    void split(const Vec &x, Vec &xa, Vec &xb) const {
        xa.clear(); xb.clear();
        for (size_t i = 0; i < d; ++i) {
            if (mask[i]) xa.push_back(x[i]); else xb.push_back(x[i]);
        }
    }

    // Merge (ya from kept, yb from transformed) back into full vector
    Vec merge(const Vec &ya, const Vec &yb) const {
        Vec y(d);
        size_t ia = 0, ib = 0;
        for (size_t i = 0; i < d; ++i) {
            if (mask[i]) { y[i] = ya[ia++]; } else { y[i] = yb[ib++]; }
        }
        return y;
    }

    // Forward: y = f(x); returns y and log|det J_f(x)|
    Vec forward(const Vec &x, double &logdet) const {
        Vec xa, xb; split(x, xa, xb);
        // Compute s and t from xa
        Vec s_raw = s_lin.forward(xa);
        // Clamp scale with tanh to avoid extreme exp; scale_factor can tune capacity
        const double scale_factor = 1.5;
        for (double &v : s_raw) v = scale_factor * std::tanh(v);
        Vec t = t_lin.forward(xa);
        // Transform xb -> yb
        Vec exp_s = exp_vec(s_raw);
        Vec yb = add(hadamard(xb, exp_s), t);
        Vec y = merge(xa, yb);
        // log|det J| = sum(s)
        logdet = std::accumulate(s_raw.begin(), s_raw.end(), 0.0);
        return y;
    }

    // Inverse: x = f^{-1}(y); returns x and log|det J_{f^{-1}}(y)| = -sum(s)
    Vec inverse(const Vec &y, double &logdet_inv) const {
        Vec ya, yb; split(y, ya, yb);
        Vec s_raw = s_lin.forward(ya);
        const double scale_factor = 1.5;
        for (double &v : s_raw) v = scale_factor * std::tanh(v);
        Vec t = t_lin.forward(ya);
        // Invert: xb = (yb - t) * exp(-s)
        Vec xb(yb.size());
        for (size_t i = 0; i < yb.size(); ++i)
            xb[i] = (yb[i] - t[i]) * std::exp(-s_raw[i]);
        Vec x = merge(ya, xb);
        logdet_inv = -std::accumulate(s_raw.begin(), s_raw.end(), 0.0);
        return x;
    }
};

// Simple flow: a stack of affine coupling layers
struct Flow {
    std::vector<AffineCoupling> layers;
    size_t d;
    Flow(size_t d_) : d(d_) {}

    // Forward pass: z -> x (sampling). Returns x and total log|det J_f(z)|
    Vec forward(const Vec &z, double &logdet) const {
        Vec h = z; logdet = 0.0;
        for (const auto &L : layers) {
            double ld = 0.0; h = L.forward(h, ld); logdet += ld;
        }
        return h;
    }

    // Inverse pass: x -> z (likelihood). Returns z and total log|det J_{f^{-1}}(x)|
    Vec inverse(const Vec &x, double &logdet_inv) const {
        Vec h = x; logdet_inv = 0.0;
        for (int i = (int)layers.size() - 1; i >= 0; --i) {
            double ld = 0.0; h = layers[i].inverse(h, ld); logdet_inv += ld;
        }
        return h;
    }
};

// Base distribution: standard Gaussian N(0, I)
double log_prob_standard_normal(const Vec &z) {
    const double LOG2PI = std::log(2.0 * std::acos(-1));
    double quad = 0.0; for (double v : z) quad += v*v;
    return -0.5 * (z.size() * LOG2PI + quad);
}

Vec sample_standard_normal(size_t d, std::mt19937 &rng) {
    std::normal_distribution<double> N(0.0, 1.0);
    Vec z(d); for (size_t i = 0; i < d; ++i) z[i] = N(rng);
    return z;
}

int main() {
    std::mt19937 rng(42);
    const size_t d = 4; // dimension

    // Build a flow with two coupling layers and complementary masks
    Flow flow(d);
    std::vector<int> mask1 = {1,1,0,0};
    std::vector<int> mask2 = {0,0,1,1};
    flow.layers.emplace_back(mask1, rng);
    flow.layers.emplace_back(mask2, rng);

    // Example: compute log-likelihood of a data point x
    Vec x = {0.5, -1.0, 0.2, 1.2};
    double logdet_inv = 0.0;
    Vec z = flow.inverse(x, logdet_inv); // x -> z
    double logpz = log_prob_standard_normal(z);
    double logpx = logpz + logdet_inv;   // change-of-variables

    std::cout << "z = ["; for (size_t i = 0; i < d; ++i) std::cout << z[i] << (i+1<d?", ":"]\n");
    std::cout << "log p(z) = " << logpz << "\n";
    std::cout << "sum log|det J_inv| = " << logdet_inv << "\n";
    std::cout << "log p(x) = " << logpx << "\n\n";

    // Example: sample from the model by pushing forward z -> x
    Vec z_samp = sample_standard_normal(d, rng);
    double logdet_fwd = 0.0;
    Vec x_samp = flow.forward(z_samp, logdet_fwd);
    std::cout << "sampled x = ["; for (size_t i = 0; i < d; ++i) std::cout << x_samp[i] << (i+1<d?", ":"]\n");
    std::cout << "sum log|det J_fwd| = " << logdet_fwd << "\n";

    return 0;
}
```
This program implements a minimal RealNVP-style affine coupling layer in C++. The layer keeps some coordinates fixed and uses them to compute per-dimension scale and shift for the remaining coordinates. The Jacobian is block-triangular, so the log-determinant is the sum of scale outputs. The Flow stacks two such layers with complementary masks, making the overall transform expressive and invertible. The code demonstrates both likelihood evaluation (x → z with sum of inverse log-determinants) and sampling (z → x with sum of forward log-determinants), using a standard Gaussian base.
```cpp
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <numeric>

using Vec = std::vector<double>;
using Mat = std::vector<std::vector<double>>;

struct TriangularLinear {
    size_t d;
    Mat L; // lower-triangular with ones on diagonal
    Mat U; // upper-triangular with positive diagonal

    TriangularLinear(size_t d_, std::mt19937 &rng) : d(d_) {
        std::normal_distribution<double> N(0.0, 0.2);
        L.assign(d, Vec(d, 0.0));
        U.assign(d, Vec(d, 0.0));
        for (size_t i = 0; i < d; ++i) {
            for (size_t j = 0; j < d; ++j) {
                if (i > j) L[i][j] = N(rng); // strictly lower
                if (i < j) U[i][j] = N(rng); // strictly upper
            }
            L[i][i] = 1.0; // unit diag
            U[i][i] = std::exp(std::tanh(N(rng))); // positive diag via exp(tanh(.))
        }
    }

    // y = L*(U*x)
    Vec forward(const Vec &x) const {
        Vec y(d, 0.0);
        // temp = U*x (upper-triangular matvec)
        Vec temp(d, 0.0);
        for (int i = (int)d - 1; i >= 0; --i) {
            double s = 0.0;
            for (size_t j = i; j < d; ++j) s += U[i][j] * x[j];
            temp[i] = s;
        }
        // y = L*temp (lower-triangular matvec)
        for (size_t i = 0; i < d; ++i) {
            double s = 0.0;
            for (size_t j = 0; j <= i; ++j) s += L[i][j] * temp[j];
            y[i] = s;
        }
        return y;
    }

    // x = (L*U)^{-1} y: solve L z = y (forward substitution), then U x = z (back substitution)
    Vec inverse(const Vec &y) const {
        // Solve L z = y
        Vec z(d, 0.0);
        for (size_t i = 0; i < d; ++i) {
            double s = y[i];
            for (size_t j = 0; j < i; ++j) s -= L[i][j] * z[j];
            // L[i][i] = 1.0
            z[i] = s;
        }
        // Solve U x = z
        Vec x(d, 0.0);
        for (int i = (int)d - 1; i >= 0; --i) {
            double s = z[i];
            for (size_t j = i + 1; j < d; ++j) s -= U[i][j] * x[j];
            x[i] = s / U[i][i];
        }
        return x;
    }

    // log|det(L*U)| = log|det L| + log|det U| = sum log diag(U) (since det L = 1)
    double logabsdet() const {
        double s = 0.0; for (size_t i = 0; i < d; ++i) s += std::log(U[i][i]);
        return s;
    }
};

// Base: standard Gaussian
double log_prob_standard_normal(const Vec &z) {
    const double LOG2PI = std::log(2.0 * std::acos(-1));
    double quad = 0.0; for (double v : z) quad += v*v;
    return -0.5 * (z.size() * LOG2PI + quad);
}

int main() {
    std::mt19937 rng(7);
    size_t d = 5;
    TriangularLinear layer(d, rng);

    // Likelihood of a data point x under the linear flow x = W z
    Vec x = {0.1, -0.3, 1.0, 0.7, -0.2};
    Vec z = layer.inverse(x);
    double logpz = log_prob_standard_normal(z);
    double logdet_inv = -layer.logabsdet(); // inverse Jacobian log-determinant
    double logpx = logpz + logdet_inv;

    std::cout << "log p(z) = " << logpz << "\n";
    std::cout << "log|det J_inv| = " << logdet_inv << "\n";
    std::cout << "log p(x) = " << logpx << "\n";

    // Sampling: z -> x, logdet forward is +log|det|
    Vec z_samp(d, 0.0);
    std::normal_distribution<double> N(0.0, 1.0);
    for (size_t i = 0; i < d; ++i) z_samp[i] = N(rng);
    Vec x_samp = layer.forward(z_samp);
    std::cout << "sample x[0] = " << x_samp[0] << " (of " << d << ")\n";

    return 0;
}
```
This example builds an invertible linear flow parameterized as a product of a unit-lower-triangular L and an upper-triangular U with positive diagonal. The forward map is y = L(Ux); inversion uses forward- then back-substitution. The log-determinant is just the sum of logs of U’s diagonal—computed in O(d). Although purely linear flows are limited in expressivity, this layer is a key building block (and mirrors LU/triangular tricks used in Glow for efficient channel mixing).