āš™ļøAlgorithmIntermediate

Markov Chain Monte Carlo (MCMC)

Key Points

  • MCMC builds a random walk (a Markov chain) whose long-run visiting frequency matches your target distribution, even when the target is only known up to a constant.
  • The Metropolis–Hastings (MH) algorithm accepts or rejects proposed moves so that detailed balance holds and the target becomes stationary.
  • Gibbs sampling updates one coordinate at a time from its conditional distribution, which can be simpler to draw from than the joint distribution.
  • You can compute expectations by averaging function values along the chain; normalization constants cancel out, so unnormalized densities are fine.
  • Good mixing requires well-tuned proposals (step sizes/scales); poor tuning leads to high autocorrelation and slow convergence.
  • Always work in log space for probabilities to avoid numerical underflow, and include proposal terms for asymmetric proposals.
  • Use diagnostics like trace plots, autocorrelation, and effective sample size (ESS) to check convergence and efficiency.
  • Per-iteration cost is usually cheap, but the number of iterations needed can grow quickly with dimension and target complexity.

Prerequisites

  • →Basic probability and random variables — Understanding distributions, expectations, and conditional probability is essential to define targets and conditionals for Gibbs.
  • →Markov chains — MCMC relies on stationarity, irreducibility, and aperiodicity to guarantee convergence.
  • →Calculus and multivariate Gaussians — Evaluating log densities and deriving conditionals (e.g., for the bivariate normal) requires these tools.
  • →Bayes’ theorem — Many MCMC applications target Bayesian posteriors known up to a normalizing constant.
  • →Numerical stability (log-sum-exp) — Working in log space prevents underflow when evaluating products or mixtures.
  • →C++ random number generation (std::mt19937, distributions) — Implementing proposals and conditional draws requires correct use of RNGs.
  • →Data transformations and Jacobians — Sampling in transformed spaces requires adjusting densities to maintain correctness.
  • →Basic statistics and covariance — Interpreting sample summaries, autocorrelation, and ESS depends on these concepts.

Detailed Explanation


01Overview

Markov Chain Monte Carlo (MCMC) is a family of algorithms for drawing samples from complex probability distributions—especially ones you can evaluate up to a constant but cannot sample from directly. The core idea is to construct a Markov chain whose long-run behavior (its stationary distribution) matches your target distribution. Once the chain has warmed up (burn-in), the states you visit can be treated as dependent samples from the target. By averaging functions over these samples, you approximate expectations like means, variances, and integrals required in Bayesian inference and statistical physics.

Why is this useful? Many real problems involve high-dimensional distributions with intractable normalizing constants. MCMC sidesteps exact normalization by using only ratios of densities. The two most common approaches are Metropolis–Hastings (MH), which proposes a move and accepts it with a carefully chosen probability, and Gibbs sampling, which updates one variable at a time from its conditional distribution. With proper conditions (irreducibility, aperiodicity), these chains converge to the target distribution.

Example at a glance: Suppose your target is a mixture of two Gaussians (a ā€œtwo-hillā€ landscape). Metropolis–Hastings proposes small moves left or right; if the new position is in a higher-density region, you likely accept; if not, you might still accept with some probability. Over time, the chain visits each hill in proportion to its probability mass, letting you estimate the mixture’s mean or other features.

02Intuition & Analogies

Hook: Imagine exploring a dark, hilly landscape with a flashlight, trying to map where people tend to hang out. You can’t see the whole map at once, but you can tell whether a spot is more or less crowded than where you are now.

Concept: MCMC is like walking through that landscape so that the time you spend in each area matches how crowded it is. You decide your next step using only local comparisons of ā€œcrowdednessā€ (probability density), not the whole map. In Metropolis–Hastings, you suggest a new spot (proposal). If it seems more crowded, you usually move there. If it’s less crowded, you might still move there, but less often. This balanced behavior ensures you don’t get stuck and that, in the long run, your footsteps represent the true crowd distribution. In Gibbs sampling, instead of jumping anywhere, you adjust one coordinate at a time using the exact rule for that coordinate given the others—like moving strictly north/south, then east/west, according to local road signs.

Example: Suppose your target distribution is two neighborhoods: one downtown and one in the suburbs. Using a random-walk MH, you take small steps. If you step toward downtown when it’s more crowded, you likely stay. If you wander into a quiet alley, you might backtrack. Over hours of walking, you’ll spend time downtown and in the suburbs in the correct proportions. If you use Gibbs sampling in a grid-like city, you move north–south according to how crowded that street is given your current east–west position, then switch to east–west. After many cycles, your path’s heatmap reflects the real population density.

03Formal Definition

Let (X_t)_{t ≄ 0} be a Markov chain on state space X with transition kernel K(x, A). A distribution Ļ€ on X is stationary if Ļ€(A) = ∫_X Ļ€(x) K(x, A) dx for all measurable sets A. MCMC constructs K so that a target distribution Ļ€ (often known only up to a constant) is stationary.

Metropolis–Hastings: Given current state x, propose y ∼ q(y | x). Accept y with probability α(x, y) = min(1, [Ļ€(y) q(x | y)] / [Ļ€(x) q(y | x)]); otherwise, stay at x. This yields a kernel satisfying detailed balance, Ļ€(x) K(x, y) = Ļ€(y) K(y, x), which implies stationarity of Ļ€. When q is symmetric (e.g., random-walk Gaussian), the q terms cancel.

Gibbs sampling: Suppose Ļ€(x_1, …, x_d) has tractable full conditionals Ļ€(x_i | x_{āˆ’i}). A Gibbs sweep updates coordinates sequentially (or in blocks), drawing X_i^{(t+1)} ∼ Ļ€( Ā· | X_1^{(t+1)}, …, X_{iāˆ’1}^{(t+1)}, X_{i+1}^{(t)}, …, X_d^{(t)}). Under mild conditions (irreducibility, aperiodicity), the chain is ergodic: time averages converge to expectations under Ļ€.

Estimation: For integrable f, the MCMC estimator μ̂_n = (1/n) Ī£_{t=1}^{n} f(X_t) satisfies a law of large numbers: μ̂_n → E_Ļ€[f(X)] as n → āˆž. Variance is inflated by autocorrelation; an effective sample size quantifies the information loss due to dependence.
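The MH acceptance rule translates directly into log space. A minimal sketch (the function name and interface here are illustrative, not from the original code):

```cpp
#include <bits/stdc++.h>

// Log acceptance probability for a (possibly asymmetric) proposal:
// log α(x, y) = min(0, log Ļ€(y) āˆ’ log Ļ€(x) + log q(x|y) āˆ’ log q(y|x)).
// For a symmetric proposal the two q terms cancel.
double log_accept(double log_pi_x, double log_pi_y,
                  double log_q_x_given_y, double log_q_y_given_x) {
    double log_ratio = (log_pi_y - log_pi_x)
                     + (log_q_x_given_y - log_q_y_given_x);
    return std::min(0.0, log_ratio);
}
```

Comparing log(u) for u ∼ Uniform(0,1) against this value implements the accept/reject decision without ever exponentiating tiny probabilities.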

04When to Use

Use MCMC when you need samples or expectations from complex, high-dimensional distributions that you can evaluate up to a constant but cannot sample from directly. Classic scenarios include:

  • Bayesian inference: posterior distributions with intractable normalizers (e.g., hierarchical models, mixture models, logistic regression with priors).
  • Physics and ML: Boltzmann distributions, energy-based models where only an unnormalized energy E(x) is available.
  • Probabilistic integration: computing E_Ļ€[f(X)] for expensive black-box likelihoods.
  • Models with convenient conditionals: if full conditionals are simple (e.g., conjugate), Gibbs sampling can be very efficient.

Avoid or be cautious when:

  • Low dimensions with available direct samplers or quadrature; MCMC overhead is unnecessary.
  • Severe multimodality with poor proposals; chains can get stuck in one mode—consider tempered transitions or multiple chains.
  • Strong correlations or ill-scaled parameters; reparameterize, precondition, or use adaptive proposals.
  • You need independent samples very quickly; MCMC samples are dependent and may have high autocorrelation.

Practical cues: If you can code the log of the target density and its gradients are hard or unavailable, MH (random-walk or independence) is often a good starting point. If each coordinate’s conditional is available, prefer Gibbs or Metropolis-within-Gibbs.
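Where only some coordinates have tractable conditionals, a single-coordinate MH move can stand in for an exact Gibbs draw. A sketch of one Metropolis-within-Gibbs update, assuming a symmetric Gaussian perturbation (the interface is illustrative; `log_target` is any user-supplied unnormalized log density over the full vector):

```cpp
#include <bits/stdc++.h>

// One Metropolis-within-Gibbs step on coordinate i: perturb x[i] only and
// accept with the usual MH ratio. Symmetric proposal, so the q terms cancel.
template <class LogTarget, class RNG>
bool mwg_step(std::vector<double>& x, std::size_t i, double step,
              LogTarget log_target, RNG& rng) {
    std::normal_distribution<double> noise(0.0, step);
    std::uniform_real_distribution<double> unif01(0.0, 1.0);
    double old_xi = x[i];
    double logp_old = log_target(x);
    x[i] = old_xi + noise(rng);            // propose a move in coordinate i
    double logp_new = log_target(x);
    if (std::log(unif01(rng)) < logp_new - logp_old) return true;  // accept
    x[i] = old_xi;                          // reject: restore old value
    return false;
}
```

Cycling i over all coordinates gives one full sweep; coordinates with exact conditionals can use Gibbs draws instead within the same sweep.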

āš ļøCommon Mistakes

  • Ignoring proposal asymmetry: For non-symmetric proposals, forgetting the q-terms in the MH acceptance ratio biases the chain. Always use α(x,y) = min(1, [Ļ€(y) q(x|y)] / [Ļ€(x) q(y|x)]).
  • Not using log space: Multiplying small probabilities underflows to zero. Compute with log densities and use log-sum-exp for mixtures.
  • Poor tuning of step size/scale: Too small leads to high acceptance but slow exploration (high autocorrelation). Too large yields very low acceptance and many repeats. Tune to moderate acceptance (about 0.2–0.5 for random-walk MH; in high dimensions, ~0.234 is a common heuristic).
  • No diagnostics: Failing to check trace plots, autocorrelation, R-hat, or ESS can hide non-convergence or poor mixing. Run multiple chains from dispersed starts when possible.
  • Insufficient burn-in: Early samples may reflect the start, not the target. Discard an initial portion after checking stability of summaries.
  • Over-thinning: Thinning rarely increases statistical efficiency; it can reduce storage but throws away information. Prefer keeping all post–burn-in samples and using ESS.
  • Ignoring constraints and transformations: If parameters are constrained (e.g., positive), either propose in a transformed space (log, logit) and include Jacobians when needed, or reflect proposals at boundaries.
  • Bad random seeding or reusing RNGs incorrectly: Non-random or repeated seeds and shared RNGs across threads can produce artifacts. Initialize RNGs carefully and consider independent streams for parallel chains.
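For the constraints-and-transformations point above, a minimal sketch of handling a positive parameter by sampling in log space (the Exponential(1) target is a made-up stand-in): with u = log θ, the transformed log density picks up the log-Jacobian log|dĪø/du| = u.

```cpp
#include <bits/stdc++.h>

// Unnormalized log target on θ > 0; Exponential(1) used as an example here.
double log_target_theta(double theta) { return -theta; }

// Same target expressed in the unconstrained variable u = log θ:
// Ļ€_u(u) = Ļ€(e^u) Ā· e^u, so the log density gains a "+ u" Jacobian term.
double log_target_u(double u) {
    return log_target_theta(std::exp(u)) + u;
}
```

Running a random-walk MH on u needs no boundary handling; forgetting the "+ u" term would silently sample from the wrong distribution.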

Key Formulas

Stationarity condition

∫_X Ļ€(x) K(x, A) dx = Ļ€(A)

Explanation: A distribution is stationary if, after one Markov transition, the probability of landing in any set A equals its original probability. This invariance ensures the chain’s long-run behavior matches the target.
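As a toy illustration (the numbers are chosen for this example), a two-state chain with transition matrix K = [[0.9, 0.1], [0.2, 0.8]] has stationary distribution Ļ€ = (2/3, 1/3): applying one transition to Ļ€ returns Ļ€, the discrete analogue of the integral condition above.

```cpp
#include <bits/stdc++.h>

// Apply one transition of a 2-state kernel to a distribution: out = Ļ€ K.
// With K = [[0.9, 0.1], [0.2, 0.8]] and Ļ€ = (2/3, 1/3), the output equals Ļ€.
std::array<double, 2> step_distribution(const std::array<double, 2>& pi,
                                        const double K[2][2]) {
    return { pi[0] * K[0][0] + pi[1] * K[1][0],
             pi[0] * K[0][1] + pi[1] * K[1][1] };
}
```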

Detailed balance

Ļ€(x) K(x, y) = Ļ€(y) K(y, x)

Explanation: If this symmetry holds for all x and y, then π is stationary for K. Many MCMC algorithms enforce detailed balance to guarantee correctness.

MH acceptance probability

α(x, y) = min(1, [Ļ€(y) q(x | y)] / [Ļ€(x) q(y | x)])

Explanation: This is the acceptance probability for a move from x to y with proposal q. It corrects the proposal to target π, allowing unnormalized π because the unknown constant cancels in the ratio.

Random-walk proposal

q_RWM(y | x) = N(y; x, σ² I_d)

Explanation: A common symmetric proposal centered at the current state. Symmetry implies q(y|x)=q(x|y), simplifying the MH ratio to a target-density ratio.

MCMC estimator and LLN

μ̂_n = (1/n) Ī£_{t=1}^{n} f(X_t), and μ̂_n → E_Ļ€[f(X)] almost surely as n → āˆž

Explanation: Average function values along the chain to estimate expectations. Under ergodicity, these time averages converge almost surely to the true expectation.

Integrated autocorrelation time and ESS

Ļ„_int = 1 + 2 Ī£_{k=1}^{āˆž} ρ_k,  ESS ā‰ˆ n / Ļ„_int

Explanation: Autocorrelation inflates variance of MCMC averages. The integrated autocorrelation time summarizes dependence and yields an effective sample size smaller than n.
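A simplified ESS estimator along these lines (truncating the sum at the first non-positive autocorrelation; production code such as Geyer's initial-sequence method uses more careful truncation rules):

```cpp
#include <bits/stdc++.h>

// Estimate Ļ„_int by summing sample autocorrelations ρ_k until they first drop
// to zero or below, then return ESS ā‰ˆ n / Ļ„_int. A rough sketch, not a
// production diagnostic.
double ess_estimate(const std::vector<double>& x) {
    std::size_t n = x.size();
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;
    double c0 = 0.0;  // lag-0 autocovariance
    for (double v : x) c0 += (v - mean) * (v - mean);
    c0 /= n;
    double tau = 1.0;
    for (std::size_t k = 1; k < n; ++k) {
        double ck = 0.0;
        for (std::size_t t = 0; t + k < n; ++t)
            ck += (x[t] - mean) * (x[t + k] - mean);
        ck /= n;
        double rho = ck / c0;
        if (rho <= 0.0) break;  // simple truncation rule
        tau += 2.0 * rho;
    }
    return n / tau;
}
```

For a chain with no positive autocorrelation the estimate equals n; positive lag-1 correlation shrinks it below n.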

Log-sum-exp

log(Ī£_{i=1}^{m} e^{a_i}) = m* + log(Ī£_{i=1}^{m} e^{a_i āˆ’ m*}),  m* = max_i a_i

Explanation: A numerically stable way to sum exponentials in log space, crucial for mixture targets or sums of likelihood terms in MH.
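A general m-term version of this identity (the MH code example later on this page uses a two-term variant):

```cpp
#include <bits/stdc++.h>

// Stable log-sum-exp over m terms: subtract the max before exponentiating so
// no term overflows or underflows, matching the identity above.
double logsumexp(const std::vector<double>& a) {
    double m = *std::max_element(a.begin(), a.end());
    double s = 0.0;
    for (double ai : a) s += std::exp(ai - m);
    return m + std::log(s);
}
```

Even inputs around āˆ’1000, which would underflow to zero if exponentiated directly, are handled exactly.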

Roberts–Gelman–Gilks heuristic

σ_RWM ā‰ˆ 2.38 / √d

Explanation: For high-dimensional Gaussian targets with random-walk MH, scaling the proposal standard deviation as 2.38/√d often yields near-optimal acceptance around 0.234.
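As a one-line helper (the function name is illustrative):

```cpp
#include <bits/stdc++.h>

// Roberts-Gelman-Gilks proposal scale for a d-dimensional target: 2.38 / sqrt(d).
double rwm_scale(int d) { return 2.38 / std::sqrt(static_cast<double>(d)); }
```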

Gibbs full conditional

Ļ€(x_i | x_{āˆ’i}) āˆ Ļ€(x_1, …, x_d)

Explanation: In Gibbs sampling you draw from the conditional distribution of one coordinate given the rest, which is proportional to the joint density viewed as a function of that coordinate.

Complexity Analysis

Per-iteration time complexity in MCMC is typically dominated by evaluating the (log) target density and sampling from the proposal or conditionals. If the target's log-density costs O(C(d)) for d-dimensional x, then one Metropolis–Hastings (MH) step is O(C(d)) plus the cost of proposing from q (usually O(d) for Gaussian proposals). In Gibbs sampling, a full sweep over all coordinates draws d conditionals; when each conditional is simple (e.g., Normal), a sweep is O(d). When log-densities include N data points (e.g., Bayesian models), C(d) often scales as O(Nd) or O(N) depending on structure, so the per-iteration cost is O(N) in many applications.

Space complexity is O(1) to run the chain (you only need the current state and a few scalars), and O(Md) if you store M post–burn-in samples for later analysis. In practice you may compute running summaries to reduce storage.

Convergence complexity (how many iterations are needed) is problem-dependent and not captured by big-O alone. Mixing time can grow quickly with dimension, strong correlations, or multimodality. Random-walk MH can suffer from high autocorrelation unless the proposal is well scaled (roughly O(d) steps to decorrelate along each axis). Blocked updates, reparameterization, or adaptive proposals reduce this burden.

For the provided code: (1) the 1D MH on a Gaussian mixture has O(1) per step for the log-density and proposal, so overall O(T) time for T iterations and O(M) space to store M kept samples. (2) The Gibbs sampler for a bivariate normal performs two Normal draws per iteration, also O(1) per step, so overall O(T) time and O(M) space. Always include the cost of burn-in and potential thinning when estimating total runtime.

Code Examples

Metropolis–Hastings for a 1D Gaussian Mixture (random-walk proposal)
#include <bits/stdc++.h>
using namespace std;

// Exact log N(x; mu, sigma^2), including the -0.5*log(2*pi) constant
double log_normal_pdf(double x, double mu, double sigma) {
    static const double LOG_SQRT_2PI = 0.5 * log(2.0 * M_PI);
    double z = (x - mu) / sigma;
    return -LOG_SQRT_2PI - log(sigma) - 0.5 * z * z;
}

// Stable log-sum-exp for two terms
inline double logsumexp2(double a, double b) {
    double m = max(a, b);
    return m + log(exp(a - m) + exp(b - m));
}

// Target: mixture 0.3*N(-3,1^2) + 0.7*N(2,0.5^2)
double log_target(double x) {
    double a = log(0.3) + log_normal_pdf(x, -3.0, 1.0);
    double b = log(0.7) + log_normal_pdf(x, 2.0, 0.5);
    return logsumexp2(a, b); // exact log density of the mixture
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // RNG setup (fixed seed for reproducibility)
    std::mt19937_64 rng(42);
    std::normal_distribution<double> std_normal(0.0, 1.0);
    std::uniform_real_distribution<double> unif01(0.0, 1.0);

    // MH parameters
    const int total_iters = 50000;     // total iterations including burn-in
    const int burn_in = 5000;          // discard initial steps
    const int thin = 10;               // keep every 'thin'-th sample after burn-in
    const double proposal_sigma = 1.0; // random-walk step size

    // Initialize chain
    double x = 0.0; // starting point
    double logp = log_target(x);

    int accepted = 0;
    vector<double> samples;
    samples.reserve((total_iters - burn_in) / max(1, thin));

    for (int t = 1; t <= total_iters; ++t) {
        // Propose y = x + Normal(0, proposal_sigma^2)
        double y = x + proposal_sigma * std_normal(rng);
        double logp_y = log_target(y);

        // Acceptance probability (symmetric proposal, so q cancels)
        double log_alpha = logp_y - logp; // accept with prob min(1, exp(log_alpha))
        bool accept = false;
        if (log_alpha >= 0.0) {
            accept = true; // always accept upward moves
        } else {
            double u = unif01(rng);
            accept = (log(u) < log_alpha);
        }

        if (accept) {
            x = y;
            logp = logp_y;
            ++accepted;
        }

        // Record sample after burn-in with thinning
        if (t > burn_in && ((t - burn_in) % thin == 0)) {
            samples.push_back(x);
        }
    }

    // Basic summaries: sample mean and variance
    double mean = 0.0;
    for (double v : samples) mean += v;
    mean /= samples.size();

    double var = 0.0;
    for (double v : samples) var += (v - mean) * (v - mean);
    var /= (samples.size() - 1);

    double acc_rate = static_cast<double>(accepted) / total_iters;

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Kept samples: " << samples.size() << "\n";
    cout << "Acceptance rate: " << acc_rate << "\n";
    cout << "Sample mean: " << mean << "\n";
    cout << "Sample variance: " << var << "\n";

    // Print a few samples to inspect multimodality
    cout << "First 10 kept samples:" << "\n";
    for (size_t i = 0; i < min<size_t>(10, samples.size()); ++i) {
        cout << samples[i] << (i + 1 == min<size_t>(10, samples.size()) ? '\n' : ' ');
    }

    return 0;
}

This program implements a random-walk Metropolis–Hastings sampler for a 1D mixture of two Gaussians. The target is evaluated in log space, using a log-sum-exp trick to stabilize the mixture density. At each step, we propose a Normal perturbation of the current state and accept with probability min(1, exp(logp_new āˆ’ logp_old)). After burn-in and thinning, we report the acceptance rate and basic summaries. Because the proposal is symmetric, the proposal terms cancel in the MH ratio.

Time: O(T), where T is the number of MH iterations (each step uses O(1) work in 1D). Space: O(M) to store M kept samples; O(1) otherwise.
Gibbs Sampling for a Correlated Bivariate Normal
#include <bits/stdc++.h>
using namespace std;

// Target: (X, Y) ~ N((0, 0), Sigma) with Sigma = [[1, rho], [rho, 1]]
// Full conditionals: X|Y=y ~ N(rho*y, 1 - rho^2), Y|X=x ~ N(rho*x, 1 - rho^2)

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    double rho = 0.8; // correlation
    if (rho <= -1.0 || rho >= 1.0) {
        cerr << "rho must be in (-1,1)" << '\n';
        return 1;
    }

    double var_cond = 1.0 - rho * rho;
    double sd_cond = sqrt(var_cond);

    // RNG
    std::mt19937_64 rng(123);
    std::normal_distribution<double> std_normal(0.0, 1.0);

    // Gibbs parameters
    const int total_iters = 50000;
    const int burn_in = 5000;
    const int thin = 10;

    // Initialize
    double x = 0.0, y = 0.0;

    vector<pair<double,double>> samples;
    samples.reserve((total_iters - burn_in) / max(1, thin));

    for (int t = 1; t <= total_iters; ++t) {
        // Sample X | Y=y
        double mean_x = rho * y;
        x = mean_x + sd_cond * std_normal(rng);

        // Sample Y | X=x
        double mean_y = rho * x;
        y = mean_y + sd_cond * std_normal(rng);

        if (t > burn_in && ((t - burn_in) % thin == 0)) {
            samples.emplace_back(x, y);
        }
    }

    // Compute sample means and covariance
    double mx = 0.0, my = 0.0;
    for (auto &p : samples) { mx += p.first; my += p.second; }
    mx /= samples.size();
    my /= samples.size();

    double sxx = 0.0, syy = 0.0, sxy = 0.0;
    for (auto &p : samples) {
        double dx = p.first - mx;
        double dy = p.second - my;
        sxx += dx * dx;
        syy += dy * dy;
        sxy += dx * dy;
    }
    sxx /= (samples.size() - 1);
    syy /= (samples.size() - 1);
    sxy /= (samples.size() - 1);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Kept samples: " << samples.size() << "\n";
    cout << "Sample mean (x, y): (" << mx << ", " << my << ")\n";
    cout << "Sample var(x): " << sxx << ", var(y): " << syy << "\n";
    cout << "Sample cov(x,y): " << sxy << " (target cov: " << rho << ")\n";

    // Print a few samples
    cout << "First 5 kept samples (x, y):\n";
    for (size_t i = 0; i < min<size_t>(5, samples.size()); ++i) {
        cout << samples[i].first << " " << samples[i].second << "\n";
    }

    return 0;
}

This program performs Gibbs sampling for a bivariate normal with correlation rho. Because the full conditionals are Normal with known means and variances, each Gibbs step is just drawing from two univariate Normals. After burn-in and thinning, the empirical covariance should be close to rho, demonstrating correct sampling from the target.

Time: O(T), where each iteration performs O(1) sampling and arithmetic. Space: O(M) to store M kept (x, y) samples; O(1) otherwise.
#mcmc#metropolis-hastings#gibbs sampling#markov chain#stationary distribution#proposal distribution#acceptance ratio#autocorrelation#effective sample size#burn-in#thinning#bayesian inference#random-walk#detailed balance#target distribution