Algorithm · Intermediate

Metropolis-Hastings Algorithm

Key Points

  • Metropolis–Hastings is a clever accept/reject method that lets you sample from complex probability distributions using only an unnormalized density.
  • You propose a move using an easier distribution q and accept it with probability α = min(1, π(x′) q(x∣x′) / (π(x) q(x′∣x))).
  • If the proposal is symmetric (q(x′∣x) = q(x∣x′)), the acceptance ratio simplifies to α = min(1, π(x′)/π(x)).
  • Using log-densities prevents numerical underflow and makes the algorithm more stable in practice.
  • The method constructs a Markov chain whose stationary distribution is the target π, thanks to detailed balance.
  • Tuning the proposal scale is crucial: too small gives slow exploration; too large gives low acceptance.
  • Burn-in, thinning, and monitoring autocorrelation help turn the raw chain into reliable estimates.
  • Time per step is dominated by evaluating the target (and proposal), giving overall time roughly O(T) for T iterations, but effective sample size depends on mixing.

Prerequisites

  • Probability densities and mass functions — You must understand what a distribution is, how to evaluate it (possibly up to a constant), and the meaning of support.
  • Markov chains — MH constructs a Markov chain; concepts like transition kernels, stationary distribution, and ergodicity are essential.
  • Logarithms and exponential functions — Acceptance probabilities are computed in the log domain to avoid numerical issues.
  • Random number generation — You need to sample from proposals and uniforms for accept/reject; understanding RNG seeding ensures reproducibility.
  • Basic calculus and linear algebra (for multivariate targets) — Evaluating multivariate densities and tuning covariance structures often requires this background.
  • Bayes’ theorem (for Bayesian applications) — Posterior targets are defined by likelihood × prior up to a normalizing constant.

Detailed Explanation


01 Overview

The Metropolis–Hastings (MH) algorithm is a foundational technique in Markov Chain Monte Carlo (MCMC) used to draw samples from probability distributions that are hard to sample from directly. Instead of generating independent samples, MH builds a Markov chain whose long-run behavior (its stationary distribution) matches the target distribution π you care about—often known only up to a constant of proportionality. At each step, the algorithm proposes a new point using a simpler distribution q, then decides whether to accept or reject the move using an acceptance probability chosen to make the chain spend time in regions proportional to π. This accept/reject rule corrects for any bias introduced by the proposal mechanism, allowing even asymmetric proposals to work correctly. Because you only need to evaluate the target density up to a constant, MH is widely used in Bayesian statistics, physics, and optimization-like settings where normalizing constants are unknown or expensive to compute. While conceptually simple, practical performance depends on careful tuning of the proposal distribution and diagnostics to assess convergence and mixing. MH’s flexibility allows it to tackle high-dimensional and multi-modal targets, though alternative MCMC methods may outperform it in some challenging scenarios.

02 Intuition & Analogies

Imagine hiking a landscape at night with a headlamp. The landscape’s height at position x represents how plausible that x is under the target distribution π(x)—higher means more likely. You can’t see the entire map, and you don’t know how it’s globally normalized, but you can measure the relative height at any single spot you shine your light on (evaluate π(x) up to a constant). At each step, you suggest a nearby place to move (the proposal q), like taking a stride of a certain length in a random direction. If the new place is higher (more plausible), you gladly move there. If it’s lower, you sometimes still move there, with a chance that depends on how much lower it is. This occasional willingness to go downhill is critical: it keeps you from getting stuck on local peaks and ensures fair exploration. The accept/reject rule is like a fairness auditor—it adjusts your willingness to move based on how you chose the step. If your stepping rule is biased (asymmetric), the auditor compensates so your long-term walk still spends the right fraction of time in each region of the landscape. Over time, your footsteps trace out a path that visits places in proportion to their plausibility under π. Averaging your visited points (or functions of them) then gives accurate estimates of expectations with respect to the target distribution, without ever needing its normalizing constant.

03 Formal Definition

Let the target distribution be π(x) on state space X, known up to a constant: π(x) ∝ f(x). Given current state x, propose x′ ~ q(x′∣x). Define the Metropolis–Hastings acceptance probability as α(x, x′) = min(1, π(x′) q(x∣x′) / (π(x) q(x′∣x))). With probability α(x, x′), move to x′; otherwise remain at x. This defines a Markov chain with transition kernel P(x, dy) = q(y∣x) α(x, y) dy + r(x) δ_x(dy), where r(x) = 1 − ∫ q(y∣x) α(x, y) dy and δ_x is a point mass at x. The chain satisfies detailed balance with respect to π: for all x ≠ y, π(x) q(y∣x) α(x, y) = π(y) q(x∣y) α(y, x). Hence π is stationary: ∫ π(x) P(x, dy) dx = π(dy). When q is symmetric, q(x′∣x) = q(x∣x′), the acceptance simplifies to α(x, x′) = min(1, π(x′)/π(x)). Under mild conditions (irreducibility, aperiodicity), the chain is ergodic: empirical averages (1/N) Σ_{t=1}^{N} f(X_t) converge to E_π[f(X)].
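The accept/reject rule just defined can be written as a single reusable transition step. A minimal sketch (function and parameter names are illustrative; the proposal is a symmetric Gaussian, so the q terms cancel):

```cpp
#include <algorithm>
#include <cmath>
#include <random>

// One MH transition: propose from a symmetric Gaussian q, accept with
// probability min(1, pi(x')/pi(x)), computed in the log domain.
// log_pi is the unnormalized log-target; logp_x caches log pi(x).
template <typename LogTarget, typename RNG>
double mh_step(double x, double& logp_x, double step, LogTarget log_pi, RNG& rng) {
    std::normal_distribution<double> noise(0.0, step);      // symmetric q(x'|x)
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    double x_prop = x + noise(rng);                         // propose x'
    double logp_prop = log_pi(x_prop);
    double log_alpha = std::min(0.0, logp_prop - logp_x);   // q terms cancel

    if (std::log(unif(rng)) < log_alpha) {                  // accept: move to x'
        logp_x = logp_prop;
        return x_prop;
    }
    return x;                                               // reject: stay at x
}
```

Iterating mh_step and recording the visited states yields the chain; caching logp_x means only one new target evaluation per step.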

04 When to Use

Use Metropolis–Hastings when you need samples from a distribution that is hard to sample from directly but is easy to evaluate up to a constant. Common cases include Bayesian posterior distributions where the evidence (normalizing constant) is unknown, models with complex likelihoods, and high-dimensional integrals where quadrature is impossible. Choose MH when you can design a reasonable proposal q: for local exploration, a random-walk proposal is simple and robust; for heavy-tailed or constrained domains (e.g., positive parameters), asymmetric proposals like log-normal or Gamma proposals can be better. MH is also useful as a building block inside larger samplers (e.g., within Gibbs steps) or for discrete targets where neighborhood proposals are natural. Avoid plain random-walk MH when the target is very high-dimensional and strongly correlated—gradient-based samplers (e.g., Hamiltonian Monte Carlo) or adaptive/blocked proposals may mix faster. For multi-modal targets with widely separated modes, consider tempered transitions or multiple chains to improve exploration. Always plan for burn-in, tuning, and convergence diagnostics when using MH in practice.

⚠️Common Mistakes

  • Ignoring proposal asymmetry: Forgetting the q(x|x')/q(x'|x) factor leads to biased sampling. Always include the Hastings correction unless the proposal is provably symmetric.
  • Working in probability instead of log-probability: Multiplying tiny numbers underflows to zero. Compute in the log domain and compare log u with log α to avoid numerical issues.
  • Poor proposal scale: Steps that are too small give high acceptance but slow exploration; too large give low acceptance and frequent rejections. Tune the scale to achieve reasonable acceptance (often 0.2–0.5 for random-walk proposals, problem-dependent).
  • Violating support constraints: Proposing negative values for parameters that must be positive (e.g., variances) leads to invalid evaluations. Use constrained proposals (e.g., log-space random walk) or reflect/reject invalid proposals carefully.
  • Insufficient burn-in and diagnostics: Using early samples before the chain forgets its start can bias estimates. Discard burn-in, check trace plots and autocorrelation, and run multiple chains if possible.
  • Storing every sample: Keeping all T states can be memory-heavy. Either stream estimates online or thin judiciously; better yet, store summaries and periodic checkpoints.
  • Bad seeding or reused RNG streams: For reproducibility and independence across runs, manage seeds and do not reuse overlapping random streams.
  • Not checking for NaN/Inf: Propagating invalid log-probabilities silently breaks the chain. Guard against domain errors and return -inf for impossible states.
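The last two points can be handled in one place: make the log-target return −∞ for any state outside the support (or any NaN), so log α = −∞ and the accept test log u < log α always rejects. A minimal sketch with an illustrative Exponential(rate) target:

```cpp
#include <cmath>
#include <limits>

// Support-aware unnormalized log-target for an Exponential(rate) example.
// Any state outside (0, inf), or a NaN, maps to -inf; the log-domain accept
// test log(u) < log_alpha then rejects the proposal automatically.
inline double log_target_exponential(double x, double rate) {
    if (!(x > 0.0) || !std::isfinite(x))  // NaN also fails x > 0.0
        return -std::numeric_limits<double>::infinity();
    return std::log(rate) - rate * x;     // log( rate * exp(-rate * x) )
}
```

Returning −∞ (rather than throwing or producing NaN) keeps the sampler loop branch-free: impossible states simply never get accepted.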

Key Formulas

Metropolis–Hastings Acceptance

α(x, x′) = min(1, π(x′) q(x∣x′) / (π(x) q(x′∣x)))

Explanation: The probability of accepting a proposed move from x to x' depends on how well x' fits the target and how the proposal treats forward and reverse moves. This corrects for asymmetric proposals to ensure the Markov chain has the desired stationary distribution.

Symmetric Proposal Simplification

α(x, x′) = min(1, π(x′) / π(x))   if q(x′∣x) = q(x∣x′)

Explanation: When the proposal is symmetric (e.g., Gaussian random walk), the q terms cancel. Acceptance depends only on the target density ratio.

MH Transition Kernel

P(x, dy) = q(y∣x) α(x, y) dy + r(x) δ_x(dy),   r(x) = 1 − ∫ q(y∣x) α(x, y) dy

Explanation: The next state is either the proposed y with acceptance α or stays at x with the remaining probability r(x). This kernel defines the Markov chain.

Detailed Balance for MH

π(x)q(y∣x)α(x,y)=π(y)q(x∣y)α(y,x)

Explanation: This equality ensures reversibility and guarantees that π is stationary for the Metropolis–Hastings chain.

Stationarity

∫π(x)P(x,dy)dx=π(dy)

Explanation: Applying the transition kernel to a π-distributed state leaves the distribution unchanged. This is the key correctness property of MH.

Log-Domain Acceptance

log α(x, x′) = min(0, log π(x′) − log π(x) + log q(x∣x′) − log q(x′∣x))

Explanation: Computing in the log domain avoids underflow/overflow when densities are very small. You can compare log u with log α instead of forming α directly.
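A concrete illustration of why this matters (magnitudes are illustrative): log-densities near −800 underflow to zero in linear space, while the log-domain acceptance stays exact.

```cpp
#include <algorithm>
#include <cmath>

// Log acceptance probability for a symmetric proposal, given cached
// log-densities. Safe even when exp(log_p) would underflow to 0, where a
// linear-space ratio pi(x')/pi(x) would be the meaningless 0/0.
inline double log_accept_symmetric(double log_p_current, double log_p_proposed) {
    return std::min(0.0, log_p_proposed - log_p_current);
}
```

Comparing log u against this value, as in the code examples below, never forms the underflowing densities at all.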

MCMC Law of Large Numbers

μ̂_N = (1/N) Σ_{t=1}^{N} f(X_t) → E_π[f(X)]   (almost surely)

Explanation: Sample averages along the chain converge almost surely to the true expectation under π, assuming ergodicity. This justifies using MH samples for estimation.

Effective Sample Size

ESS ≈ N / (1 + 2 Σ_{k=1}^{∞} ρ_k)

Explanation: Autocorrelation reduces effective information. This approximation shows how lag-k autocorrelations ρ_k inflate variance and shrink the equivalent number of independent samples.
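The formula can be turned into a rough chain diagnostic by summing sample autocorrelations until they first become non-positive — a simple truncation heuristic, not a production estimator (tools like Stan use more careful initial-sequence methods). A sketch:

```cpp
#include <cmath>
#include <vector>

// Rough ESS estimate: N / (1 + 2 * sum of positive lag-k autocorrelations),
// truncating the sum at the first non-positive lag. Illustrative only.
double effective_sample_size(const std::vector<double>& x) {
    const size_t N = x.size();
    if (N < 2) return static_cast<double>(N);
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= N;
    double var = 0.0;
    for (double v : x) var += (v - mean) * (v - mean);
    var /= N;
    if (var == 0.0) return 1.0;            // constant chain: one "sample"
    double sum_rho = 0.0;
    for (size_t k = 1; k < N; ++k) {
        double c = 0.0;
        for (size_t t = 0; t + k < N; ++t)
            c += (x[t] - mean) * (x[t + k] - mean);
        double rho = (c / N) / var;        // lag-k autocorrelation
        if (rho <= 0.0) break;             // truncate at first non-positive lag
        sum_rho += rho;
    }
    return N / (1.0 + 2.0 * sum_rho);
}
```

For a well-mixed chain the estimate approaches N; for a sticky chain it can be orders of magnitude smaller.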

Complexity Analysis

Let T be the number of MH iterations, d the dimension, and assume evaluating the unnormalized log-target costs C_π(d). Each iteration performs: (1) one proposal draw from q, (2) evaluation of log π at the proposed point (and sometimes at the current point if not cached), and (3) possibly evaluation of log q terms if the proposal is asymmetric. Thus, the per-iteration time is O(C_π(d) + C_q(d)). For common choices like Gaussian random-walk proposals, C_q(d) is O(d). If you cache log π at the current state, each step typically needs only one new π evaluation. Overall runtime is O(T · (C_π(d) + C_q(d))). Space usage is O(1) if you stream computations (keep only the current state and running summaries), and O(T · d) if you store the full chain. In practice, statistical efficiency matters: due to autocorrelation, the effective sample size (ESS) is smaller than T. If the autocorrelation at lag k is ρ_k, the variance of Monte Carlo estimates inflates by approximately (1 + 2∑_{k≥1} ρ_k), so the ESS is roughly T/(1 + 2∑ ρ_k). This means that two algorithms with the same per-iteration cost can differ greatly in effective cost per independent draw depending on mixing. Burn-in further reduces usable samples. Proposal tuning (step size, covariance) and problem structure (e.g., strong correlations) can make C_π dominate, so profiling the target evaluation and using gradients or block updates (when available) can substantially improve overall efficiency.
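The O(1)-space streaming option mentioned above can be done with Welford's online algorithm — update running summaries per kept sample and never store the chain. A minimal sketch:

```cpp
#include <cmath>

// O(1)-space running mean/variance via Welford's algorithm. Feed each kept
// MH sample to update(); no chain storage is needed for these summaries.
struct RunningStats {
    long long n = 0;
    double mean = 0.0, m2 = 0.0;
    void update(double x) {
        ++n;
        double d = x - mean;
        mean += d / n;
        m2 += d * (x - mean);  // uses the already-updated mean
    }
    double variance() const { return n > 1 ? m2 / n : 0.0; } // population variance
};
```

This is also more numerically stable than accumulating Σx and Σx² directly, which can lose precision catastrophically for long chains.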

Code Examples

Random-walk Metropolis for a 1D multimodal target (symmetric proposal)
#include <bits/stdc++.h>
using namespace std;

// Compute log-sum-exp of two log-values safely
static inline double logsumexp(double a, double b) {
    if (a > b) return a + log1p(exp(b - a));
    else return b + log1p(exp(a - b));
}

// Log-density of Normal(mu, sigma)
static inline double log_normal(double x, double mu, double sigma) {
    static const double LOG_SQRT_2PI = 0.5 * log(2.0 * M_PI);
    double z = (x - mu) / sigma;
    return -LOG_SQRT_2PI - log(sigma) - 0.5 * z * z; // exact log-pdf
}

// Unnormalized log target: mixture 0.5*N(0,1) + 0.5*N(5, 0.5)
// We include proper component normalizing constants so weights are meaningful.
static inline double log_target(double x) {
    double a = log(0.5) + log_normal(x, 0.0, 1.0);
    double b = log(0.5) + log_normal(x, 5.0, 0.5);
    return logsumexp(a, b); // log of sum of weighted components
}

struct MHResult { vector<double> samples; double accept_rate; double mean; double var; };

MHResult metropolis_random_walk(
    double x0,            // initial value
    int iterations,       // total iterations
    int burn_in,          // how many to discard
    int thin,             // keep every 'thin'-th sample
    double proposal_std,  // std of N(0, proposal_std^2)
    uint64_t seed         // RNG seed
) {
    mt19937_64 rng(seed);
    normal_distribution<double> norm01(0.0, 1.0);
    uniform_real_distribution<double> unif(0.0, 1.0);

    vector<double> out;
    out.reserve(max(0, (iterations - burn_in) / max(1, thin)) + 1);

    double x = x0;
    double logp_x = log_target(x); // cache current log π(x)
    int accepted = 0;

    // Random-walk proposal: x' = x + epsilon, epsilon ~ N(0, proposal_std^2)
    for (int t = 1; t <= iterations; ++t) {
        double x_prop = x + proposal_std * norm01(rng);
        double logp_prop = log_target(x_prop);

        // Symmetric proposal => log q terms cancel
        double log_alpha = min(0.0, logp_prop - logp_x);
        // Accept/reject in log-domain
        if (log(unif(rng)) < log_alpha) {
            x = x_prop;
            logp_x = logp_prop;
            ++accepted;
        }

        // Record after burn-in and per thinning schedule
        if (t > burn_in && ((t - burn_in) % max(1, thin) == 0)) {
            out.push_back(x);
        }
    }

    // Compute simple summaries
    double mean = 0.0;
    for (double v : out) mean += v;
    mean /= max<size_t>(1, out.size());
    double var = 0.0;
    for (double v : out) var += (v - mean) * (v - mean);
    var /= max<size_t>(1, out.size());

    MHResult res;
    res.samples = move(out);
    res.accept_rate = (double)accepted / (double)iterations;
    res.mean = mean;
    res.var = var;
    return res;
}

int main() {
    // Settings
    double x0 = -3.0;             // start near one mode
    int iterations = 20000;       // total MH steps
    int burn_in = 2000;           // discard initial steps
    int thin = 5;                 // keep every 5th sample
    double proposal_std = 1.0;    // tune for acceptance ~ 0.2-0.5
    uint64_t seed = 123456789ULL;

    MHResult res = metropolis_random_walk(x0, iterations, burn_in, thin, proposal_std, seed);

    cout.setf(std::ios::fixed); cout << setprecision(4);
    cout << "Collected samples: " << res.samples.size() << "\n";
    cout << "Acceptance rate: " << res.accept_rate << "\n";
    cout << "Sample mean: " << res.mean << "\n";
    cout << "Sample variance: " << res.var << "\n";

    // Print first few samples
    cout << "First 10 samples:" << '\n';
    for (size_t i = 0; i < min<size_t>(10, res.samples.size()); ++i) {
        cout << res.samples[i] << (i + 1 == 10 ? '\n' : ' ');
    }
    return 0;
}

This program samples from a 1D mixture of two Gaussians using a Gaussian random-walk proposal. Because the proposal is symmetric, the acceptance probability reduces to α = min(1, π(x')/π(x)). All calculations are done in the log domain for stability. Burn-in and thinning are supported, and acceptance rate plus simple summaries are reported. You can adjust the proposal_std to trade off acceptance rate and exploration.

Time: O(T) target evaluations (one per iteration), where T = iterations.
Space: O(K) to store K kept samples; O(1) extra working memory if streaming.
Metropolis–Hastings with asymmetric log-normal proposal on (0, ∞) (Hastings correction)
#include <bits/stdc++.h>
using namespace std;

// Log of unnormalized Gamma(k, theta) density: x^{k-1} e^{-x/theta}, support x>0
static inline double log_unnorm_gamma(double x, double k, double theta) {
    if (x <= 0.0) return -INFINITY; // outside support
    return (k - 1.0) * log(x) - (x / theta); // ignore constants independent of x
}

struct MHResult { vector<double> samples; double accept_rate; };

MHResult mh_log_normal_proposal(
    double x0, int iterations, int burn_in, int thin,
    double k, double theta,  // target Gamma shape k and scale theta
    double prop_sigma,       // std of log-space random walk
    uint64_t seed
) {
    mt19937_64 rng(seed);
    normal_distribution<double> norm01(0.0, 1.0);
    uniform_real_distribution<double> unif(0.0, 1.0);

    vector<double> out;
    out.reserve(max(0, (iterations - burn_in) / max(1, thin)) + 1);

    double x = max(1e-6, x0); // ensure positive start
    double logp_x = log_unnorm_gamma(x, k, theta);
    int accepted = 0;

    for (int t = 1; t <= iterations; ++t) {
        // Propose multiplicative step: y = x * exp(sigma * Z), Z ~ N(0,1)
        double z = norm01(rng);
        double y = x * exp(prop_sigma * z);

        // Compute log acceptance with Hastings correction
        double logp_y = log_unnorm_gamma(y, k, theta);
        // For log-normal proposals: log q(x|y) - log q(y|x) = log(y/x)
        double log_hastings = log(y) - log(x);
        double log_alpha = min(0.0, (logp_y - logp_x) + log_hastings);

        if (log(unif(rng)) < log_alpha) {
            x = y;
            logp_x = logp_y;
            ++accepted;
        }

        if (t > burn_in && ((t - burn_in) % max(1, thin) == 0)) {
            out.push_back(x);
        }
    }

    MHResult res;
    res.samples = move(out);
    res.accept_rate = (double)accepted / (double)iterations;
    return res;
}

int main() {
    double x0 = 1.0;             // positive start
    int iterations = 30000;
    int burn_in = 3000;
    int thin = 10;
    double k = 2.0, theta = 1.5; // target Gamma parameters
    double prop_sigma = 0.5;     // std of log-space move; tune for acceptance ~ 0.2-0.5
    uint64_t seed = 42ULL;

    MHResult res = mh_log_normal_proposal(x0, iterations, burn_in, thin, k, theta, prop_sigma, seed);

    cout.setf(std::ios::fixed); cout << setprecision(4);
    cout << "Kept samples: " << res.samples.size() << "\n";
    cout << "Acceptance rate: " << res.accept_rate << "\n";
    // Report simple summaries
    double mean = 0.0; for (double v : res.samples) mean += v; mean /= max<size_t>(1, res.samples.size());
    double var = 0.0; for (double v : res.samples) var += (v - mean) * (v - mean); var /= max<size_t>(1, res.samples.size());
    cout << "Sample mean: " << mean << "\n";
    cout << "Sample variance: " << var << "\n";

    // Print first few samples
    cout << "First 10 samples:\n";
    for (size_t i = 0; i < min<size_t>(10, res.samples.size()); ++i) cout << res.samples[i] << (i + 1 == 10 ? '\n' : ' ');
    return 0;
}

This example samples a Gamma target on (0, ∞) using a multiplicative log-normal random-walk proposal. Because the proposal is asymmetric, the Hastings correction appears as log q(x|y) − log q(y|x) = log(y/x). Working in log-space preserves numerical stability and naturally enforces the positivity constraint of the target. The program reports acceptance rate and simple summaries.

Time: O(T) target evaluations, plus O(1) work per iteration for proposal and acceptance.
Space: O(K) to store K kept samples; O(1) extra memory otherwise.
Tags: metropolis-hastings, mcmc, acceptance ratio, proposal distribution, detailed balance, random-walk, hastings correction, burn-in, thinning, autocorrelation, effective sample size, bayesian inference, log-density, markov chain, sampling