Multivariate Gaussian Distribution
Key Points
- A multivariate Gaussian (normal) distribution models a vector of real-valued variables with a bell-shaped probability hill in many dimensions.
- It is defined by a mean vector that sets the center and a covariance matrix that sets the shape, size, and orientation of the probability ellipsoid.
- Its probability density function uses the determinant and inverse (or Cholesky factor) of the covariance, so numerically stable linear algebra is essential.
- Distances are measured by the Mahalanobis distance, which rescales and rotates space to account for correlations between variables.
- Linear transformations of a multivariate Gaussian remain Gaussian, and marginals and conditionals are also Gaussian with closed-form parameters.
- Maximum likelihood estimation of the mean and covariance has simple formulas, but the covariance must be positive definite; regularization often helps.
- In code, avoid explicitly inverting the covariance; use a Cholesky decomposition to compute log-densities, sample, and solve systems robustly.
- Computationally, precompute a Cholesky factor once in O(d^3) and then evaluate many log-densities or draw samples in O(d^2) each.
Prerequisites
- Linear algebra (vectors, matrices, transpose, matrix multiplication) — The PDF and all computations rely on matrix operations, eigenvalues, and determinants.
- Probability and random variables — Understanding distributions, expectation, and independence is necessary to interpret μ and Σ.
- Variance and covariance — The covariance matrix encodes variances and covariances; their meaning drives model interpretation.
- Numerical stability and conditioning — Stable computation of log-densities and sampling requires avoiding explicit matrix inverses.
- Cholesky decomposition — Core tool for factoring Σ to compute Mahalanobis distances, log-determinants, and sampling.
- Random number generation — Sampling needs standard normal draws and reproducible RNGs.
Detailed Explanation
01 Overview
The multivariate Gaussian distribution is a cornerstone of probability and statistics for modeling continuous, real-valued vectors. Think of it as a higher-dimensional extension of the familiar bell curve. Instead of a single mean and variance, it uses a mean vector to specify the center of mass and a covariance matrix to capture spread and correlation among components. Correlation matters: if two dimensions tend to increase together, the probability mass stretches along that shared direction. This distribution is popular because it is mathematically elegant and stable under many operations: linear transformations of a Gaussian remain Gaussian, and the marginals (subsets of coordinates) and conditionals (given others) are also Gaussian with tractable formulas.
In applications, multivariate Gaussians approximate sensor noise, model feature vectors in computer vision, underlie Kalman filters in tracking, and appear in Bayesian inference, PCA, and Gaussian process models. Computationally, the covariance matrix must be symmetric positive definite; this ensures the distribution is well-defined. Implementation-wise, computing probabilities and drawing samples efficiently and stably requires matrix decompositions—especially the Cholesky factorization—rather than direct matrix inversion. The result is a model that is both interpretable (through means, variances, and correlations) and practical for a wide range of engineering and scientific tasks.
02 Intuition & Analogies
Picture a pile of sand poured onto a flat floor: in one dimension it forms a bell; in two dimensions it forms a smooth, mound-like hill; in more dimensions, it becomes a higher-dimensional hill you can’t see but can describe mathematically. The peak of the hill is at the mean vector μ, which says where the center of the mound sits. The shape of the hill is not necessarily circular; it can be stretched and tilted. That stretching and tilting are encoded by the covariance matrix Σ. If two coordinates tend to rise and fall together, the hill becomes elongated along that diagonal direction. If they vary independently, the hill is more circular (or spherical in higher dimensions).
A helpful mental model is to start with a spherical cloud of points (like a ball-shaped fog) where each direction has the same variance and no correlations. Apply a stretch-and-rotate operation (an affine transform): stretch more in some directions (larger variance), less in others, and rotate to align with preferred axes. The result is an ellipsoid-shaped cloud—the multivariate Gaussian. Distances inside this cloud are not best measured by plain Euclidean distance; instead, you measure how many standard deviations away a point is after undoing the stretch and rotation. That special distance is the Mahalanobis distance. If you rescale and rotate the space just right (by Σ^{-1/2}), the ellipsoid becomes a unit sphere and the distribution looks like a standard normal again.
In everyday terms: if you predict height and weight together, knowing someone is taller than average suggests they might also be heavier than average—those variables correlate. The multivariate Gaussian captures this idea by letting probability ‘lean’ along the joint trend rather than treating features independently.
03 Formal Definition
A random vector x ∈ R^d follows a multivariate Gaussian distribution, written x ~ N(μ, Σ), when its density is
f(x) = (2π)^{-d/2} |Σ|^{-1/2} exp( −½ (x − μ)^{\top} Σ^{-1} (x − μ) ),
where μ ∈ R^d is the mean vector and Σ ∈ R^{d×d} is a symmetric positive definite covariance matrix. The quadratic form in the exponent is the squared Mahalanobis distance, so equal-density contours are ellipsoids centered at μ whose axes align with the eigenvectors of Σ.
04 When to Use
Use a multivariate Gaussian when you need to model real-valued vectors that cluster around a central mean and exhibit elliptical, correlated spread. Typical scenarios include sensor fusion and tracking (Kalman filters assume Gaussian process and measurement noise), feature modeling in vision and audio (where covariance captures feature correlations), anomaly detection (large Mahalanobis distance flags outliers), and generative classification (LDA/QDA assume class-conditionals are Gaussian). It is also appropriate when linear combinations of many small, independent effects drive the data; by a multivariate central limit effect, the Gaussian becomes a good approximation.
It is especially effective if scatter plots of variable pairs look roughly elliptical and if residuals appear unimodal and symmetric. If you must repeatedly evaluate densities or sample, and you can precompute a factorization of the covariance, the model is computationally tractable. However, if the data are multimodal (several separate clusters) or strongly non-elliptical (heavy tails, skewness), a single Gaussian may be too restrictive; consider mixtures of Gaussians or other distributions. In high dimensions with limited data, covariance estimation becomes ill-conditioned; then use regularization (e.g., add λI), shrinkage estimators, or low-rank structure (factor analysis) to remain robust.
⚠️ Common Mistakes
- Inverting Σ directly: numerical inversion is unstable and slow. Prefer a Cholesky decomposition Σ = L L^{\top} to compute log-densities and Mahalanobis distances without forming Σ^{-1}.
- Forgetting positive definiteness: a sample covariance can be singular (especially when n ≤ d or features are collinear). Add a small ridge term λI to ensure Σ is positive definite and Cholesky succeeds.
- Mixing biased and unbiased estimators: the MLE uses 1/n while the unbiased sample covariance uses 1/(n−1). Be consistent with your statistical goals when reporting or regularizing.
- Ignoring units and scaling: variables with wildly different scales can dominate distances. Standardize or carefully interpret Σ; Mahalanobis distance already accounts for scale, but an ill-conditioned Σ still harms numerics.
- Dropping the normalization constant: when computing true probabilities (or comparing non-nested models), include the determinant term; for relative comparisons within the same Σ, the constant may cancel, but not across different Σ.
- Misinterpreting correlation as causation: a large off-diagonal covariance indicates association, not causal influence.
- Assuming mixtures are Gaussian: a mixture of two different Gaussians is not Gaussian; linear transformations of a Gaussian and sums of independent Gaussians preserve normality, but mixtures do not.
- Failing to use log-space: products of small densities underflow; sum log-densities instead and use the log-pdf.
Key Formulas
Multivariate Gaussian PDF
f(x) = (2π)^{-d/2} |Σ|^{-1/2} exp( −½ (x − μ)^{\top} Σ^{-1} (x − μ) )
Explanation: This gives the probability density at point x. It depends on the squared Mahalanobis distance and the determinant of the covariance for proper normalization.
Log-PDF
log f(x) = −½ [ d log(2π) + log|Σ| + (x − μ)^{\top} Σ^{-1} (x − μ) ]
Explanation: Working in log-space avoids numerical underflow and turns products into sums. It is the standard way to compute likelihoods in high dimensions.
Mahalanobis Distance Squared
D²(x) = (x − μ)^{\top} Σ^{-1} (x − μ)
Explanation: This measures distance in units of standard deviation along principal directions. Equal-density contours correspond to constant Mahalanobis distance.
Mean Estimator (MLE)
μ̂ = (1/n) ∑_{i=1}^{n} x_i
Explanation: The maximum likelihood estimate of the mean is the sample average. It centers the Gaussian at the empirical mean of the data.
Covariance Estimator (MLE)
Σ̂ = (1/n) ∑_{i=1}^{n} (x_i − μ̂)(x_i − μ̂)^{\top}
Explanation: This is the MLE for covariance assuming independent samples. Using 1/(n−1) instead yields the unbiased sample covariance.
Affine Transform of a Gaussian
If x ~ N(μ, Σ), then y = A x + b ~ N(A μ + b, A Σ A^{\top})
Explanation: Applying a linear map and shift to a Gaussian preserves normality. The mean and covariance transform predictably.
Conditional Gaussian
For x = (x₁, x₂) with mean blocks (μ₁, μ₂) and covariance blocks Σ₁₁, Σ₁₂, Σ₂₁, Σ₂₂:
x₁ | x₂ = a ~ N( μ₁ + Σ₁₂ Σ₂₂^{-1} (a − μ₂), Σ₁₁ − Σ₁₂ Σ₂₂^{-1} Σ₂₁ )
Explanation: Conditioning one block of variables on another yields a Gaussian. The conditional mean is an affine function of the observed block, and the conditional covariance shrinks accordingly.
Sum of Independent Gaussians
If x ~ N(μ₁, Σ₁) and y ~ N(μ₂, Σ₂) are independent, then x + y ~ N(μ₁ + μ₂, Σ₁ + Σ₂)
Explanation: The sum of independent Gaussian vectors is Gaussian. Covariances add and means add when independence holds.
Differential Entropy
h(x) = ½ log( (2πe)^d |Σ| ) = (d/2) log(2πe) + ½ log|Σ|
Explanation: Entropy of a multivariate normal depends only on the determinant of the covariance. Larger spread (determinant) means higher uncertainty.
KL Divergence Between Gaussians
KL( N(μ₁, Σ₁) ‖ N(μ₂, Σ₂) ) = ½ [ tr(Σ₂^{-1} Σ₁) + (μ₂ − μ₁)^{\top} Σ₂^{-1} (μ₂ − μ₁) − d + log(|Σ₂| / |Σ₁|) ]
Explanation: A closed-form measure of how one Gaussian differs from another. Useful in variational methods and model comparison.
Determinant via Eigenvalues
|Σ| = λ₁ λ₂ ⋯ λ_d, so log|Σ| = ∑_{i=1}^{d} log λ_i
Explanation: For symmetric positive definite Σ with eigenvalues λ₁, …, λ_d, the determinant is their product. The log-determinant is the sum of their logs and is numerically stable.
Complexity Analysis
Cholesky factorization of a d×d covariance costs O(d^3) time and O(d^2) memory, but it is done once per Σ. After that, each triangular solve — and hence each Mahalanobis distance, log-density evaluation, or sample draw — costs O(d^2), and the log-determinant is read off the factor's diagonal in O(d). Fitting the MLE from n samples costs O(n d) for the mean and O(n d^2) for the covariance.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Simple matrix/vector aliases
using Vec = vector<double>;
using Mat = vector<vector<double>>;

// Create an identity matrix of size d
Mat eye(int d) { Mat I(d, Vec(d, 0.0)); for (int i = 0; i < d; ++i) I[i][i] = 1.0; return I; }

// Cholesky decomposition: Sigma = L * L^T for symmetric positive definite Sigma.
// Returns true on success, false if Sigma is not SPD.
bool cholesky(const Mat &Sigma, Mat &L) {
    int d = (int)Sigma.size();
    L.assign(d, Vec(d, 0.0));
    for (int i = 0; i < d; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = Sigma[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0) return false; // not SPD
                L[i][i] = sqrt(max(sum, 0.0));
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

// Solve L * y = b for y (L is lower triangular)
Vec solveLower(const Mat &L, const Vec &b) {
    int d = (int)L.size();
    Vec y(d, 0.0);
    for (int i = 0; i < d; ++i) {
        double sum = b[i];
        for (int k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    return y;
}

// Solve L^T * x = y for x (L is lower triangular)
Vec solveUpperFromLowerT(const Mat &L, const Vec &y) {
    int d = (int)L.size();
    Vec x(d, 0.0);
    for (int i = d - 1; i >= 0; --i) {
        double sum = y[i];
        for (int k = i + 1; k < d; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}

// Compute log(det(Sigma)) from its Cholesky factor L: log det = 2 * sum log L_ii
double logDetFromCholesky(const Mat &L) {
    double s = 0.0;
    for (int i = 0; i < (int)L.size(); ++i) s += log(L[i][i]);
    return 2.0 * s;
}

// Mahalanobis distance squared using Cholesky: y = L^{-1}(x-mu); D^2 = y^T y
double mahalanobis2_from_cholesky(const Vec &x, const Vec &mu, const Mat &L) {
    int d = (int)x.size();
    Vec xm(d);
    for (int i = 0; i < d; ++i) xm[i] = x[i] - mu[i];
    Vec y = solveLower(L, xm); // y = L^{-1}(x-mu)
    double d2 = 0.0;
    for (double v : y) d2 += v * v;
    return d2;
}

// Log-pdf of N(mu, Sigma) using precomputed Cholesky L of Sigma
double logpdf_mvn(const Vec &x, const Vec &mu, const Mat &L) {
    int d = (int)x.size();
    double d2 = mahalanobis2_from_cholesky(x, mu, L);
    double logdet = logDetFromCholesky(L);
    const double log2pi = log(2.0 * M_PI);
    return -0.5 * (d * log2pi + logdet + d2);
}

int main() {
    // Example: 2D Gaussian with correlation
    Vec mu = {1.0, -2.0};
    Mat Sigma = {{2.0, 0.8},
                 {0.8, 1.0}}; // SPD if 2*1 - 0.8^2 > 0

    Mat L;
    if (!cholesky(Sigma, L)) {
        cerr << "Covariance is not SPD; consider adding a small ridge to the diagonal.\n";
        return 1;
    }

    Vec x = {0.5, -1.0};

    double d2 = mahalanobis2_from_cholesky(x, mu, L);
    double lp = logpdf_mvn(x, mu, L);

    cout << fixed << setprecision(6);
    cout << "Mahalanobis^2: " << d2 << "\n";
    cout << "Log-pdf: " << lp << "\n";

    // Demonstrate batched log-likelihood for multiple points
    vector<Vec> data = { {0.5, -1.0}, {1.2, -2.1}, {2.0, -2.5} };
    double total_loglik = 0.0;
    for (const auto &xi : data) total_loglik += logpdf_mvn(xi, mu, L);
    cout << "Total log-likelihood of 3 points: " << total_loglik << "\n";
    return 0;
}
```
This program computes the Mahalanobis distance and the log-pdf of a multivariate Gaussian using a Cholesky factor of the covariance. It avoids forming the inverse by solving triangular systems, which is more stable and efficient. The log-determinant is read directly from the Cholesky diagonal.
```cpp
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

bool cholesky(const Mat &Sigma, Mat &L) {
    int d = (int)Sigma.size();
    L.assign(d, vector<double>(d, 0.0));
    for (int i = 0; i < d; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = Sigma[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0) return false;
                L[i][i] = sqrt(max(sum, 0.0));
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

Vec matvec_lower(const Mat &L, const Vec &z) {
    int d = (int)L.size();
    Vec x(d, 0.0);
    for (int i = 0; i < d; ++i) {
        double sum = 0.0;
        for (int j = 0; j <= i; ++j) sum += L[i][j] * z[j];
        x[i] = sum;
    }
    return x;
}

int main() {
    // Define a 3D Gaussian
    Vec mu = {0.0, 1.0, -1.0};
    Mat Sigma = {
        {1.0, 0.5, 0.2},
        {0.5, 2.0, 0.3},
        {0.2, 0.3, 1.5}
    };

    Mat L;
    if (!cholesky(Sigma, L)) {
        cerr << "Sigma not SPD. Add ridge: Sigma[i][i] += 1e-6 and retry.\n";
        return 1;
    }

    // Random engine and standard normals
    std::mt19937_64 rng(42);
    std::normal_distribution<double> stdnorm(0.0, 1.0);

    auto sample = [&](void) {
        int d = (int)mu.size();
        Vec z(d); for (int i = 0; i < d; ++i) z[i] = stdnorm(rng); // z ~ N(0, I)
        Vec Lz = matvec_lower(L, z);                               // L z
        Vec x(d); for (int i = 0; i < d; ++i) x[i] = mu[i] + Lz[i]; // x = mu + L z
        return x;
    };

    // Draw and print 5 samples
    cout << fixed << setprecision(5);
    for (int k = 0; k < 5; ++k) {
        Vec x = sample();
        cout << "sample " << k << ": ";
        for (int i = 0; i < (int)x.size(); ++i) cout << x[i] << (i + 1 == (int)x.size() ? "\n" : " ");
    }

    return 0;
}
```
This program samples from a multivariate Gaussian by factoring Σ = L L^T and transforming standard normal draws z via x = μ + L z. Using Cholesky ensures the correct covariance and numerical stability.
```cpp
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

Mat zeros(int r, int c) { return Mat(r, Vec(c, 0.0)); }

bool cholesky(const Mat &Sigma, Mat &L) {
    int d = (int)Sigma.size();
    L.assign(d, Vec(d, 0.0));
    for (int i = 0; i < d; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = Sigma[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0) return false;
                L[i][i] = sqrt(max(sum, 0.0));
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

Vec solveLower(const Mat &L, const Vec &b) {
    int d = (int)L.size(); Vec y(d, 0.0);
    for (int i = 0; i < d; ++i) { double s = b[i]; for (int k = 0; k < i; ++k) s -= L[i][k] * y[k]; y[i] = s / L[i][i]; }
    return y;
}

double logDetFromCholesky(const Mat &L) { double s = 0.0; for (int i = 0; i < (int)L.size(); ++i) s += log(L[i][i]); return 2.0 * s; }

double logpdf_mvn(const Vec &x, const Vec &mu, const Mat &L) {
    int d = (int)x.size(); Vec xm(d); for (int i = 0; i < d; ++i) xm[i] = x[i] - mu[i];
    Vec y = solveLower(L, xm); double d2 = 0.0; for (double v : y) d2 += v * v;
    double logdet = logDetFromCholesky(L);
    return -0.5 * (d * log(2.0 * M_PI) + logdet + d2);
}

// Fit mean (MLE) and covariance (MLE) with optional ridge lambda on the diagonal
pair<Vec, Mat> fit_gaussian_mle(const vector<Vec> &data, double ridge_lambda = 0.0) {
    int n = (int)data.size();
    int d = (int)data[0].size();
    // Mean
    Vec mu(d, 0.0);
    for (const auto &x : data) for (int j = 0; j < d; ++j) mu[j] += x[j];
    for (int j = 0; j < d; ++j) mu[j] /= n;
    // Covariance MLE (1/n)
    Mat S = zeros(d, d);
    for (const auto &x : data) {
        for (int i = 0; i < d; ++i) {
            double xi = x[i] - mu[i];
            for (int j = 0; j <= i; ++j) { double xj = x[j] - mu[j]; S[i][j] += xi * xj; }
        }
    }
    for (int i = 0; i < d; ++i) {
        for (int j = 0; j <= i; ++j) { S[i][j] /= n; S[j][i] = S[i][j]; }
    }
    // Ridge regularization
    for (int i = 0; i < d; ++i) S[i][i] += ridge_lambda;
    return {mu, S};
}

int main() {
    // Synthetic 2D dataset (e.g., two correlated features)
    vector<Vec> data = {
        {2.1, 1.0}, {1.9, 1.2}, {2.0, 1.1}, {2.2, 0.9}, {1.8, 1.3},
        {2.3, 0.8}, {1.7, 1.4}, {2.1, 1.0}, {2.0, 0.95}, {2.2, 1.05}
    };

    // Fit parameters with a small ridge for numerical safety
    double lambda = 1e-6;
    auto [mu, Sigma] = fit_gaussian_mle(data, lambda);

    // Factor for fast log-likelihood
    Mat L;
    if (!cholesky(Sigma, L)) { cerr << "Failed Cholesky; increase ridge.\n"; return 1; }

    // Compute average log-likelihood on the training data
    double sum_loglik = 0.0;
    for (const auto &x : data) sum_loglik += logpdf_mvn(x, mu, L);
    double avg_loglik = sum_loglik / (double)data.size();

    cout << fixed << setprecision(6);
    cout << "Estimated mu: [" << mu[0] << ", " << mu[1] << "]\n";
    cout << "Estimated Sigma:\n";
    cout << "[" << Sigma[0][0] << ", " << Sigma[0][1] << "]\n";
    cout << "[" << Sigma[1][0] << ", " << Sigma[1][1] << "]\n";
    cout << "Average log-likelihood: " << avg_loglik << "\n";

    // Score a new point
    Vec x_new = {2.4, 0.7};
    double lp_new = logpdf_mvn(x_new, mu, L);
    cout << "Log-pdf of new point: " << lp_new << "\n";
    return 0;
}
```
This program estimates the mean and covariance by maximum likelihood from data, adds a small ridge to ensure positive definiteness, factors the covariance via Cholesky, and evaluates average log-likelihood and the log-pdf of a new point. It demonstrates end-to-end modeling and scoring.