Multivariate Gaussian Distribution
Key Points
- A multivariate Gaussian (normal) distribution models a vector of real-valued variables with a bell-shaped probability hill in many dimensions.
- It is defined by a mean vector that sets the center and a covariance matrix that sets the shape, size, and orientation of the probability ellipsoid.
- Its probability density function uses the determinant and inverse (or Cholesky factor) of the covariance, so numerically stable linear algebra is essential.
- Distances are measured by the Mahalanobis distance, which rescales and rotates space to account for correlations between variables.
- Linear transformations of a multivariate Gaussian remain Gaussian, and marginals and conditionals are also Gaussian with closed-form parameters.
- Maximum likelihood estimation of the mean and covariance has simple formulas, but the covariance must be positive definite; regularization often helps.
- In code, avoid explicitly inverting the covariance; use a Cholesky decomposition to compute log-densities, sample, and solve systems robustly.
- Computationally, precompute a Cholesky factor once in O(d^3) and then evaluate many log-densities or draw samples in O(d^2) each.
Prerequisites
- Linear algebra (vectors, matrices, transpose, matrix multiplication) — The PDF and all computations rely on matrix operations, eigenvalues, and determinants.
- Probability and random variables — Understanding distributions, expectation, and independence is necessary to interpret μ and Σ.
- Variance and covariance — The covariance matrix encodes variances and covariances; their meaning drives model interpretation.
- Numerical stability and conditioning — Stable computation of log-densities and sampling requires avoiding explicit matrix inverses.
- Cholesky decomposition — Core tool for factoring Σ to compute Mahalanobis distances, log-determinants, and sampling.
- Random number generation — Sampling needs standard normal draws and reproducible RNGs.
Detailed Explanation
01 Overview
The multivariate Gaussian distribution is a cornerstone of probability and statistics for modeling continuous, real-valued vectors. Think of it as a higher-dimensional extension of the familiar bell curve. Instead of a single mean and variance, it uses a mean vector to specify the center of mass and a covariance matrix to capture spread and correlation among components. Correlation matters: if two dimensions tend to increase together, the probability mass stretches along that shared direction. This distribution is popular because it is mathematically elegant and stable under many operations: linear transformations of a Gaussian remain Gaussian, and the marginals (subsets of coordinates) and conditionals (given others) are also Gaussian with tractable formulas.
In applications, multivariate Gaussians approximate sensor noise, model feature vectors in computer vision, underlie Kalman filters in tracking, and appear in Bayesian inference, PCA, and Gaussian process models. Computationally, the covariance matrix must be symmetric positive definite; this ensures the distribution is well-defined. Implementation-wise, computing probabilities and drawing samples efficiently and stably requires matrix decompositions—especially the Cholesky factorization—rather than direct matrix inversion. The result is a model that is both interpretable (through means, variances, and correlations) and practical for a wide range of engineering and scientific tasks.
02 Intuition & Analogies
Picture a pile of sand poured onto a flat floor: in one dimension it forms a bell; in two dimensions it forms a smooth, mound-like hill; in more dimensions, it becomes a higher-dimensional hill you can’t see but can describe mathematically. The peak of the hill is at the mean vector μ, which says where the center of the mound sits. The shape of the hill is not necessarily circular; it can be stretched and tilted. That stretching and tilting are encoded by the covariance matrix Σ. If two coordinates tend to rise and fall together, the hill becomes elongated along that diagonal direction. If they vary independently, the hill is more circular (or spherical in higher dimensions).
A helpful mental model is to start with a spherical cloud of points (like a ball-shaped fog) where each direction has the same variance and no correlations. Apply a stretch-and-rotate operation (an affine transform): stretch more in some directions (larger variance), less in others, and rotate to align with preferred axes. The result is an ellipsoid-shaped cloud—the multivariate Gaussian. Distances inside this cloud are not best measured by plain Euclidean distance; instead, you measure how many standard deviations away a point is after undoing the stretch and rotation. That special distance is the Mahalanobis distance. If you rescale and rotate the space just right (by Σ^{-1/2}), the ellipsoid becomes a unit sphere and the distribution looks like a standard normal again.
In everyday terms: if you predict height and weight together, knowing someone is taller than average suggests they might also be heavier than average—those variables correlate. The multivariate Gaussian captures this idea by letting probability ‘lean’ along the joint trend rather than treating features independently.
03 Formal Definition
A random vector x ∈ R^d follows a multivariate Gaussian distribution, written x ~ N(μ, Σ), when its density is
f(x) = (2π)^{-d/2} |Σ|^{-1/2} exp( −½ (x − μ)^{\top} Σ^{-1} (x − μ) ),
where μ ∈ R^d is the mean vector and Σ ∈ R^{d×d} is a symmetric positive definite covariance matrix. The quadratic form in the exponent is the squared Mahalanobis distance, so equal-density contours are ellipsoids centered at μ whose axes align with the eigenvectors of Σ.
04 When to Use
Use a multivariate Gaussian when you need to model real-valued vectors that cluster around a central mean and exhibit elliptical, correlated spread. Typical scenarios include sensor fusion and tracking (Kalman filters assume Gaussian process and measurement noise), feature modeling in vision and audio (where covariance captures feature correlations), anomaly detection (large Mahalanobis distance flags outliers), and generative classification (LDA/QDA assume class-conditionals are Gaussian). It is also appropriate when linear combinations of many small, independent effects drive the data; by a multivariate central limit effect, the Gaussian becomes a good approximation.
It is especially effective if scatter plots of variable pairs look roughly elliptical and if residuals appear unimodal and symmetric. If you must repeatedly evaluate densities or sample, and you can precompute a factorization of the covariance, the model is computationally tractable. However, if the data are multimodal (several separate clusters) or strongly non-elliptical (heavy tails, skewness), a single Gaussian may be too restrictive; consider mixtures of Gaussians or other distributions. In high dimensions with limited data, covariance estimation becomes ill-conditioned; then use regularization (e.g., add λI), shrinkage estimators, or low-rank structure (factor analysis) to remain robust.
⚠️ Common Mistakes
- Inverting Σ directly: numerical inversion is unstable and slow. Prefer a Cholesky decomposition Σ = L L^{\top} to compute log-densities and Mahalanobis distances without forming Σ^{-1}.
- Forgetting positive definiteness: a sample covariance can be singular (especially when n ≤ d or features are collinear). Add a small ridge term λI to ensure Σ is positive definite and Cholesky succeeds.
- Mixing biased and unbiased estimators: the MLE uses 1/n while the unbiased sample covariance uses 1/(n−1). Be consistent with your statistical goals when reporting or regularizing.
- Ignoring units and scaling: variables with wildly different scales can dominate distances. Standardize or carefully interpret Σ; Mahalanobis distance already accounts for scale, but an ill-conditioned Σ still harms numerics.
- Dropping the normalization constant: when computing true probabilities (or comparing non-nested models), include the determinant term; for relative comparisons within the same Σ, the constant may cancel, but not across different Σ.
- Misinterpreting correlation as causation: a large off-diagonal covariance indicates association, not causal influence.
- Assuming mixtures are Gaussian: a mixture of two different Gaussians is not Gaussian; linear transformations of a Gaussian and sums of independent Gaussians preserve normality, but mixtures do not.
- Failing to use log-space: products of small densities underflow; sum log-densities instead and use the log-pdf.
Key Formulas
Multivariate Gaussian PDF
f(x) = (2π)^{-d/2} |Σ|^{-1/2} exp( −½ (x − μ)^{\top} Σ^{-1} (x − μ) )
Explanation: This gives the probability density at point x. It depends on the squared Mahalanobis distance and the determinant of the covariance for proper normalization.
Log-PDF
log f(x) = −½ [ d log(2π) + log|Σ| + (x − μ)^{\top} Σ^{-1} (x − μ) ]
Explanation: Working in log-space avoids numerical underflow and turns products into sums. It is the standard way to compute likelihoods in high dimensions.
Mahalanobis Distance Squared
D²(x) = (x − μ)^{\top} Σ^{-1} (x − μ)
Explanation: This measures distance in units of standard deviation along principal directions. Equal-density contours correspond to constant Mahalanobis distance.
Mean Estimator (MLE)
μ̂ = (1/n) ∑_{i=1}^{n} x_i
Explanation: The maximum likelihood estimate of the mean is the sample average. It centers the Gaussian at the empirical mean of the data.
Covariance Estimator (MLE)
Σ̂ = (1/n) ∑_{i=1}^{n} (x_i − μ̂)(x_i − μ̂)^{\top}
Explanation: This is the MLE for covariance assuming independent samples. Using 1/(n−1) instead yields the unbiased sample covariance.
Affine Transform of a Gaussian
If x ~ N(μ, Σ), then y = A x + b ~ N(A μ + b, A Σ A^{\top})
Explanation: Applying a linear map and shift to a Gaussian preserves normality. The mean and covariance transform predictably.
Conditional Gaussian
For x = (x₁, x₂) with mean blocks (μ₁, μ₂) and covariance blocks Σ₁₁, Σ₁₂, Σ₂₁, Σ₂₂:
x₁ | x₂ = a ~ N( μ₁ + Σ₁₂ Σ₂₂^{-1} (a − μ₂), Σ₁₁ − Σ₁₂ Σ₂₂^{-1} Σ₂₁ )
Explanation: Conditioning one block of variables on another yields a Gaussian. The conditional mean is an affine function of the observed block, and the conditional covariance shrinks accordingly.
Sum of Independent Gaussians
If x ~ N(μ₁, Σ₁) and y ~ N(μ₂, Σ₂) are independent, then x + y ~ N(μ₁ + μ₂, Σ₁ + Σ₂)
Explanation: The sum of independent Gaussian vectors is Gaussian. Covariances add and means add when independence holds.
Differential Entropy
h(x) = ½ log( (2πe)^d |Σ| ) = (d/2) log(2πe) + ½ log|Σ|
Explanation: Entropy of a multivariate normal depends only on the determinant of the covariance. Larger spread (determinant) means higher uncertainty.
KL Divergence Between Gaussians
KL( N(μ₁, Σ₁) ‖ N(μ₂, Σ₂) ) = ½ [ tr(Σ₂^{-1} Σ₁) + (μ₂ − μ₁)^{\top} Σ₂^{-1} (μ₂ − μ₁) − d + log(|Σ₂| / |Σ₁|) ]
Explanation: A closed-form measure of how one Gaussian differs from another. Useful in variational methods and model comparison.
Determinant via Eigenvalues
|Σ| = λ₁ λ₂ ⋯ λ_d, so log|Σ| = ∑_{i=1}^{d} log λ_i
Explanation: For symmetric positive definite Σ with eigenvalues λ₁, …, λ_d, the determinant is their product. The log-determinant is the sum of their logs and is numerically stable.
Complexity Analysis
Cholesky factorization of a d×d covariance costs O(d^3) time and O(d^2) memory, but it is done once per Σ. After that, each triangular solve — and hence each Mahalanobis distance, log-density evaluation, or sample draw — costs O(d^2), and the log-determinant is read off the factor's diagonal in O(d). Fitting the MLE from n samples costs O(n d) for the mean and O(n d^2) for the covariance.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Simple matrix/vector aliases
using Vec = vector<double>;
using Mat = vector<vector<double>>;

// Create an identity matrix of size d
Mat eye(int d) { Mat I(d, Vec(d, 0.0)); for (int i = 0; i < d; ++i) I[i][i] = 1.0; return I; }

// Cholesky decomposition: Sigma = L * L^T for symmetric positive definite Sigma.
// Returns true on success, false if Sigma is not SPD.
bool cholesky(const Mat &Sigma, Mat &L) {
    int d = (int)Sigma.size();
    L.assign(d, Vec(d, 0.0));
    for (int i = 0; i < d; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = Sigma[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0) return false; // not SPD
                L[i][i] = sqrt(max(sum, 0.0));
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

// Solve L * y = b for y (L is lower triangular)
Vec solveLower(const Mat &L, const Vec &b) {
    int d = (int)L.size();
    Vec y(d, 0.0);
    for (int i = 0; i < d; ++i) {
        double sum = b[i];
        for (int k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    return y;
}

// Solve L^T * x = y for x (L is lower triangular)
Vec solveUpperFromLowerT(const Mat &L, const Vec &y) {
    int d = (int)L.size();
    Vec x(d, 0.0);
    for (int i = d - 1; i >= 0; --i) {
        double sum = y[i];
        for (int k = i + 1; k < d; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}

// Compute log(det(Sigma)) from its Cholesky factor L: log det = 2 * sum log L_ii
double logDetFromCholesky(const Mat &L) {
    double s = 0.0;
    for (int i = 0; i < (int)L.size(); ++i) s += log(L[i][i]);
    return 2.0 * s;
}

// Mahalanobis distance squared using Cholesky: y = L^{-1}(x-mu); D^2 = y^T y
double mahalanobis2_from_cholesky(const Vec &x, const Vec &mu, const Mat &L) {
    int d = (int)x.size();
    Vec xm(d);
    for (int i = 0; i < d; ++i) xm[i] = x[i] - mu[i];
    Vec y = solveLower(L, xm); // y = L^{-1}(x-mu)
    double d2 = 0.0;
    for (double v : y) d2 += v * v;
    return d2;
}

// Log-pdf of N(mu, Sigma) using precomputed Cholesky L of Sigma
double logpdf_mvn(const Vec &x, const Vec &mu, const Mat &L) {
    int d = (int)x.size();
    double d2 = mahalanobis2_from_cholesky(x, mu, L);
    double logdet = logDetFromCholesky(L);
    const double log2pi = log(2.0 * M_PI);
    return -0.5 * (d * log2pi + logdet + d2);
}

int main() {
    // Example: 2D Gaussian with correlation
    Vec mu = {1.0, -2.0};
    Mat Sigma = {{2.0, 0.8},
                 {0.8, 1.0}}; // SPD if 2*1 - 0.8^2 > 0

    Mat L;
    if (!cholesky(Sigma, L)) {
        cerr << "Covariance is not SPD; consider adding a small ridge to the diagonal.\n";
        return 1;
    }

    Vec x = {0.5, -1.0};

    double d2 = mahalanobis2_from_cholesky(x, mu, L);
    double lp = logpdf_mvn(x, mu, L);

    cout << fixed << setprecision(6);
    cout << "Mahalanobis^2: " << d2 << "\n";
    cout << "Log-pdf: " << lp << "\n";

    // Demonstrate batched log-likelihood for multiple points
    vector<Vec> data = { {0.5, -1.0}, {1.2, -2.1}, {2.0, -2.5} };
    double total_loglik = 0.0;
    for (const auto &xi : data) total_loglik += logpdf_mvn(xi, mu, L);
    cout << "Total log-likelihood of 3 points: " << total_loglik << "\n";
    return 0;
}
```
This program computes the Mahalanobis distance and the log-pdf of a multivariate Gaussian using a Cholesky factor of the covariance. It avoids forming the inverse by solving triangular systems, which is more stable and efficient. The log-determinant is read directly from the Cholesky diagonal.
```cpp
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

bool cholesky(const Mat &Sigma, Mat &L) {
    int d = (int)Sigma.size();
    L.assign(d, vector<double>(d, 0.0));
    for (int i = 0; i < d; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = Sigma[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0) return false;
                L[i][i] = sqrt(max(sum, 0.0));
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

Vec matvec_lower(const Mat &L, const Vec &z) {
    int d = (int)L.size();
    Vec x(d, 0.0);
    for (int i = 0; i < d; ++i) {
        double sum = 0.0;
        for (int j = 0; j <= i; ++j) sum += L[i][j] * z[j];
        x[i] = sum;
    }
    return x;
}

int main() {
    // Define a 3D Gaussian
    Vec mu = {0.0, 1.0, -1.0};
    Mat Sigma = {
        {1.0, 0.5, 0.2},
        {0.5, 2.0, 0.3},
        {0.2, 0.3, 1.5}
    };

    Mat L;
    if (!cholesky(Sigma, L)) {
        cerr << "Sigma not SPD. Add ridge: Sigma[i][i] += 1e-6 and retry.\n";
        return 1;
    }

    // Random engine and standard normals
    std::mt19937_64 rng(42);
    std::normal_distribution<double> stdnorm(0.0, 1.0);

    auto sample = [&](void) {
        int d = (int)mu.size();
        Vec z(d); for (int i = 0; i < d; ++i) z[i] = stdnorm(rng); // z ~ N(0, I)
        Vec Lz = matvec_lower(L, z);                               // L z
        Vec x(d); for (int i = 0; i < d; ++i) x[i] = mu[i] + Lz[i]; // x = mu + L z
        return x;
    };

    // Draw and print 5 samples
    cout << fixed << setprecision(5);
    for (int k = 0; k < 5; ++k) {
        Vec x = sample();
        cout << "sample " << k << ": ";
        for (int i = 0; i < (int)x.size(); ++i) cout << x[i] << (i + 1 == (int)x.size() ? "\n" : " ");
    }

    return 0;
}
```
This program samples from a multivariate Gaussian by factoring Σ = L L^T and transforming standard normal draws z via x = μ + L z. Using Cholesky ensures the correct covariance and numerical stability.
```cpp
#include <bits/stdc++.h>
using namespace std;

using Vec = vector<double>;
using Mat = vector<vector<double>>;

Mat zeros(int r, int c) { return Mat(r, Vec(c, 0.0)); }

bool cholesky(const Mat &Sigma, Mat &L) {
    int d = (int)Sigma.size();
    L.assign(d, Vec(d, 0.0));
    for (int i = 0; i < d; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = Sigma[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            if (i == j) {
                if (sum <= 0.0) return false;
                L[i][i] = sqrt(max(sum, 0.0));
            } else {
                L[i][j] = sum / L[j][j];
            }
        }
    }
    return true;
}

Vec solveLower(const Mat &L, const Vec &b) {
    int d = (int)L.size(); Vec y(d, 0.0);
    for (int i = 0; i < d; ++i) { double s = b[i]; for (int k = 0; k < i; ++k) s -= L[i][k] * y[k]; y[i] = s / L[i][i]; }
    return y;
}

double logDetFromCholesky(const Mat &L) { double s = 0.0; for (int i = 0; i < (int)L.size(); ++i) s += log(L[i][i]); return 2.0 * s; }

double logpdf_mvn(const Vec &x, const Vec &mu, const Mat &L) {
    int d = (int)x.size(); Vec xm(d); for (int i = 0; i < d; ++i) xm[i] = x[i] - mu[i];
    Vec y = solveLower(L, xm); double d2 = 0.0; for (double v : y) d2 += v * v;
    double logdet = logDetFromCholesky(L);
    return -0.5 * (d * log(2.0 * M_PI) + logdet + d2);
}

// Fit mean (MLE) and covariance (MLE) with optional ridge lambda on the diagonal
pair<Vec, Mat> fit_gaussian_mle(const vector<Vec> &data, double ridge_lambda = 0.0) {
    int n = (int)data.size();
    int d = (int)data[0].size();
    // Mean
    Vec mu(d, 0.0);
    for (const auto &x : data) for (int j = 0; j < d; ++j) mu[j] += x[j];
    for (int j = 0; j < d; ++j) mu[j] /= n;
    // Covariance MLE (1/n)
    Mat S = zeros(d, d);
    for (const auto &x : data) {
        for (int i = 0; i < d; ++i) {
            double xi = x[i] - mu[i];
            for (int j = 0; j <= i; ++j) { double xj = x[j] - mu[j]; S[i][j] += xi * xj; }
        }
    }
    for (int i = 0; i < d; ++i) {
        for (int j = 0; j <= i; ++j) { S[i][j] /= n; S[j][i] = S[i][j]; }
    }
    // Ridge regularization
    for (int i = 0; i < d; ++i) S[i][i] += ridge_lambda;
    return {mu, S};
}

int main() {
    // Synthetic 2D dataset (e.g., two correlated features)
    vector<Vec> data = {
        {2.1, 1.0}, {1.9, 1.2}, {2.0, 1.1}, {2.2, 0.9}, {1.8, 1.3},
        {2.3, 0.8}, {1.7, 1.4}, {2.1, 1.0}, {2.0, 0.95}, {2.2, 1.05}
    };

    // Fit parameters with a small ridge for numerical safety
    double lambda = 1e-6;
    auto [mu, Sigma] = fit_gaussian_mle(data, lambda);

    // Factor for fast log-likelihood
    Mat L;
    if (!cholesky(Sigma, L)) { cerr << "Failed Cholesky; increase ridge.\n"; return 1; }

    // Compute average log-likelihood on the training data
    double sum_loglik = 0.0;
    for (const auto &x : data) sum_loglik += logpdf_mvn(x, mu, L);
    double avg_loglik = sum_loglik / (double)data.size();

    cout << fixed << setprecision(6);
    cout << "Estimated mu: [" << mu[0] << ", " << mu[1] << "]\n";
    cout << "Estimated Sigma:\n";
    cout << "[" << Sigma[0][0] << ", " << Sigma[0][1] << "]\n";
    cout << "[" << Sigma[1][0] << ", " << Sigma[1][1] << "]\n";
    cout << "Average log-likelihood: " << avg_loglik << "\n";

    // Score a new point
    Vec x_new = {2.4, 0.7};
    double lp_new = logpdf_mvn(x_new, mu, L);
    cout << "Log-pdf of new point: " << lp_new << "\n";
    return 0;
}
```
This program estimates the mean and covariance by maximum likelihood from data, adds a small ridge to ensure positive definiteness, factors the covariance via Cholesky, and evaluates average log-likelihood and the log-pdf of a new point. It demonstrates end-to-end modeling and scoring.