Expectation, Variance & Moments
Key Points
- Expectation is the long-run average value of a random variable and acts like the balance point of its distribution.
- Variance measures how widely values spread around the mean, and it equals E[X^2] minus (E[X])^2.
- Moments, like E[X^k], summarize the shape of a distribution; central moments around the mean capture spread, skewness, and kurtosis.
- Linearity of expectation holds without independence, but variance additivity generally requires independence.
- Numerically stable, one-pass algorithms (like Welford’s) are preferred for computing variance on real data.
- Monte Carlo simulation estimates expectations by averaging many random samples, with error shrinking like 1/sqrt(n).
- Use unbiased sample variance with n-1 in the denominator when estimating population variance from data.
- Avoid assuming E[g(X)] = g(E[X]) unless g is linear; Jensen’s inequality describes the correct relationship for convex functions.
Prerequisites
- Random Variables and Probability Distributions — Expectation and variance are defined with respect to distributions over possible values.
- Summation and Integration — Expectations are computed via sums (discrete) or integrals (continuous).
- Independence and Covariance — Understanding when variances add and how dependencies affect spread requires these concepts.
- Basic Calculus — Integration, derivatives, and series underpin moments and MGFs.
- Floating-Point Arithmetic — Practical computation of moments demands awareness of numerical stability and rounding error.
- Algorithmic Complexity — Evaluating one-pass versus two-pass computations and Monte Carlo scaling requires complexity analysis.
Detailed Explanation
01 Overview
Expectation, variance, and moments are core tools for describing and reasoning about randomness. Expectation (E[X]) is the long-run average value you’d get if you could repeat an experiment indefinitely. Variance (Var(X)) tells you how much variability there is around that average. Moments generalize these ideas: the k-th moment E[X^k] captures aspects of the distribution’s shape. The second moment relates to spread, the third to asymmetry (skewness), and the fourth to tail heaviness (kurtosis).

A foundational identity connects variance to moments: Var(X) = E[X^2] - (E[X])^2. This identity is widely used in both theory and practice because it converts a measure of spread into expectations that are often easier to compute.

In computer science and data practice, these quantities guide algorithm analysis (expected runtime), simulation (Monte Carlo methods), and data summarization. For instance, linearity of expectation simplifies the analysis of randomized algorithms even when components are dependent. Practically, we estimate these values from data streams or simulations using numerically stable one-pass methods that avoid precision loss. Understanding when these measures exist (finite sums or integrals) and how to compute them robustly is crucial for building reliable software that interacts with randomness, from A/B testing to probabilistic data structures and machine learning.
02 Intuition & Analogies
Imagine a seesaw (teeter-totter) with weights placed at different positions along a line. The point where the seesaw balances is like the expectation: it’s the weighted average of positions, with probabilities acting as weights. If the weights are spread far from the balance point, the seesaw is harder to balance; that spread corresponds to variance.

Now think of throwing darts at a target. The center of the cluster is the expectation; the tightness of the cluster is related to variance. If most darts are close to the center, variance is small; if they’re all over the place, variance is large.

Moments refine this picture. The second raw moment E[X^2] measures average squared distance from zero, while the variance E[(X - E[X])^2] measures squared distance from the mean. The third central moment captures whether the cluster leans to one side (skewness): more mass on the right yields positive skew. The fourth central moment gauges how heavy the tails are (kurtosis): more extreme outliers increase it. Moments are like a multi-lens camera focusing on different aspects of shape: average (first), spread (second), tilt (third), and tail weight (fourth).

For computation, think of expectation as an average you can estimate by sampling repeatedly and averaging results. Variance can be found either by computing E[X^2] and E[X] and using Var(X) = E[X^2] - (E[X])^2, or more stably by tracking how each new observation changes the mean and the sum of squared deviations (Welford’s algorithm). The sampling perspective explains why Monte Carlo error shrinks like 1/sqrt(n): more samples average out randomness.
03 Formal Definition
Let X be a random variable. If X is discrete with probability mass function p(x) = P(X = x), its expectation is E[X] = Σ_x x · p(x), provided the sum converges absolutely. If X is continuous with density f, then E[X] = ∫_{-∞}^{∞} x f(x) dx, provided the integral is finite. The variance is Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2, and the standard deviation is σ = √Var(X). For k ≥ 1, the k-th raw moment is E[X^k] and the k-th central moment is E[(X - E[X])^k]; standardized versions of the third and fourth central moments give skewness and kurtosis. The moment generating function M_X(t) = E[e^{tX}], when finite in a neighborhood of 0, satisfies E[X^k] = M_X^{(k)}(0).
04 When to Use
Use expectation when you need the typical or average outcome of a random process: expected runtime of an algorithm, expected reward in decision-making, or average value in simulation. Leverage linearity of expectation to simplify sums of many random components, even if they are dependent; for example, when analyzing the expected number of collisions in hashing.

Use variance when quantifying uncertainty or risk. In A/B testing, a smaller variance leads to tighter confidence intervals. In algorithm analysis, variance helps bound tail probabilities via Chebyshev’s inequality. When combining independent random effects (e.g., sensor noise), variances add, enabling error budgeting.

Use moments for deeper shape diagnostics. The second moment underpins energy computations (e.g., Parseval-like contexts), the third and fourth central moments inform skewness and kurtosis, and raw moments are essential in series expansions or method-of-moments parameter estimation. Moment generating functions, when they exist, uniquely characterize distributions and make it easy to compute moments by differentiation.

In implementation, choose numerically stable one-pass algorithms (Welford/Chan) for online or streaming data and when memory is limited. Use Monte Carlo averages to estimate expectations of complex functions where analytical integration is infeasible. Prefer unbiased sample variance (with n-1) when estimating a population variance from i.i.d. samples.
⚠️ Common Mistakes
- Confusing sample vs. population variance: using 1/n instead of 1/(n-1) when estimating variance from sample data produces a biased estimator; use the unbiased sample variance with n-1 for i.i.d. samples.
- Assuming independence to add variances: Var(X + Y) = Var(X) + Var(Y) only if X and Y are independent (or at least uncorrelated). Otherwise, a covariance term appears.
- Believing E[g(X)] = g(E[X]) for nonlinear g: this is false except for affine functions. Jensen’s inequality gives the correct direction for convex or concave g.
- Numerical instability: computing variance via the naive subtraction Var ≈ \overline{x^2} - \bar{x}^2 can suffer catastrophic cancellation when the mean is large relative to the spread. Use Welford’s algorithm or compensated summation.
- Ignoring existence conditions: moments may diverge for heavy-tailed distributions (e.g., the Cauchy distribution has undefined mean and variance). Always check finiteness before relying on identities.
- Misinterpreting units: variance has squared units; the standard deviation (its square root) returns to the original units and is often more interpretable.
- Small-sample overconfidence: Monte Carlo error decreases like 1/\sqrt{n}; with few samples, spread estimates are very noisy. Report standard errors and confidence intervals.
- Mixing weights incorrectly: for weighted data, both the mean and the variance must use the same weights; do not reuse unweighted formulas on weighted samples.
Key Formulas
Discrete Expectation
E[X] = Σ_x x · P(X = x)
Explanation: The expectation of a discrete random variable is the weighted average of its values using their probabilities. It exists when the series converges absolutely.
Continuous Expectation
E[X] = ∫_{-∞}^{∞} x f(x) dx
Explanation: For a continuous variable with density f(x), the expectation is the integral of x times the density over the real line. It must be finite to be meaningful.
Raw and Central Moments
Raw: μ'_k = E[X^k];  Central: μ_k = E[(X - E[X])^k]
Explanation: Raw moments are expectations of powers of X; central moments are expectations of powers of deviations from the mean. They capture shape features like spread, skewness, and tails.
Variance Identity
Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2
Explanation: Variance equals the expected squared deviation from the mean and can be computed as the second raw moment minus the square of the mean. This identity is widely used for analysis and computation.
Linearity of Expectation
E[aX + bY] = a E[X] + b E[Y]
Explanation: Expectation distributes over addition and scales by constants, regardless of whether variables are independent. This property simplifies many analyses.
Affine Variance
Var(aX + b) = a^2 Var(X)
Explanation: Shifting by b does not change variance; scaling by a multiplies variance by a^2. This follows from the definition of variance.
Covariance
Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]
Explanation: Covariance measures joint variability. If X and Y are independent with finite second moments, then E[XY] = E[X]E[Y] and covariance is zero.
Variance of a Sum
Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
Explanation: Variance of a sum includes variances plus twice the covariance. If X and Y are independent, the covariance term vanishes.
Moment Generating Function
M_X(t) = E[e^{tX}],  with E[X^k] = M_X^{(k)}(0)
Explanation: The MGF, when it exists near zero, encodes all raw moments as derivatives at zero. It can uniquely identify distributions.
Law of Total Expectation
E[X] = E[ E[X | Y] ]
Explanation: The expected value can be computed by conditioning on another variable and then averaging. This is useful for hierarchical models and staged experiments.
Chebyshev’s Inequality
P(|X - μ| ≥ kσ) ≤ 1/k^2
Explanation: The probability of deviating from the mean by at least k standard deviations is at most 1/k^2. It holds for any distribution with finite variance.
Jensen’s Inequality
g(E[X]) ≤ E[g(X)] for convex g (reversed for concave g)
Explanation: A convex function of the mean is no larger than the mean of the function. It explains why E[g(X)] generally differs from g(E[X]) for nonlinear g.
Sample Mean and Unbiased Variance
x̄ = (1/n) Σ_{i=1}^{n} x_i,   s^2 = (1/(n-1)) Σ_{i=1}^{n} (x_i - x̄)^2
Explanation: These are standard estimators for population mean and variance from i.i.d. data. The n-1 factor removes bias in variance estimation.
Variance of the Sample Mean
Var(X̄) = σ^2 / n
Explanation: Averaging n i.i.d. samples reduces variance by a factor of n. The standard error of the mean is σ/√n.
Complexity Analysis
Welford’s one-pass update costs O(1) time and O(1) memory per observation, so processing n observations takes O(n) time without storing the data; a two-pass computation is also O(n) time but must store or re-read the data. Chan’s merge combines two partial summaries in O(1), which enables parallel and distributed aggregation. Monte Carlo error shrinks like 1/sqrt(n), so halving the standard error requires four times as many samples: reaching error ε costs on the order of 1/ε^2 samples.
Code Examples
```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <limits>

// A class that computes mean and variance in one pass using Welford's algorithm.
// It also exposes population and unbiased sample variance.
class OnlineStats {
public:
    OnlineStats() : n_(0), mean_(0.0), M2_(0.0) {}

    // Add a new observation x
    void add(double x) {
        ++n_;
        double delta = x - mean_;
        mean_ += delta / static_cast<double>(n_);
        double delta2 = x - mean_;
        M2_ += delta * delta2; // Accumulate sum of squared deviations
    }

    // Merge another OnlineStats into this one (Chan's method)
    void merge(const OnlineStats& other) {
        if (other.n_ == 0) return;
        if (n_ == 0) { *this = other; return; }
        double n_total = static_cast<double>(n_ + other.n_);
        double delta = other.mean_ - mean_;
        M2_ = M2_ + other.M2_ + delta * delta * (static_cast<double>(n_) * static_cast<double>(other.n_) / n_total);
        mean_ = (static_cast<double>(n_) * mean_ + static_cast<double>(other.n_) * other.mean_) / n_total;
        n_ += other.n_;
    }

    // Number of observations
    std::size_t count() const { return n_; }

    // Mean of the observations
    double mean() const { return mean_; }

    // Population variance: Var = M2 / n (defined if n > 0)
    double variance_population() const {
        return (n_ > 0) ? (M2_ / static_cast<double>(n_)) : std::numeric_limits<double>::quiet_NaN();
    }

    // Unbiased sample variance: s^2 = M2 / (n - 1) (defined if n > 1)
    double variance_unbiased() const {
        return (n_ > 1) ? (M2_ / static_cast<double>(n_ - 1)) : std::numeric_limits<double>::quiet_NaN();
    }

private:
    std::size_t n_; // count
    double mean_;   // running mean
    double M2_;     // sum of squared deviations from the mean
};

int main() {
    // Example data: imagine these are i.i.d. measurements
    std::vector<double> data = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};

    OnlineStats stats;
    for (double x : data) stats.add(x);

    std::cout << "Count: " << stats.count() << "\n";
    std::cout << "Mean: " << stats.mean() << "\n";
    std::cout << "Var (population): " << stats.variance_population() << "\n";
    std::cout << "Var (unbiased sample): " << stats.variance_unbiased() << "\n";

    // Verify Var = E[X^2] - (E[X])^2 using the same data (two-pass for demonstration only)
    double mean = 0.0, mean2 = 0.0;
    for (double x : data) { mean += x; mean2 += x * x; }
    mean /= data.size();
    mean2 /= data.size();
    double var_identity = mean2 - mean * mean;

    std::cout << "Var via identity E[X^2] - (E[X])^2: " << var_identity << "\n";
    return 0;
}
```
This program computes the mean and variance in a single pass using Welford’s algorithm, which is numerically stable. It also demonstrates the identity Var(X) = E[X^2] - (E[X])^2 using a simple two-pass calculation for verification. The merge method anticipates parallel aggregation (used in the third example).
```cpp
#include <iostream>
#include <random>
#include <cmath>
#include <limits>

// Estimate expectations by averaging many random samples.
int main() {
    std::mt19937_64 rng(123456789ULL);               // Fixed seed for reproducibility
    std::normal_distribution<double> dist(0.0, 1.0); // Standard normal N(0,1)

    const std::size_t n = 1000000; // Number of samples (1e6)

    long double sum = 0.0L;   // For E[X]
    long double sumsq = 0.0L; // For E[X^2]
    long double sum4 = 0.0L;  // For E[X^4] (should be 3 for N(0,1))

    for (std::size_t i = 0; i < n; ++i) {
        double x = dist(rng);
        sum += x;
        sumsq += x * x;
        double x2 = x * x;
        sum4 += x2 * x2;
    }

    long double EX = sum / n;
    long double EX2 = sumsq / n;
    long double EX4 = sum4 / n;
    long double VarX = EX2 - EX * EX; // Identity Var = E[X^2] - (E[X])^2

    // Standard error estimates for the mean and for E[X^2]
    long double s2_for_mean = VarX; // Var(X)
    long double se_mean = std::sqrt(s2_for_mean / n);

    // For E[X^2], variance is Var(X^2)/n; estimate Var(X^2) using E[X^4] - (E[X^2])^2
    long double VarX2 = EX4 - EX2 * EX2;
    long double se_EX2 = std::sqrt(VarX2 / n);

    std::cout.setf(std::ios::fixed);
    std::cout.precision(6);
    std::cout << "Monte Carlo with n = " << n << "\n";
    std::cout << "E[X] ≈ " << static_cast<double>(EX) << " (theory: 0)\n";
    std::cout << "E[X^2] ≈ " << static_cast<double>(EX2) << " (theory: 1)\n";
    std::cout << "Var(X) ≈ " << static_cast<double>(VarX) << " (theory: 1)\n";
    std::cout << "E[X^4] ≈ " << static_cast<double>(EX4) << " (theory: 3)\n";
    std::cout << "SE(E[X]) ≈ " << static_cast<double>(se_mean) << "\n";
    std::cout << "SE(E[X^2]) ≈ " << static_cast<double>(se_EX2) << "\n";
    return 0;
}
```
This Monte Carlo program estimates E[X], E[X^2], Var(X), and E[X^4] for X ~ N(0,1) using 1e6 samples. It verifies Var(X) = E[X^2] - (E[X])^2 and reports standard errors that shrink as 1/sqrt(n). Using long double accumulators reduces rounding error in large sums.
```cpp
#include <iostream>
#include <vector>
#include <cmath>

struct Stats {
    std::size_t n = 0;
    long double mean = 0.0L;
    long double M2 = 0.0L; // sum of squared deviations
};

// Update with a new value (Welford step)
void update(Stats &s, long double x) {
    s.n += 1;
    long double delta = x - s.mean;
    s.mean += delta / s.n;
    long double delta2 = x - s.mean;
    s.M2 += delta * delta2;
}

// Merge two partial stats (Chan's formulas)
Stats merge(const Stats &a, const Stats &b) {
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    Stats out;
    out.n = a.n + b.n;
    long double delta = b.mean - a.mean;
    out.mean = (a.n * a.mean + b.n * b.mean) / out.n;
    out.M2 = a.M2 + b.M2 + delta * delta * (static_cast<long double>(a.n) * static_cast<long double>(b.n) / out.n);
    return out;
}

int main() {
    std::vector<long double> data;
    for (int i = 1; i <= 100; ++i) data.push_back(i); // data = 1..100

    // Split into two shards and compute partial stats
    Stats A, B;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i % 2 == 0) update(A, data[i]); else update(B, data[i]);
    }

    // Merge and compare with single-pass
    Stats merged = merge(A, B);

    Stats single;
    for (auto x : data) update(single, x);

    auto pop_var = [](const Stats &s) { return s.M2 / s.n; };
    auto samp_var = [](const Stats &s) { return s.n > 1 ? s.M2 / (s.n - 1) : std::nanl(""); };

    std::cout.setf(std::ios::fixed);
    std::cout.precision(6);
    std::cout << "Merged mean: " << static_cast<double>(merged.mean) << "\n";
    std::cout << "Single-pass mean: " << static_cast<double>(single.mean) << "\n";
    std::cout << "Merged var (pop): " << static_cast<double>(pop_var(merged)) << "\n";
    std::cout << "Single var (pop): " << static_cast<double>(pop_var(single)) << "\n";
    std::cout << "Merged var (sample): " << static_cast<double>(samp_var(merged)) << "\n";
    std::cout << "Single var (sample): " << static_cast<double>(samp_var(single)) << "\n";
    return 0;
}
```
This example shows how to compute statistics on data shards and merge them exactly using Chan’s formulas. The merged result matches a single-pass computation, enabling scalable parallel or distributed variance computation without accessing raw data again.