Math · Intermediate

Expectation, Variance & Moments

Key Points

  • Expectation is the long-run average value of a random variable and acts like the balance point of its distribution.
  • Variance measures how widely values spread around the mean, and it equals E[X^2] − (E[X])^2.
  • Moments, like E[X^k], summarize the shape of a distribution; central moments around the mean capture spread, skewness, and kurtosis.
  • Linearity of expectation holds without independence, but variance additivity generally requires independence.
  • Numerically stable, one-pass algorithms (like Welford's) are preferred for computing variance on real data.
  • Monte Carlo simulation estimates expectations by averaging many random samples, with error shrinking like 1/√n.
  • Use the unbiased sample variance, with n − 1 in the denominator, when estimating population variance from data.
  • Avoid assuming E[g(X)] = g(E[X]) unless g is linear; Jensen's inequality describes the correct relationship for convex functions.

Prerequisites

  • →Random Variables and Probability Distributions — Expectation and variance are defined with respect to distributions over possible values.
  • →Summation and Integration — Expectations are computed via sums (discrete) or integrals (continuous).
  • →Independence and Covariance — Understanding when variances add and how dependencies affect spread requires these concepts.
  • →Basic Calculus — Integration, derivatives, and series underpin moments and MGFs.
  • →Floating-Point Arithmetic — Practical computation of moments demands awareness of numerical stability and rounding error.
  • →Algorithmic Complexity — Evaluating one-pass versus two-pass computations and Monte Carlo scaling requires complexity analysis.

Detailed Explanation


01 Overview

Expectation, variance, and moments are core tools for describing and reasoning about randomness. Expectation (E[X]) is the long-run average value you'd get if you could repeat an experiment indefinitely. Variance (Var(X)) tells you how much variability there is around that average. Moments generalize these ideas: the k-th moment E[X^k] captures aspects of the distribution's shape—the second moment relates to spread, the third to asymmetry (skewness), and the fourth to tail heaviness (kurtosis). A foundational identity connects variance to moments: Var(X) = E[X^2] - (E[X])^2. This identity is widely used in both theory and practice because it converts a measure of spread into expectations that are often easier to compute.

In computer science and data practice, these quantities guide algorithm analysis (expected runtime), simulation (Monte Carlo methods), and data summarization. For instance, linearity of expectation simplifies the analysis of randomized algorithms even when components are dependent. Practically, we estimate these values from data streams or simulations using numerically stable one-pass methods that avoid precision loss. Understanding when these measures exist (finite integrals/sums) and how to compute them robustly is crucial for building reliable software that interacts with randomness, from A/B testing to probabilistic data structures and machine learning.

02 Intuition & Analogies

Imagine a seesaw (teeter-totter) with weights placed at different positions along a line. The point where the seesaw balances is like the expectation: it's the weighted average of positions, with probabilities acting as weights. If the weights are spread far from the balance point, the seesaw is harder to balance—that spread corresponds to variance.

Now think of throwing darts at a target. The center of the cluster is the expectation; the tightness of the cluster is related to variance. If most darts are close to the center, variance is small; if they're all over the place, variance is large. Moments refine this picture. The second raw moment E[X^2] measures average squared distance from zero, while the variance E[(X - E[X])^2] measures squared distance from the mean. The third central moment captures whether the cluster leans to one side (skewness): more mass on the right yields positive skew. The fourth central moment gauges how heavy the tails are (kurtosis): more extreme outliers increase it. Moments are like a multi-lens camera focusing on different aspects of shape: average (first), spread (second), tilt (third), and tail weight (fourth).

For computation, think of expectation as an average you can estimate by sampling repeatedly and averaging results. Variance can be found either by computing E[X^2] and E[X] and using Var(X) = E[X^2] - (E[X])^2, or more stably by tracking how each new observation changes the mean and the sum of squared deviations (Welford's algorithm). The sampling perspective explains why Monte Carlo error shrinks like 1/√n: more samples average out randomness.

03 Formal Definition

Let X be a real-valued random variable. In the discrete case with values x_i and probabilities p_i, the expectation is E[X] = ∑_i x_i p_i, provided the series converges absolutely. In the continuous case with density f_X(x), the expectation is E[X] = ∫_{−∞}^{∞} x f_X(x) dx, provided the integral is finite.

The k-th raw moment is m_k = E[X^k]; the k-th central moment is μ_k = E[(X - E[X])^k], when these expectations are finite. Variance is the second central moment: Var(X) = E[(X - E[X])^2]. Expanding the square yields Var(X) = E[X^2] - (E[X])^2, a key identity that often simplifies computation.

For affine transformations, if Y = aX + b, then E[Y] = aE[X] + b and Var(Y) = a^2 Var(X). For two variables X and Y with finite second moments, the covariance is Cov(X, Y) = E[XY] - E[X]E[Y]. If X and Y are independent, then E[XY] = E[X]E[Y] and Var(X + Y) = Var(X) + Var(Y).

The moment generating function (when it exists near 0) is M_X(t) = E[e^{tX}], and its k-th derivative at 0 gives the k-th raw moment: M_X^{(k)}(0) = E[X^k]. Moments may not exist for heavy-tailed distributions; care must be taken to verify finiteness.

04 When to Use

Use expectation when you need the typical or average outcome of a random process: expected runtime of an algorithm, expected reward in decision-making, or average value in simulation. Leverage linearity of expectation to simplify sums of many random components, even if they are dependent—for example, analyzing the expected number of collisions in hashing.

Use variance when quantifying uncertainty or risk. In A/B testing, a smaller variance leads to tighter confidence intervals. In algorithm analysis, variance helps bound tail probabilities via Chebyshev's inequality. When combining independent random effects (e.g., sensor noise), variances add, enabling error budgeting.

Use moments for deeper shape diagnostics. The second moment underpins energy computations (e.g., Parseval-like contexts), the third and fourth central moments inform skewness and kurtosis, and raw moments are essential in series expansions or method-of-moments parameter estimation. Moment generating functions, when they exist, uniquely characterize distributions and make it easy to compute moments by differentiation.

In implementation, choose numerically stable one-pass algorithms (Welford/Chan) for online or streaming data and when memory is limited. Use Monte Carlo averages to estimate expectations of complex functions where analytical integration is infeasible. Prefer the unbiased sample variance (with n − 1) when estimating a population variance from i.i.d. samples.

⚠️Common Mistakes

• Confusing sample vs. population variance: using 1/n instead of 1/(n-1) when estimating variance from sample data produces a biased estimator; use the unbiased sample variance with n-1 for i.i.d. samples.
• Assuming independence to add variances: Var(X + Y) = Var(X) + Var(Y) only if X and Y are independent (or at least uncorrelated). Otherwise, a covariance term appears.
• Believing E[g(X)] = g(E[X]) for nonlinear g: this is false except for affine functions. Jensen's inequality gives the correct direction for convex/concave g.
• Numerical instability: computing variance as the mean of squares minus the squared mean can suffer catastrophic cancellation when the mean is large relative to the spread. Use Welford's algorithm or compensated summation.
• Ignoring existence conditions: moments may diverge for heavy-tailed distributions (e.g., the Cauchy distribution has undefined mean and variance). Always check finiteness before relying on identities.
• Misinterpreting units: variance has squared units; the standard deviation (its square root) returns to the original units and is often more interpretable.
• Small-sample overconfidence: Monte Carlo error decreases like 1/√n; with few samples, spread estimates are very noisy. Report standard errors and confidence intervals.
• Mixing weights incorrectly: for weighted data, both the mean and the variance must use the same weights; do not reuse unweighted formulas on weighted samples.

Key Formulas

Discrete Expectation

E[X] = ∑_i x_i p_i

Explanation: The expectation of a discrete random variable is the weighted average of its values using their probabilities. It exists when the series converges absolutely.

Continuous Expectation

E[X] = ∫_{−∞}^{∞} x f_X(x) dx

Explanation: For a continuous variable with density f_X, the expectation is the integral of x times the density over the real line. It must be finite to be meaningful.

Raw and Central Moments

m_k = E[X^k],  μ_k = E[(X − E[X])^k]

Explanation: Raw moments are expectations of powers of X; central moments are expectations of powers of deviations from the mean. They capture shape features like spread, skewness, and tails.

Variance Identity

Var(X) = E[(X − E[X])^2] = E[X^2] − (E[X])^2

Explanation: Variance equals the expected squared deviation from the mean and can be computed as the second raw moment minus the square of the mean. This identity is widely used for analysis and computation.

Linearity of Expectation

E[aX+b]=aE[X]+b

Explanation: Expectation distributes over addition and scales by constants, regardless of whether variables are independent. This property simplifies many analyses.

Affine Variance

Var(aX + b) = a^2 Var(X)

Explanation: Shifting by b does not change variance; scaling by a multiplies the variance by a^2. This follows from the definition of variance.

Covariance

Cov(X,Y)=E[XY]−E[X]E[Y]

Explanation: Covariance measures joint variability. If X and Y are independent with finite second moments, then E[XY] = E[X]E[Y] and covariance is zero.

Variance of a Sum

Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)

Explanation: Variance of a sum includes variances plus twice the covariance. If X and Y are independent, the covariance term vanishes.

Moment Generating Function

M_X(t) = E[e^{tX}],  M_X^{(k)}(0) = E[X^k]

Explanation: The MGF, when it exists near zero, encodes all raw moments as derivatives at zero. It can uniquely identify distributions.

Law of Total Expectation

E[X]=E[E[X∣Y]]

Explanation: The expected value can be computed by conditioning on another variable and then averaging. This is useful for hierarchical models and staged experiments.

Chebyshev’s Inequality

P(|X − μ| ≥ kσ) ≤ 1/k^2

Explanation: The probability of deviating from the mean by k standard deviations is at most 1/k^2. It holds for any distribution with finite variance.

Jensen’s Inequality

g convex ⟹ g(E[X]) ≤ E[g(X)]

Explanation: A convex function of the mean is no larger than the mean of the function. It explains why E[g(X)] generally differs from g(E[X]) for nonlinear g.

Sample Mean and Unbiased Variance

X̄ = (1/n) ∑_{i=1}^{n} x_i,  s^2 = (1/(n−1)) ∑_{i=1}^{n} (x_i − X̄)^2

Explanation: These are standard estimators for population mean and variance from i.i.d. data. The n-1 factor removes bias in variance estimation.

Variance of the Sample Mean

Var(X̄) = Var(X)/n

Explanation: Averaging n i.i.d. samples reduces variance by a factor of n. The standard error of the mean is √(Var(X)/n).

Complexity Analysis

Computing empirical expectation and variance from n observations can be done in a single pass with O(n) time and O(1) extra space using Welford's algorithm. Each update performs a constant number of floating-point operations: compute a deviation from the current mean, update the mean, and accumulate the sum of squared deviations (M2). This approach avoids the need to store the full dataset and prevents the catastrophic cancellation inherent in the naive method that separately computes the mean and the mean of squares. If you also need higher moments (e.g., skewness, kurtosis), you can extend the state with additional accumulators while maintaining O(1) space per stream.

For Monte Carlo estimation of E[g(X)], generating n i.i.d. samples and averaging g(X_i) takes O(n) time, dominated by random number generation and function evaluation. The statistical error (standard deviation) of the estimator decreases like O(1/√n), so reducing error by a factor of 10 typically requires 100 times more samples, which directly impacts runtime. Memory again remains O(1) if you aggregate on the fly.

In parallel or distributed settings, you can compute partial statistics on shards and merge them via Chan's formulas. The merge is O(1) per pair of partials, so a tree reduction over p partitions costs O(p) time with good parallel speedup. Numerically, merging preserves stability because it operates on means and M2 rather than raw sums of squares. Overall, these methods scale linearly with data size and are well-suited to streaming and big-data contexts.

Code Examples

One-pass, numerically stable mean and variance (Welford)
#include <iostream>
#include <vector>
#include <cmath>
#include <limits>

// A class that computes mean and variance in one pass using Welford's algorithm.
// It also exposes population and unbiased sample variance.
class OnlineStats {
public:
    OnlineStats() : n_(0), mean_(0.0), M2_(0.0) {}

    // Add a new observation x
    void add(double x) {
        ++n_;
        double delta = x - mean_;
        mean_ += delta / static_cast<double>(n_);
        double delta2 = x - mean_;
        M2_ += delta * delta2; // Accumulate sum of squared deviations
    }

    // Merge another OnlineStats into this one (Chan's method)
    void merge(const OnlineStats& other) {
        if (other.n_ == 0) return;
        if (n_ == 0) { *this = other; return; }
        double n_total = static_cast<double>(n_ + other.n_);
        double delta = other.mean_ - mean_;
        M2_ = M2_ + other.M2_ + delta * delta *
              (static_cast<double>(n_) * static_cast<double>(other.n_) / n_total);
        mean_ = (static_cast<double>(n_) * mean_ +
                 static_cast<double>(other.n_) * other.mean_) / n_total;
        n_ += other.n_;
    }

    // Number of observations
    std::size_t count() const { return n_; }

    // Mean of the observations
    double mean() const { return mean_; }

    // Population variance: Var = M2 / n (defined if n > 0)
    double variance_population() const {
        return (n_ > 0) ? (M2_ / static_cast<double>(n_))
                        : std::numeric_limits<double>::quiet_NaN();
    }

    // Unbiased sample variance: s^2 = M2 / (n - 1) (defined if n > 1)
    double variance_unbiased() const {
        return (n_ > 1) ? (M2_ / static_cast<double>(n_ - 1))
                        : std::numeric_limits<double>::quiet_NaN();
    }

private:
    std::size_t n_; // count
    double mean_;   // running mean
    double M2_;     // sum of squared deviations from the mean
};

int main() {
    // Example data: imagine these are i.i.d. measurements
    std::vector<double> data = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};

    OnlineStats stats;
    for (double x : data) stats.add(x);

    std::cout << "Count: " << stats.count() << "\n";
    std::cout << "Mean: " << stats.mean() << "\n";
    std::cout << "Var (population): " << stats.variance_population() << "\n";
    std::cout << "Var (unbiased sample): " << stats.variance_unbiased() << "\n";

    // Verify Var = E[X^2] - (E[X])^2 on the same data (two-pass, for demonstration only)
    double mean = 0.0, mean2 = 0.0;
    for (double x : data) { mean += x; mean2 += x * x; }
    mean /= data.size();
    mean2 /= data.size();
    double var_identity = mean2 - mean * mean;

    std::cout << "Var via identity E[X^2] - (E[X])^2: " << var_identity << "\n";
    return 0;
}

This program computes the mean and variance in a single pass using Welford’s algorithm, which is numerically stable. It also demonstrates the identity Var(X) = E[X^2] - (E[X])^2 using a simple two-pass calculation for verification. The merge method anticipates parallel aggregation (used in the third example).

Time: O(n) · Space: O(1)
Monte Carlo estimation of E[X], E[X^2], and Var(X) for a standard normal
#include <iostream>
#include <random>
#include <cmath>
#include <limits>

// Estimate expectations by averaging many random samples.
int main() {
    std::mt19937_64 rng(123456789ULL); // Fixed seed for reproducibility
    std::normal_distribution<double> dist(0.0, 1.0); // Standard normal N(0,1)

    const std::size_t n = 1000000; // Number of samples (1e6)

    long double sum = 0.0L;   // For E[X]
    long double sumsq = 0.0L; // For E[X^2]
    long double sum4 = 0.0L;  // For E[X^4] (should be 3 for N(0,1))

    for (std::size_t i = 0; i < n; ++i) {
        double x = dist(rng);
        sum += x;
        sumsq += x * x;
        double x2 = x * x;
        sum4 += x2 * x2;
    }

    long double EX = sum / n;
    long double EX2 = sumsq / n;
    long double EX4 = sum4 / n;
    long double VarX = EX2 - EX * EX; // Identity Var = E[X^2] - (E[X])^2

    // Standard error estimates for the mean and for E[X^2]
    long double s2_for_mean = VarX; // Var(X)
    long double se_mean = std::sqrt(s2_for_mean / n);

    // For E[X^2], the estimator's variance is Var(X^2)/n;
    // estimate Var(X^2) via E[X^4] - (E[X^2])^2
    long double VarX2 = EX4 - EX2 * EX2;
    long double se_EX2 = std::sqrt(VarX2 / n);

    std::cout.setf(std::ios::fixed); std::cout.precision(6);
    std::cout << "Monte Carlo with n = " << n << "\n";
    std::cout << "E[X] ≈ " << static_cast<double>(EX) << " (theory: 0)\n";
    std::cout << "E[X^2] ≈ " << static_cast<double>(EX2) << " (theory: 1)\n";
    std::cout << "Var(X) ≈ " << static_cast<double>(VarX) << " (theory: 1)\n";
    std::cout << "E[X^4] ≈ " << static_cast<double>(EX4) << " (theory: 3)\n";
    std::cout << "SE(E[X]) ≈ " << static_cast<double>(se_mean) << "\n";
    std::cout << "SE(E[X^2]) ≈ " << static_cast<double>(se_EX2) << "\n";
    return 0;
}

This Monte Carlo program estimates E[X], E[X^2], Var(X), and E[X^4] for X ~ N(0,1) using 1e6 samples. It verifies Var(X) = E[X^2] - (E[X])^2 and reports standard errors that shrink as 1/sqrt(n). Using long double accumulators reduces rounding error in large sums.

Time: O(n) · Space: O(1)
Merging partial means and variances (Chan’s parallel formulas)
#include <iostream>
#include <vector>
#include <cmath>

struct Stats {
    std::size_t n = 0;
    long double mean = 0.0L;
    long double M2 = 0.0L; // sum of squared deviations
};

// Update with a new value (Welford step)
void update(Stats &s, long double x) {
    s.n += 1;
    long double delta = x - s.mean;
    s.mean += delta / s.n;
    long double delta2 = x - s.mean;
    s.M2 += delta * delta2;
}

// Merge two partial stats (Chan's formulas)
Stats merge(const Stats &a, const Stats &b) {
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    Stats out;
    out.n = a.n + b.n;
    long double delta = b.mean - a.mean;
    out.mean = (a.n * a.mean + b.n * b.mean) / out.n;
    out.M2 = a.M2 + b.M2 + delta * delta *
             (static_cast<long double>(a.n) * static_cast<long double>(b.n) / out.n);
    return out;
}

int main() {
    std::vector<long double> data;
    for (int i = 1; i <= 100; ++i) data.push_back(i); // data = 1..100

    // Split into two shards and compute partial stats
    Stats A, B;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i % 2 == 0) update(A, data[i]); else update(B, data[i]);
    }

    // Merge and compare with single-pass
    Stats merged = merge(A, B);

    Stats single;
    for (auto x : data) update(single, x);

    auto pop_var = [](const Stats &s){ return s.M2 / s.n; };
    auto samp_var = [](const Stats &s){ return s.n > 1 ? s.M2 / (s.n - 1) : std::nanl(""); };

    std::cout.setf(std::ios::fixed); std::cout.precision(6);
    std::cout << "Merged mean: " << static_cast<double>(merged.mean) << "\n";
    std::cout << "Single-pass mean: " << static_cast<double>(single.mean) << "\n";
    std::cout << "Merged var (pop): " << static_cast<double>(pop_var(merged)) << "\n";
    std::cout << "Single var (pop): " << static_cast<double>(pop_var(single)) << "\n";
    std::cout << "Merged var (sample): " << static_cast<double>(samp_var(merged)) << "\n";
    std::cout << "Single var (sample): " << static_cast<double>(samp_var(single)) << "\n";
    return 0;
}

This example shows how to compute statistics on data shards and merge them exactly using Chan’s formulas. The merged result matches a single-pass computation, enabling scalable parallel or distributed variance computation without accessing raw data again.

Time: O(n) · Space: O(1)