Random Variables & Distributions
Key Points
- A random variable maps uncertain outcomes to numbers and is described by a distribution that assigns likelihoods to values or ranges.
- Discrete random variables use a probability mass function (PMF), while continuous random variables use a probability density function (PDF).
- The cumulative distribution function (CDF) works for both types and gives P(X ≤ x), the probability accumulated by the PMF/PDF up to x.
- Expectation is the long-run average value (a weighted sum or integral), and variance measures spread around the mean.
- You can sample discrete distributions with prefix sums and binary search, and continuous ones via inverse transform or Box–Muller for normals.
- Monte Carlo simulation estimates expectations and probabilities by averaging many random samples.
- Always ensure probabilities sum (or integrate) to 1, and remember that a PDF's value is not a probability by itself.
- Careful seeding and numeric stability (floating-point issues) are essential for correct simulation results in C++.
Prerequisites
- Basic Probability – Understanding events, sample spaces, and conditional probability is essential to define random variables and distributions.
- Single-Variable Calculus – Integrals are needed to compute probabilities and expectations for continuous distributions.
- Sequences and Limits – Convergence concepts underpin the Law of Large Numbers and the Central Limit Theorem.
- C++ Programming Fundamentals – You must know how to compile, use standard libraries, and manage types and I/O to implement samplers.
- Random Number Generation – Using PRNGs (mt19937), seeding, and uniform distributions is the basis for sampling more complex distributions.
- Numerical Stability and Floating-Point – Simulations rely on floating-point arithmetic; avoiding underflow/overflow and precision loss is important.
- Data Structures (Arrays, Binary Search) – Efficient discrete sampling uses prefix arrays and binary search.
Detailed Explanation
Overview
A random variable is a numerical view of uncertainty. Instead of saying "the die shows a face," we say "X = 1, 2, ..., 6" with certain probabilities. The distribution of a random variable specifies how likely each outcome is. For discrete random variables, the distribution is given by a probability mass function (PMF) that assigns a probability to each possible value. For continuous random variables, we use a probability density function (PDF), which describes how probability is distributed over the real line; probabilities are obtained by integrating the PDF over intervals. A unifying tool is the cumulative distribution function (CDF), which gives the probability that the variable is less than or equal to a given value.

Random variables are the building blocks for statistical modeling, machine learning, and randomized algorithms. They let us quantify expectations, risks, and variability. For example, the number of emails you receive in an hour could be modeled by a Poisson random variable (discrete), while the time you wait for the next bus might be modeled by an exponential random variable (continuous). In practice, we simulate random variables to test algorithms, approximate integrals (Monte Carlo), and fit models to data. C++ provides powerful pseudo-random number generators and distributions, and understanding the math behind random variables helps you use these tools correctly and efficiently.
Intuition & Analogies
Think of a random variable as a measuring tape you apply to uncertainty. When you roll a die, you could measure "the face value," giving numbers 1 through 6; this is a discrete random variable. When you measure the waiting time until the next bus, you get any real number, like 3.7 minutes; this is continuous. The distribution is a recipe telling you how likely each measurement is. For a discrete case, the recipe lists ingredients with exact amounts (probabilities for 1, 2, ..., 6). For a continuous case, it's more like a density of sand spread along a line; any single point has no sand by itself, but intervals collect sand (probability) proportional to the area under the curve. The CDF is like a running total: slide from left to right and watch how much probability has accumulated so far.

The expectation is the balance point (center of mass) of the distribution: if you placed weights at each possible value proportional to their probabilities, the expectation is where the seesaw balances. Variance captures how far values typically sit from the center: tightly clustered distributions have small variance; widely spread ones have large variance.

Sampling is like drawing marbles from a jar without seeing inside: each draw follows the distribution's recipe. For discrete distributions, you can label segments on a 0–1 ruler with lengths equal to probabilities; picking a uniform point then lands in one segment. For continuous distributions, the inverse transform method bends the 0–1 ruler using the CDF so that uniform draws map to the correct values. For normals, Box–Muller cleverly transforms two uniforms into bell-shaped numbers. These intuitive pictures guide both theory and practical algorithms.
Formal Definition
A random variable X is a function from a sample space Ω to the real numbers, assigning a number X(ω) to each outcome ω. A discrete random variable takes countably many values and is described by its PMF p(x) = P(X = x). A continuous random variable is described by a nonnegative PDF f with P(a ≤ X ≤ b) = ∫_a^b f(x) dx. In both cases the CDF F(x) = P(X ≤ x) is nondecreasing, right-continuous, and tends to 0 as x → −∞ and to 1 as x → +∞.
When to Use
Use discrete random variables when outcomes are countable: dice, number of website clicks, defect counts, or successes in n trials (binomial). They're ideal for modeling counts, categories, and yes/no processes. Use continuous random variables when measurements can vary smoothly: heights, times, temperatures, or distances. Exponential models are natural for waiting times; normal distributions model aggregated noise and measurement errors.

In algorithm design, random variables and their distributions arise in randomized algorithms (e.g., quicksort pivoting), hashing analysis, and probabilistic data structures (Bloom filters). In simulation, you'll sample from distributions to estimate expectations (Monte Carlo integration), evaluate risk (VaR in finance), or test system performance (queues, A/B testing). The CDF and quantile (inverse CDF) are used for computing probabilities, thresholds (percentiles), and generating samples via inverse transform sampling.

Choose hand-rolled samplers when you need control, portability, or custom distributions; use library distributions (like std::normal_distribution) for speed and reliability. For large discrete supports with many repeated samples, consider O(1) alias sampling after O(n) preprocessing. For heavy-tailed or complex continuous distributions, use transformations, rejection sampling, or Markov chain Monte Carlo (MCMC).
⚠️ Common Mistakes
- Confusing PMF and PDF: a PDF value f(x) is not a probability. Only integrals over intervals yield probabilities. For discrete variables, p(x) itself is a probability.
- Forgetting normalization: discrete probabilities must sum to 1 and continuous densities must integrate to 1. In code, floating-point rounding can cause slight drift; renormalize and clamp the final prefix to 1.
- Bad RNG usage: seeding mt19937 with time repeatedly inside loops or using rand() % n creates bias. Seed once with std::random_device and use uniform_real_distribution for [0,1).
- Off-by-one in CDF prefix arrays: upper_bound vs lower_bound choices matter. With a uniform u in [0,1), using upper_bound on the prefix CDF selects the correct index; ensure the last prefix entry is exactly 1 within epsilon.
- Misinterpreting variance: standard deviation is the square root of variance; comparing spreads should use the same unit as the data (std. dev.), not squared units.
- Ignoring dependence: applying formulas (e.g., Var(X+Y) = Var(X)+Var(Y)) without independence assumptions leads to wrong results.
- Overfitting histograms: too many bins create noise; too few hide structure. Choose bin counts proportional to N^{1/3} or use rules (Freedman–Diaconis).
- Numerical instability in tails: computing 1 - F(x) for large x can lose precision. Prefer survival functions or log-space computations when available.
Key Formulas
CDF
F(x) = P(X ≤ x)
Explanation: The CDF gives the probability that X is at most x. It fully characterizes the distribution for both discrete and continuous cases.
PMF Normalization
Σ_{x ∈ S} p(x) = 1, with p(x) ≥ 0
Explanation: For a discrete random variable with support S, probabilities across all possible values must sum to 1. This ensures a valid distribution.
PDF Normalization
∫_{−∞}^{∞} f(x) dx = 1, with f(x) ≥ 0
Explanation: A continuous random variable's density must integrate to 1 over the real line. Nonnegativity ensures probabilities are not negative.
CDF from PMF/PDF
F(x) = Σ_{t ≤ x} p(t) (discrete);  F(x) = ∫_{−∞}^{x} f(t) dt (continuous)
Explanation: The CDF accumulates probability up to x. For discrete variables it is a sum; for continuous variables it is an integral.
Expectation
E[X] = Σ_x x · p(x) (discrete);  E[X] = ∫ x · f(x) dx (continuous)
Explanation: The expectation is the long-run average value of X. It is a weighted sum or integral of values using their probabilities or densities.
Variance
Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
Explanation: Variance measures spread around the mean. Computing it via E[X²] − (E[X])² avoids directly handling squared deviations in data streams.
Linearity of Expectation
E[aX + bY] = a·E[X] + b·E[Y]
Explanation: Expectation distributes over linear combinations regardless of independence. This is a key tool in algorithm analysis.
Independence (Continuous)
f_{X,Y}(x, y) = f_X(x) · f_Y(y)
Explanation: For independent continuous random variables, their joint density factorizes. The same holds for PMFs in the discrete case.
Marginalization
f_X(x) = ∫ f_{X,Y}(x, y) dy (continuous);  p_X(x) = Σ_y p_{X,Y}(x, y) (discrete)
Explanation: You obtain the distribution of a subset of variables by summing or integrating out the others. This is used in Bayes nets and probabilistic models.
Quantile Function
Q(u) = F⁻¹(u) = inf{ x : F(x) ≥ u }
Explanation: The quantile maps a probability u in [0,1] to the corresponding threshold x. Inverse transform sampling draws X by taking X = Q(U) for U ~ Uniform(0,1).
Exponential Inverse CDF
F⁻¹(u) = −ln(1 − u) / λ
Explanation: Sampling from an exponential distribution with rate λ can be done by transforming a uniform random number using this closed form.
Normal PDF
f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
Explanation: This bell-shaped density is ubiquitous due to the Central Limit Theorem. It is parameterized by mean μ and standard deviation σ.
Binomial PMF
P(X = k) = C(n, k) · p^k · (1 − p)^(n−k),  k = 0, 1, ..., n
Explanation: Models the number of successes in n independent Bernoulli trials with success probability p. Mean is np and variance is np(1-p).
Poisson PMF
P(X = k) = λ^k · e^{−λ} / k!,  k = 0, 1, 2, ...
Explanation: Models counts of rare events over fixed intervals with rate λ. Mean and variance are both λ.
Chebyshev's Inequality
P(|X − μ| ≥ kσ) ≤ 1/k²
Explanation: Provides a distribution-free bound on deviations from the mean. Applied to the sample mean (which has variance σ²/n), it gives conservative guarantees on how far it can stray from the true mean.
Central Limit Theorem
√n · (X̄_n − μ) / σ → N(0, 1) in distribution as n → ∞
Explanation: As n grows, the standardized sample mean approaches a standard normal distribution. This underlies many statistical approximations.
Complexity Analysis
- Discrete sampling with a prefix CDF: O(n) preprocessing, O(log n) per sample via binary search.
- Alias method: O(n) preprocessing, O(1) per sample; preferable for many samples over a large support.
- Inverse transform with a closed-form quantile (e.g., exponential) and Box–Muller: O(1) per sample.
- Monte Carlo estimation: O(N) work for N samples; the standard error of the estimate shrinks as O(1/√N).
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct DiscreteRV {
    vector<double> prefix; // prefix[i] = P(X <= i)
    explicit DiscreteRV(vector<double> probs) {
        if (probs.empty()) throw runtime_error("Empty probability vector");
        // Ensure non-negative and finite
        for (double p : probs) {
            if (!(p >= 0.0) || !isfinite(p)) throw runtime_error("Invalid probability");
        }
        // Normalize to sum to 1 (robust against rounding)
        long double sum = 0.0L;
        for (double p : probs) sum += p;
        if (sum <= 0.0L) throw runtime_error("Sum of probabilities must be positive");
        vector<double> pn;
        pn.reserve(probs.size());
        for (double p : probs) pn.push_back(static_cast<double>(p / sum));
        // Build prefix CDF; clamp the last entry to 1 exactly
        prefix.resize(pn.size());
        partial_sum(pn.begin(), pn.end(), prefix.begin());
        prefix.back() = 1.0; // guard against tiny drift like 0.9999999997
    }
    // Return P(X = k)
    double pmf(size_t k) const {
        if (k >= prefix.size()) return 0.0;
        double prev = (k == 0 ? 0.0 : prefix[k-1]);
        return max(0.0, prefix[k] - prev);
    }
    // Return P(X <= k)
    double cdf(size_t k) const {
        if (k >= prefix.size()) return 1.0;
        return prefix[k];
    }
    // Sample using u ~ Uniform[0,1)
    size_t sample(mt19937 &gen) const {
        uniform_real_distribution<double> U(0.0, 1.0);
        double u = U(gen);
        // upper_bound finds first prefix[i] > u, which is the correct index
        auto it = upper_bound(prefix.begin(), prefix.end(), u);
        size_t idx = static_cast<size_t>(it - prefix.begin());
        if (idx >= prefix.size()) idx = prefix.size() - 1; // safety
        return idx;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Loaded die probabilities for faces 1..6
    vector<double> probs = {0.05, 0.10, 0.20, 0.25, 0.25, 0.15};
    DiscreteRV die(probs);

    // RNG: seed once
    random_device rd;
    mt19937 gen(rd());

    // Empirical test
    const int N = 200000; // number of samples
    vector<int> counts(6, 0);
    for (int i = 0; i < N; ++i) {
        size_t face_idx = die.sample(gen); // 0..5
        ++counts[face_idx];
    }

    cout << fixed << setprecision(4);
    cout << "Face\tPMF(th)\tPMF(emp)\tCDF(th)\n";
    for (size_t k = 0; k < 6; ++k) {
        double pmf_th = die.pmf(k);
        double pmf_emp = static_cast<double>(counts[k]) / N;
        double cdf_th = die.cdf(k);
        cout << (k+1) << '\t' << pmf_th << '\t' << pmf_emp << '\t' << cdf_th << "\n";
    }

    return 0;
}
```
We model a loaded die with a discrete distribution. The constructor normalizes probabilities and builds a prefix CDF. Sampling draws a uniform u in [0,1) and finds the smallest index with prefix[i] > u via upper_bound, which selects an outcome with the correct probability. We compare theoretical PMF/CDF to empirical frequencies from many samples.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct ExponentialRV {
    double lambda; // rate > 0
    explicit ExponentialRV(double lam) : lambda(lam) {
        if (!(lambda > 0.0) || !isfinite(lambda)) throw runtime_error("lambda must be positive and finite");
    }
    // PDF and CDF
    double pdf(double x) const { return (x < 0 ? 0.0 : lambda * exp(-lambda * x)); }
    double cdf(double x) const { return (x < 0 ? 0.0 : 1.0 - exp(-lambda * x)); }
    // Quantile (inverse CDF). Use 1-u to avoid log(0) when u=0.
    double inv_cdf(double u) const {
        if (!(u >= 0.0 && u <= 1.0)) throw runtime_error("u must be in [0,1]");
        if (u == 1.0) return numeric_limits<double>::infinity();
        return -log(1.0 - u) / lambda;
    }
    // Sample using inverse transform with U ~ Uniform(0,1)
    double sample(mt19937 &gen) const {
        uniform_real_distribution<double> U(0.0, 1.0);
        double u = U(gen);
        // ensure u in (0,1) to avoid returning -log(1) = -0
        if (u == 0.0) u = nextafter(0.0, 1.0);
        return inv_cdf(u);
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    double lambda = 2.0; // mean = 1/lambda = 0.5
    ExponentialRV expo(lambda);

    random_device rd;
    mt19937 gen(rd());

    const int N = 200000; // samples
    double sum = 0.0, sumsq = 0.0;
    int count_gt_1 = 0;
    for (int i = 0; i < N; ++i) {
        double x = expo.sample(gen);
        sum += x;
        sumsq += x * x;
        if (x > 1.0) ++count_gt_1;
    }

    double mean_emp = sum / N;
    double var_emp = sumsq / N - mean_emp * mean_emp;

    cout << fixed << setprecision(6);
    cout << "Theoretical mean: " << 1.0 / lambda << "\n";
    cout << "Empirical mean: " << mean_emp << "\n";
    cout << "Theoretical P(X>1): " << exp(-lambda * 1.0) << "\n";
    cout << "Empirical P(X>1): " << static_cast<double>(count_gt_1) / N << "\n";
    cout << "Empirical variance (for reference): " << var_emp << " (theoretical = " << 1.0/(lambda*lambda) << ")\n";

    return 0;
}
```
This program samples from an exponential distribution using inverse transform sampling: if U ~ Uniform(0,1), then X = F^{-1}(U) = -ln(1-U)/λ has the desired distribution. We verify the mean and a tail probability against their theoretical values and report empirical variance.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct NormalBM {
    double mu, sigma;
    bool has_spare = false; // cache the second variate
    double spare = 0.0;
    explicit NormalBM(double m, double s) : mu(m), sigma(s) {
        if (!(sigma > 0.0) || !isfinite(mu) || !isfinite(sigma)) throw runtime_error("Invalid parameters");
    }
    double sample(mt19937 &gen) {
        if (has_spare) { has_spare = false; return mu + sigma * spare; }
        uniform_real_distribution<double> U(0.0, 1.0);
        double u1 = U(gen);
        double u2 = U(gen);
        // Avoid log(0)
        if (u1 == 0.0) u1 = nextafter(0.0, 1.0);
        double r = sqrt(-2.0 * log(u1));
        double theta = 2.0 * M_PI * u2;
        double z0 = r * cos(theta);
        double z1 = r * sin(theta);
        spare = z1; has_spare = true;
        return mu + sigma * z0;
    }
    static double pdf(double x, double mu, double sigma) {
        const double inv = 1.0 / (sigma * sqrt(2.0 * M_PI));
        double z = (x - mu) / sigma;
        return inv * exp(-0.5 * z * z);
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    double mu = 0.0, sigma = 1.0; // standard normal
    NormalBM norm(mu, sigma);
    random_device rd; mt19937 gen(rd());

    const int N = 200000; // samples
    const int B = 21; // histogram bins
    double lo = mu - 4*sigma, hi = mu + 4*sigma;
    double width = (hi - lo) / B;

    vector<int> hist(B, 0);
    double sum = 0.0, sumsq = 0.0;

    for (int i = 0; i < N; ++i) {
        double x = norm.sample(gen);
        sum += x; sumsq += x * x;
        if (x >= lo && x < hi) {
            int b = static_cast<int>((x - lo) / width);
            if (b == B) b = B - 1; // edge case
            ++hist[b];
        }
    }

    double mean_emp = sum / N;
    double var_emp = sumsq / N - mean_emp * mean_emp;

    cout << fixed << setprecision(6);
    cout << "Empirical mean: " << mean_emp << ", variance: " << var_emp << " (theoretical: 0, 1)\n\n";

    cout << "bin_center\temp_pdf\tth_pdf\n";
    for (int b = 0; b < B; ++b) {
        double center = lo + (b + 0.5) * width;
        double emp_pdf = static_cast<double>(hist[b]) / (N * width); // normalize counts to density
        double th_pdf = NormalBM::pdf(center, mu, sigma);
        cout << center << '\t' << emp_pdf << '\t' << th_pdf << '\n';
    }

    return 0;
}
```
Box–Muller transforms two independent uniforms into two independent standard normal variables. We cache one variate for efficiency. A histogram normalized by bin width approximates the PDF; we print empirical vs. theoretical densities at bin centers and report empirical mean/variance.