Bootstrap & Resampling Methods
Key Points
- Bootstrap is a resampling method that estimates uncertainty by repeatedly sampling with replacement from the observed data.
- It builds an empirical distribution of a statistic (like the mean or median) without assuming a specific underlying distribution.
- Percentile bootstrap confidence intervals take the lower and upper percentiles of the bootstrap replicates as the interval bounds.
- The basic (reverse) bootstrap interval reflects the percentile bounds around the original estimate to partially correct bias.
- Time complexity is typically O(Bn), where B is the number of resamples and n is the sample size; memory is O(B) to store the replicates.
- Use more resamples (e.g., B = 5,000–50,000) for stable intervals, especially for skewed distributions or tail quantiles.
- Do not use the naive i.i.d. bootstrap for dependent data (time series); use a block bootstrap or another method instead.
- Always fix a random seed for reproducibility, and compute quantiles carefully (sorting and correct indexing).
Prerequisites
- Basic probability and sampling — Understanding random sampling, independence, and distributions helps interpret resampling and confidence intervals.
- Descriptive statistics (mean, median, variance) — Bootstrap estimates variability of these statistics; you must know how to compute them.
- Sorting and selection algorithms — Percentiles require ordering values; quantile computation relies on sorting or selection.
- C++ vectors, functions, and RNG — Implementation uses std::vector, std::function, and <random> for resampling.
- Simple linear regression — Pairs bootstrap example requires fitting and interpreting a regression slope.
- Monte Carlo methods — Bootstrap relies on repeated random simulation to approximate distributions.
- Quantiles and percentiles — Confidence intervals are formed from bootstrap quantiles; careful indexing is necessary.
Detailed Explanation
01 Overview
Hook: Imagine you only have one sample of test scores, but your teacher asks, "How sure are you about the class average?" You cannot retake the test, so you need a way to estimate your uncertainty from just the data in hand.

Concept: Bootstrap is a clever workaround that treats your sample as a tiny stand-in for the population. By repeatedly sampling with replacement from your observed data, you create many "fake" datasets (called bootstrap samples). For each fake dataset you recompute the statistic you care about (mean, median, slope, etc.). The spread of these recomputed values mimics how your statistic would vary if you could repeatedly sample from the real population.

Example: Suppose you measured the battery life (in hours) of n phones and want a 95% confidence interval for the median. You draw B bootstrap resamples of size n from your original n values, compute the median for each resample, and then take the 2.5th and 97.5th percentiles of those medians. Those percentiles form a bootstrap confidence interval for the true median. This works even though the median’s classical standard error is hard to compute analytically. Bootstrap methods are widely used because they need far fewer assumptions than traditional formulas and are easy to implement on a computer.
02 Intuition & Analogies
Think of your dataset as a jar of marbles, where each marble is one observed value. You’d like to understand how a statistic (like the mean color value) behaves across different jars drawn from the factory (the population). But you only have this one jar. The bootstrap says: if you can’t go back to the factory, pretend your jar is the factory. Shake the jar and draw n marbles with replacement to make a new jar—some marbles repeat, some get left out. Do this many times. Each new jar is a plausible alternate reality of what you could have observed.

By computing your statistic on each resampled jar, you build a distribution of outcomes that reflects sampling variability. The middle 95% of these outcomes is where you’d expect your observed statistic to land most of the time if you repeated the study. That middle band becomes your confidence interval. The more resamples you generate, the smoother and more reliable this distribution becomes—much like getting a better picture by taking more photos from slightly different angles.

Why does it help? Many statistics have complicated or unknown formulas for their uncertainty. The bootstrap replaces hard math with simulation: you let the computer imitate repeated experiments by reusing your data. It’s especially helpful for medians, trimmed means, regression coefficients under weird noise, or when the underlying distribution is skewed or heavy-tailed. Just remember: if your data points depend on each other (like hourly temperatures), drawing them independently from the jar breaks their structure, so you need a modified resampling strategy (e.g., block bootstrap).
03 Formal Definition

Let $X_1, \dots, X_n$ be an i.i.d. sample from an unknown distribution $F$, and let $\hat{\theta} = s(X_1, \dots, X_n)$ be the statistic of interest. A bootstrap sample $X_1^*, \dots, X_n^*$ is drawn i.i.d. from the empirical distribution $\hat{F}_n$, i.e., by sampling $n$ values with replacement from the observed data. Each resample yields a replicate $\hat{\theta}^*_b = s(X^*_{b1}, \dots, X^*_{bn})$ for $b = 1, \dots, B$. The bootstrap principle approximates the sampling distribution of $\hat{\theta} - \theta$ by the simulated distribution of $\hat{\theta}^* - \hat{\theta}$; quantiles and moments of the replicates then yield confidence intervals and standard errors.
04 When to Use
- Complicated statistics: Use bootstrap when analytic formulas for standard errors or confidence intervals are messy or unknown (e.g., median, quantiles, trimmed means, ratios, regression with heteroskedasticity).
- Small to moderate samples: When n is not large enough for the Central Limit Theorem to give a good approximation, bootstrap often performs better—though very small samples (e.g., n < 10) can still be unreliable.
- Skewed or heavy-tailed data: Percentile or bootstrap-t intervals can adapt to asymmetry in the sampling distribution better than symmetric normal-based intervals.
- Model validation: In regression, pairs bootstrap (resample (x,y) pairs) or residual bootstrap (resample residuals) can quantify uncertainty in slopes or predictions without strict distributional assumptions.
- When not to use: If data are dependent (time series, spatial, clustered), the i.i.d. bootstrap breaks dependence; instead use a block bootstrap or resample by cluster. If data are censored or truncated, use specialized resampling schemes. With extreme outliers or highly non-smooth functionals, the percentile interval may undercover; consider studentization (bootstrap-t) or BCa adjustments.
⚠️ Common Mistakes
- Resampling without replacement: That is subsampling, not the classical bootstrap. Always sample n items with replacement from the original n.
- Too few resamples (small B): Using B = 200 can yield noisy quantiles; prefer thousands (e.g., 5,000–50,000) for stable CI ends, especially at 2.5% and 97.5%.
- Ignoring dependence: Applying i.i.d. bootstrap to time series or clustered data can severely misestimate uncertainty. Use block or cluster bootstrap.
- Wrong quantile indexing: Percentiles require sorting and careful index selection or interpolation. Off-by-one errors shift coverage.
- Not fixing a seed: Without a fixed random seed, results vary between runs, making debugging and reports inconsistent.
- Confusing percentile and basic intervals: Percentile uses quantiles of the replicates $\hat{\theta}^*$ directly; basic reflects those quantiles around $\hat{\theta}$. They behave differently under bias and skewness.
- Assuming bootstrap removes bias: Bootstrap estimates variability; it doesn’t guarantee unbiased point estimates. For severe bias or boundary parameters, consider bias correction (e.g., BCa) or better estimators.
- Ignoring computation cost: For statistics that require sorting (e.g., median) or fitting models (regression), each replicate can be expensive. Plan B and consider parallelization.
Key Formulas
Empirical Distribution

$$\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}$$

Explanation: The empirical distribution assigns equal probability to each observed data point. It is the bootstrap’s stand-in for the unknown population distribution.
Bootstrap Sampling Law

$$P\left(X_j^* = X_i \mid X_1, \dots, X_n\right) = \frac{1}{n}, \qquad i = 1, \dots, n$$

Explanation: A bootstrap draw picks any observed value with equal probability 1/n. Repeating n times with replacement yields a bootstrap sample.
Bootstrap Replicate

$$\hat{\theta}^*_b = s\left(X^*_{b1}, \dots, X^*_{bn}\right), \qquad b = 1, \dots, B$$

Explanation: Each bootstrap replicate is the statistic computed on one resample. The collection of replicates approximates the sampling distribution of the statistic.
Bootstrap Variance

$$\widehat{\mathrm{Var}}_{\mathrm{boot}} = \frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^*_b - \bar{\theta}^*\right)^2, \qquad \bar{\theta}^* = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^*_b$$

Explanation: The sample variance of the bootstrap replicates estimates the variance (squared standard error) of the statistic. The mean of the replicates is $\bar{\theta}^*$.
Percentile CI

$$\left[\hat{\theta}^*_{(\alpha/2)},\; \hat{\theta}^*_{(1-\alpha/2)}\right]$$

Explanation: Take the lower and upper quantiles of the bootstrap replicates as the confidence bounds. This leverages the empirical distribution directly.
Basic (Reverse) CI

$$\left[2\hat{\theta} - \hat{\theta}^*_{(1-\alpha/2)},\; 2\hat{\theta} - \hat{\theta}^*_{(\alpha/2)}\right]$$

Explanation: Reflect the percentile bounds around the original estimate to adjust for potential bias and asymmetry in the bootstrap distribution.
Bootstrap-t CI

$$t^*_b = \frac{\hat{\theta}^*_b - \hat{\theta}}{\widehat{se}^{\,*}_b}, \qquad \left[\hat{\theta} - t^*_{(1-\alpha/2)}\,\widehat{se},\; \hat{\theta} - t^*_{(\alpha/2)}\,\widehat{se}\right]$$

Explanation: Studentize each replicate using its own standard error, then invert quantiles of the t*-distribution. This often improves coverage for skewed or heteroskedastic cases.
Out-of-Bag Expectation

$$P\left(X_i \notin \text{bootstrap sample}\right) = \left(1 - \frac{1}{n}\right)^{n} \xrightarrow[n \to \infty]{} e^{-1} \approx 0.368$$

Explanation: About 63.2% of unique observations appear in a bootstrap sample, leaving roughly 36.8% out-of-bag. This fact is useful for validation and error estimation.
Time Complexity

$$T(B, n) = O(B \cdot n)$$

Explanation: Forming B resamples of size n and evaluating a simple O(n) statistic each time yields linear complexity in both B and n.
Nearest-Rank Quantile

$$k = \lceil p \cdot B \rceil, \qquad \hat{q}_p = \hat{\theta}^*_{(k)}$$

Explanation: A simple quantile estimator picks the k-th order statistic after sorting replicates. Alternatives use linear interpolation between neighbors.
Complexity Analysis

With a statistic costing $g(n)$ per evaluation, the bootstrap runs in $O(B \cdot g(n))$ time: $O(Bn)$ for the mean, and near-linear on average for the median via selection. Memory is $O(n)$ for the resample buffer plus $O(B)$ to store the replicates.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute mean of a vector
double mean(const vector<double>& v) {
    return accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}

// Compute median of a vector (copy and select to avoid mutating input)
double median(vector<double> v) {
    size_t n = v.size();
    nth_element(v.begin(), v.begin() + n/2, v.end());
    double mid = v[n/2];
    if (n % 2 == 1) return mid;
    // Need the lower middle as well for even n
    auto it = max_element(v.begin(), v.begin() + n/2);
    return (mid + *it) / 2.0;
}

// Linear interpolation quantile in [0,1] on a copy (sorts the data)
double quantile_linear(vector<double> x, double p) {
    if (x.empty()) throw runtime_error("Empty data in quantile");
    if (p <= 0.0) return *min_element(x.begin(), x.end());
    if (p >= 1.0) return *max_element(x.begin(), x.end());
    sort(x.begin(), x.end());
    double h = (x.size() - 1) * p;
    size_t lo = static_cast<size_t>(floor(h));
    size_t hi = static_cast<size_t>(ceil(h));
    double frac = h - lo;
    return (1.0 - frac) * x[lo] + frac * x[hi];
}

// Generate B bootstrap replicates of a statistic
vector<double> bootstrap_replicates(const vector<double>& data,
                                    size_t B,
                                    function<double(const vector<double>&)> stat,
                                    mt19937_64& rng) {
    const size_t n = data.size();
    vector<double> reps;
    reps.reserve(B);
    uniform_int_distribution<size_t> unif(0, n - 1);
    vector<double> resample(n);
    for (size_t b = 0; b < B; ++b) {
        // Sample n indices with replacement
        for (size_t i = 0; i < n; ++i) resample[i] = data[unif(rng)];
        // Compute the statistic on the resample
        reps.push_back(stat(resample));
    }
    return reps;
}

// Build a (1 - alpha) percentile CI from bootstrap replicates
// (quantile_linear sorts its own copy, so no pre-sort is needed)
pair<double,double> percentile_CI_from_reps(const vector<double>& reps, double alpha) {
    double lower = quantile_linear(reps, alpha / 2.0);
    double upper = quantile_linear(reps, 1.0 - alpha / 2.0);
    return {lower, upper};
}

int main() {
    // Example dataset: exam scores (skewed by a few high performers)
    vector<double> data = {58, 62, 65, 67, 69, 70, 72, 72, 74, 75,
                           76, 78, 80, 82, 85, 88, 92, 95};

    // Reproducible RNG
    mt19937_64 rng(42);

    // Number of bootstrap resamples
    size_t B = 20000; // large B for stable tail quantiles

    // Bootstrap for mean
    auto reps_mean = bootstrap_replicates(data, B, mean, rng);
    auto ci_mean = percentile_CI_from_reps(reps_mean, 0.05); // 95% CI

    // Bootstrap for median (statistic is more expensive due to selection)
    auto reps_median = bootstrap_replicates(data, B, median, rng);
    auto ci_median = percentile_CI_from_reps(reps_median, 0.05);

    cout.setf(ios::fixed); cout << setprecision(3);
    cout << "Sample mean: " << mean(data) << "\n";
    cout << "95% CI (mean, percentile): [" << ci_mean.first << ", " << ci_mean.second << "]\n";
    cout << "Sample median: " << median(data) << "\n";
    cout << "95% CI (median, percentile): [" << ci_median.first << ", " << ci_median.second << "]\n";

    return 0;
}
```
This program implements a generic bootstrap engine that resamples with replacement, computes a specified statistic on each resample, and forms a percentile confidence interval by taking the appropriate quantiles of the replicates. It demonstrates two statistics: the mean (O(n) per replicate) and the median (O(n) using nth_element plus a max_element for the even case, which avoids full sorting and is typically near-linear on average). A linear-interpolation quantile function provides smooth percentile estimates.
```cpp
#include <bits/stdc++.h>
using namespace std;

double mean(const vector<double>& v) {
    return accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}

double sample_sd(const vector<double>& v) {
    size_t n = v.size();
    if (n < 2) return 0.0;
    double m = mean(v);
    double ss = 0.0;
    for (double x : v) ss += (x - m) * (x - m);
    return sqrt(ss / static_cast<double>(n - 1));
}

// Linear interpolation quantile
double quantile_linear(vector<double> x, double p) {
    if (x.empty()) throw runtime_error("Empty data in quantile");
    if (p <= 0.0) return *min_element(x.begin(), x.end());
    if (p >= 1.0) return *max_element(x.begin(), x.end());
    sort(x.begin(), x.end());
    double h = (x.size() - 1) * p;
    size_t lo = static_cast<size_t>(floor(h));
    size_t hi = static_cast<size_t>(ceil(h));
    double frac = h - lo;
    return (1.0 - frac) * x[lo] + frac * x[hi];
}

int main() {
    // Slightly skewed data (e.g., response times in ms)
    vector<double> data = {102, 110, 115, 118, 120, 122, 125, 130, 132, 135,
                           140, 145, 160, 200};

    const size_t n = data.size();
    mt19937_64 rng(2024);
    uniform_int_distribution<size_t> unif(0, n - 1);

    // Original estimates
    double theta_hat = mean(data);
    double se_hat = sample_sd(data) / sqrt((double)n);

    // Bootstrap-t: build distribution of t* = (mean* - mean_hat) / se*
    size_t B = 30000; // larger B stabilizes tail quantiles of t*
    vector<double> tstars; tstars.reserve(B);
    vector<double> res(n);

    for (size_t b = 0; b < B; ++b) {
        for (size_t i = 0; i < n; ++i) res[i] = data[unif(rng)];
        double m = mean(res);
        double se_star = sample_sd(res) / sqrt((double)n);
        if (se_star == 0.0) continue; // all-equal resample; skip to avoid division by zero
        double t = (m - theta_hat) / se_star;
        tstars.push_back(t);
    }

    // Compute quantiles of t*
    double alpha = 0.05;
    double q_low = quantile_linear(tstars, alpha/2.0);
    double q_high = quantile_linear(tstars, 1.0 - alpha/2.0);

    // Invert to get CI
    double lo = theta_hat - q_high * se_hat;
    double hi = theta_hat - q_low * se_hat;

    cout.setf(ios::fixed); cout << setprecision(3);
    cout << "Mean: " << theta_hat << "\n";
    cout << "SE (orig): " << se_hat << "\n";
    cout << "95% CI (bootstrap-t): [" << lo << ", " << hi << "]\n";

    return 0;
}
```
This program constructs a bootstrap-t confidence interval for the mean. Each bootstrap replicate computes both the resample mean and its own standard error (s*/√n), forming t* = (mean* − mean_hat)/se*. The interval is obtained by inverting the empirical t*-quantiles and scaling by the original SE. Bootstrap-t often improves coverage for skewed or heteroskedastic data compared to the plain percentile interval.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Fit { double slope; double intercept; };

Fit fit_simple_linear(const vector<double>& x, const vector<double>& y) {
    size_t n = x.size();
    double mx = accumulate(x.begin(), x.end(), 0.0) / n;
    double my = accumulate(y.begin(), y.end(), 0.0) / n;
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; ++i) {
        num += (x[i] - mx) * (y[i] - my);
        den += (x[i] - mx) * (x[i] - mx);
    }
    double slope = (den == 0.0) ? 0.0 : num / den;
    double intercept = my - slope * mx;
    return {slope, intercept};
}

// Linear interpolation quantile
double quantile_linear(vector<double> x, double p) {
    if (x.empty()) throw runtime_error("Empty data in quantile");
    if (p <= 0.0) return *min_element(x.begin(), x.end());
    if (p >= 1.0) return *max_element(x.begin(), x.end());
    sort(x.begin(), x.end());
    double h = (x.size() - 1) * p;
    size_t lo = static_cast<size_t>(floor(h));
    size_t hi = static_cast<size_t>(ceil(h));
    double frac = h - lo;
    return (1.0 - frac) * x[lo] + frac * x[hi];
}

int main() {
    // Synthetic slightly noisy linear data: y = 2.0 * x + noise
    vector<double> x, y;
    {
        mt19937_64 rng(7);
        normal_distribution<double> noise(0.0, 1.0);
        for (int i = 0; i < 30; ++i) {
            double xi = i * 0.5; // spacing
            x.push_back(xi);
            y.push_back(2.0 * xi + noise(rng));
        }
    }

    size_t n = x.size();
    Fit fhat = fit_simple_linear(x, y);

    mt19937_64 rng(2025);
    uniform_int_distribution<size_t> unif(0, n - 1);

    // Pairs bootstrap: resample (x,y) rows
    size_t B = 20000;
    vector<double> slopes; slopes.reserve(B);
    vector<double> xb(n), yb(n);

    for (size_t b = 0; b < B; ++b) {
        for (size_t i = 0; i < n; ++i) {
            size_t j = unif(rng);
            xb[i] = x[j];
            yb[i] = y[j];
        }
        Fit fb = fit_simple_linear(xb, yb);
        slopes.push_back(fb.slope);
    }

    double alpha = 0.05;
    double lo = quantile_linear(slopes, alpha/2.0);
    double hi = quantile_linear(slopes, 1.0 - alpha/2.0);

    cout.setf(ios::fixed); cout << setprecision(4);
    cout << "Slope estimate: " << fhat.slope << "\n";
    cout << "95% CI (slope, percentile): [" << lo << ", " << hi << "]\n";

    return 0;
}
```
This example uses pairs bootstrap to build a percentile CI for the slope in simple linear regression. It resamples rows (x,y) to preserve the joint structure, refits the line on each resample, and takes the 2.5% and 97.5% quantiles of the bootstrap slope estimates. Pairs bootstrap is broadly applicable when residual assumptions are questionable.