⚙️ Algorithm · Intermediate

Bootstrap & Resampling Methods

Key Points

  • Bootstrap is a resampling method that estimates uncertainty by repeatedly sampling with replacement from the observed data.
  • It builds an empirical distribution of a statistic (like the mean or median) without assuming a specific underlying distribution.
  • Percentile bootstrap confidence intervals take the lower and upper percentiles of the bootstrap replicates as the interval bounds.
  • The basic (reverse) bootstrap interval reflects the percentile bounds around the original estimate to partially correct bias.
  • Time complexity is typically O(B n), where B is the number of resamples and n is the sample size, and memory is O(B) to store replicates.
  • Use more resamples (e.g., B = 5,000–50,000) for stable intervals, especially for skewed distributions or tail quantiles.
  • Do not use the naive i.i.d. bootstrap for dependent data (time series); use a block bootstrap or another method instead.
  • Always fix a random seed for reproducibility and compute quantiles carefully (sorting and correct indexing).

Prerequisites

  • →Basic probability and sampling — Understanding random sampling, independence, and distributions helps interpret resampling and confidence intervals.
  • →Descriptive statistics (mean, median, variance) — Bootstrap estimates variability of these statistics; you must know how to compute them.
  • →Sorting and selection algorithms — Percentiles require ordering values; quantile computation relies on sorting or selection.
  • →C++ vectors, functions, and RNG — Implementation uses std::vector, std::function, and <random> for resampling.
  • →Simple linear regression — Pairs bootstrap example requires fitting and interpreting a regression slope.
  • →Monte Carlo methods — Bootstrap relies on repeated random simulation to approximate distributions.
  • →Quantiles and percentiles — Confidence intervals are formed from bootstrap quantiles; careful indexing is necessary.

Detailed Explanation


01 Overview

Hook: Imagine you only have one sample of test scores, but your teacher asks, "How sure are you about the class average?" You cannot retake the test, so you need a way to estimate your uncertainty from just the data in hand.

Concept: Bootstrap is a clever workaround that treats your sample as a tiny stand-in for the population. By repeatedly sampling with replacement from your observed data, you create many "fake" datasets (called bootstrap samples). For each fake dataset you recompute the statistic you care about (mean, median, slope, etc.). The spread of these recomputed values mimics how your statistic would vary if you could repeatedly sample from the real population.

Example: Suppose you measured the battery life (in hours) of n phones and want a 95% confidence interval for the median. You draw B bootstrap resamples of size n from your original n values, compute the median for each resample, and then take the 2.5th and 97.5th percentiles of those medians. Those percentiles form a bootstrap confidence interval for the true median. This works even though the median's classical standard error is hard to compute analytically.

Bootstrap methods are widely used because they need far fewer assumptions than traditional formulas and are easy to implement on a computer.

02 Intuition & Analogies

Think of your dataset as a jar of marbles, where each marble is one observed value. You'd like to understand how a statistic (like the mean color value) behaves across different jars drawn from the factory (the population). But you only have this one jar. The bootstrap says: if you can't go back to the factory, pretend your jar is the factory. Shake the jar and draw n marbles with replacement to make a new jar: some marbles repeat, some get left out. Do this many times. Each new jar is a plausible alternate reality of what you could have observed. By computing your statistic on each resampled jar, you build a distribution of outcomes that reflects sampling variability. The middle 95% of these outcomes is where you'd expect your observed statistic to land most of the time if you repeated the study. That middle band becomes your confidence interval. The more resamples you generate, the smoother and more reliable this distribution becomes, much like getting a better picture by taking more photos from slightly different angles.

Why does it help? Many statistics have complicated or unknown formulas for their uncertainty. The bootstrap replaces hard math with simulation: you let the computer imitate repeated experiments by reusing your data. It's especially helpful for medians, trimmed means, regression coefficients under weird noise, or when the underlying distribution is skewed or heavy-tailed.

Just remember: if your data points depend on each other (like hourly temperatures), drawing them independently from the jar breaks their structure, so you need a modified resampling strategy (e.g., block bootstrap).

03 Formal Definition

Let $X_1, \dots, X_n$ be i.i.d. observations from an unknown distribution $F$. Define the empirical distribution $F_n$ that places probability $1/n$ at each observed value. A bootstrap sample $X_1^*, \dots, X_n^*$ is drawn i.i.d. from $F_n$ (equivalently, sample indices with replacement from $\{1, \dots, n\}$). Let the statistic of interest be $\hat\theta = t(X_1, \dots, X_n)$. For $b = 1, \dots, B$, construct the $b$-th bootstrap replicate $\theta_b^* = t(X_1^{*(b)}, \dots, X_n^{*(b)})$, where each $X^{*(b)}$ is an independent draw from $F_n$. The empirical distribution of $\{\theta_b^*\}_{b=1}^{B}$ approximates the sampling distribution of $\hat\theta$. From this, we estimate uncertainty measures:

  • Bootstrap variance: the sample variance of $\{\theta_b^*\}$.
  • Percentile CI at level $1-\alpha$: $[\,q_{\alpha/2}(\theta^*),\ q_{1-\alpha/2}(\theta^*)\,]$, where $q_p$ denotes the $p$-th quantile of the bootstrap replicates.
  • Basic (reverse) CI: $[\,2\hat\theta - q_{1-\alpha/2}(\theta^*),\ 2\hat\theta - q_{\alpha/2}(\theta^*)\,]$.
  • Bootstrap-t (studentized) CI: uses $t_b^* = (\theta_b^* - \hat\theta)/\mathrm{se}_b^*$ and inverts quantiles of $t_b^*$.

As $B \to \infty$, and under regularity conditions, these bootstrap procedures provide consistent approximations to the true sampling behavior of $\hat\theta$.

04 When to Use

  • Complicated statistics: Use bootstrap when analytic formulas for standard errors or confidence intervals are messy or unknown (e.g., median, quantiles, trimmed means, ratios, regression with heteroskedasticity).
  • Small to moderate samples: When n is not large enough for the Central Limit Theorem to give a good approximation, bootstrap often performs better—though very small samples (e.g., n < 10) can still be unreliable.
  • Skewed or heavy-tailed data: Percentile or bootstrap-t intervals can adapt to asymmetry in the sampling distribution better than symmetric normal-based intervals.
  • Model validation: In regression, pairs bootstrap (resample (x,y) pairs) or residual bootstrap (resample residuals) can quantify uncertainty in slopes or predictions without strict distributional assumptions.
  • When not to use: If data are dependent (time series, spatial, clustered), the i.i.d. bootstrap breaks dependence; instead use a block bootstrap or resample whole clusters. If data are censored or truncated, use specialized resampling schemes. With extreme outliers or highly non-smooth functionals, the percentile interval may undercover; consider studentization (bootstrap-t) or BCa adjustments.

⚠️Common Mistakes

  • Resampling without replacement: That is subsampling, not the classical bootstrap. Always sample n items with replacement from the original n.
  • Too few resamples (small B): Using B = 200 can yield noisy quantiles; prefer thousands (e.g., 5,000–50,000) for stable CI ends, especially at 2.5% and 97.5%.
  • Ignoring dependence: Applying i.i.d. bootstrap to time series or clustered data can severely misestimate uncertainty. Use block or cluster bootstrap.
  • Wrong quantile indexing: Percentiles require sorting and careful index selection or interpolation. Off-by-one errors shift coverage.
  • Not fixing a seed: Without a fixed random seed, results vary between runs, making debugging and reports inconsistent.
  • Confusing percentile and basic intervals: Percentile uses quantiles of $\theta^*$ directly; basic reflects these around $\hat\theta$. They behave differently under bias and skewness.
  • Assuming bootstrap removes bias: Bootstrap estimates variability; it doesn’t guarantee unbiased point estimates. For severe bias or boundary parameters, consider bias correction (e.g., BCa) or better estimators.
  • Ignoring computation cost: For statistics that require sorting (e.g., median) or fitting models (regression), each replicate can be expensive. Budget B accordingly and consider parallelization.

Key Formulas

Empirical Distribution

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}$$

Explanation: The empirical distribution assigns equal probability to each observed data point. It is the bootstrap’s stand-in for the unknown population distribution.

Bootstrap Sampling Law

$$P(X^* = x_i) = \frac{1}{n}, \qquad i = 1, \dots, n$$

Explanation: A bootstrap draw picks any observed value with equal probability 1/n. Repeating n times with replacement yields a bootstrap sample.

Bootstrap Replicate

$$\theta_b^* = t\left(X_1^{*(b)}, \dots, X_n^{*(b)}\right), \qquad b = 1, \dots, B$$

Explanation: Each bootstrap replicate is the statistic computed on one resample. The collection of replicates approximates the sampling distribution of the statistic.

Bootstrap Variance

$$\mathrm{Var}_{\text{boot}}(\hat\theta) = \frac{1}{B-1} \sum_{b=1}^{B} \left(\theta_b^* - \bar\theta^*\right)^2$$

Explanation: The sample variance of the bootstrap replicates estimates the variance (squared standard error) of the statistic. The mean of the replicates is $\bar\theta^*$.

Percentile CI

$$\mathrm{CI}_{\text{percentile}} = \left[\, q_{\alpha/2}(\theta^*),\ q_{1-\alpha/2}(\theta^*) \,\right]$$

Explanation: Take the lower and upper quantiles of the bootstrap replicates as the confidence bounds. This leverages the empirical distribution directly.

Basic (Reverse) CI

$$\mathrm{CI}_{\text{basic}} = \left[\, 2\hat\theta - q_{1-\alpha/2}(\theta^*),\ 2\hat\theta - q_{\alpha/2}(\theta^*) \,\right]$$

Explanation: Reflect the percentile bounds around the original estimate to adjust for potential bias and asymmetry in the bootstrap distribution.

Bootstrap-t CI

$$t_b^* = \frac{\theta_b^* - \hat\theta}{\mathrm{se}_b^*}, \qquad \mathrm{CI}_t = \left[\, \hat\theta - q_{1-\alpha/2}(t^*)\,\mathrm{se}(\hat\theta),\ \hat\theta - q_{\alpha/2}(t^*)\,\mathrm{se}(\hat\theta) \,\right]$$

Explanation: Studentize each replicate using its own standard error, then invert quantiles of the t*-distribution. This often improves coverage for skewed or heteroskedastic cases.

Out-of-Bag Expectation

$$E[\#\{\text{unique observations in a bootstrap sample}\}] = n\left(1 - \left(1 - \tfrac{1}{n}\right)^{n}\right) \approx n\left(1 - e^{-1}\right)$$

Explanation: About 63.2% of unique observations appear in a bootstrap sample, leaving roughly 36.8% out-of-bag. This fact is useful for validation and error estimation.

Time Complexity

$$T(n, B) = O(Bn)$$

Explanation: Forming B resamples of size n and evaluating a simple O(n) statistic each time yields linear complexity in both B and n.

Nearest-Rank Quantile

$$k = \lceil p\,(B+1) \rceil, \qquad q_p = \theta^*_{(k)}$$

Explanation: A simple quantile estimator picks the k-th order statistic after sorting replicates. Alternatives use linear interpolation between neighbors.

Complexity Analysis

Let n be the sample size and B the number of bootstrap resamples. Generating one bootstrap sample requires O(n) draws from a uniform integer distribution, and computing a simple statistic like the mean takes O(n) time, so each replicate is O(n). Performing B replicates yields O(B n) time. If the statistic itself is more expensive (e.g., the median via sorting is O(n log n), and fitting a regression ranges from O(n) to O(n³) depending on the algorithm), total time becomes O(B · cost(statistic)).

Memory can be as low as O(n) if you process replicates on the fly (compute, update quantile summaries, and discard). However, the common approach stores all B replicates to compute quantiles precisely, which costs O(B) extra memory. Quantile computation requires sorting the B replicates, adding O(B log B) time and O(B) memory. When B is large and the statistic is cheap, sorting can dominate. You can reduce memory and time by using selection algorithms (O(B)) or streaming quantile approximations (e.g., P², t-digest) at the cost of slight inaccuracy.

Random number generation is also a factor; modern engines like mt19937_64 and vectorized or parallel sampling can mitigate overhead. Bootstrap is embarrassingly parallel: each replicate is independent, enabling near-linear speedup across cores or machines. For dependent data (e.g., time series), naive i.i.d. resampling understates variance; block bootstrap increases per-replicate cost by managing blocks but keeps the same big-O scaling. In all cases, choose B large enough for stable tail quantiles (often thousands to tens of thousands); the Monte Carlo error of percentile estimates shrinks like O(1/√B).

Code Examples

Percentile Bootstrap CI for Mean and Median (Generic Resampler)
#include <bits/stdc++.h>
using namespace std;

// Compute mean of a vector
double mean(const vector<double>& v) {
    return accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}

// Compute median of a vector (copy and select to avoid mutating input)
double median(vector<double> v) {
    size_t n = v.size();
    nth_element(v.begin(), v.begin() + n / 2, v.end());
    double mid = v[n / 2];
    if (n % 2 == 1) return mid;
    // Need the lower middle as well for even n
    auto it = max_element(v.begin(), v.begin() + n / 2);
    return (mid + *it) / 2.0;
}

// Linear interpolation quantile in [0,1] on a copy (sorts the data)
double quantile_linear(vector<double> x, double p) {
    if (x.empty()) throw runtime_error("Empty data in quantile");
    if (p <= 0.0) return *min_element(x.begin(), x.end());
    if (p >= 1.0) return *max_element(x.begin(), x.end());
    sort(x.begin(), x.end());
    double h = (x.size() - 1) * p;
    size_t lo = static_cast<size_t>(floor(h));
    size_t hi = static_cast<size_t>(ceil(h));
    double frac = h - lo;
    return (1.0 - frac) * x[lo] + frac * x[hi];
}

// Generate B bootstrap replicates of a statistic
vector<double> bootstrap_replicates(const vector<double>& data,
                                    size_t B,
                                    function<double(const vector<double>&)> stat,
                                    mt19937_64& rng) {
    const size_t n = data.size();
    vector<double> reps;
    reps.reserve(B);
    uniform_int_distribution<size_t> unif(0, n - 1);
    vector<double> resample(n);
    for (size_t b = 0; b < B; ++b) {
        // Sample n indices with replacement
        for (size_t i = 0; i < n; ++i) resample[i] = data[unif(rng)];
        // Compute the statistic on the resample
        reps.push_back(stat(resample));
    }
    return reps;
}

// Build a (1 - alpha) percentile CI from bootstrap replicates
pair<double, double> percentile_CI_from_reps(const vector<double>& reps, double alpha) {
    // quantile_linear sorts its own copy, so no pre-sorting is needed here
    double lower = quantile_linear(reps, alpha / 2.0);
    double upper = quantile_linear(reps, 1.0 - alpha / 2.0);
    return {lower, upper};
}

int main() {
    // Example dataset: exam scores (skewed by a few high performers)
    vector<double> data = {58, 62, 65, 67, 69, 70, 72, 72, 74, 75,
                           76, 78, 80, 82, 85, 88, 92, 95};

    // Reproducible RNG
    mt19937_64 rng(42);

    // Number of bootstrap resamples
    size_t B = 20000; // large B for stable tail quantiles

    // Bootstrap for mean
    auto reps_mean = bootstrap_replicates(data, B, mean, rng);
    auto ci_mean = percentile_CI_from_reps(reps_mean, 0.05); // 95% CI

    // Bootstrap for median (statistic is more expensive due to selection/sort)
    auto reps_median = bootstrap_replicates(data, B, median, rng);
    auto ci_median = percentile_CI_from_reps(reps_median, 0.05);

    cout.setf(ios::fixed); cout << setprecision(3);
    cout << "Sample mean: " << mean(data) << "\n";
    cout << "95% CI (mean, percentile): [" << ci_mean.first << ", " << ci_mean.second << "]\n";
    cout << "Sample median: " << median(data) << "\n";
    cout << "95% CI (median, percentile): [" << ci_median.first << ", " << ci_median.second << "]\n";

    return 0;
}

This program implements a generic bootstrap engine that resamples with replacement, computes a specified statistic on each resample, and forms a percentile confidence interval by taking the appropriate quantiles of the replicates. It demonstrates two statistics: the mean (O(n) per replicate) and the median (O(n) using nth_element plus a max_element for the even case, which avoids full sorting and is typically near-linear on average). A linear-interpolation quantile function provides smooth percentile estimates.

Time: Mean CI: O(B n). Median CI: approximately O(B n) with selection (worst-case O(B n log n) if full sorts are used). Quantile computation adds O(B log B).
Space: O(B) for storing replicates plus O(n) for the working resample buffer.
Bootstrap-t (Studentized) CI for the Mean
#include <bits/stdc++.h>
using namespace std;

double mean(const vector<double>& v) {
    return accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}

double sample_sd(const vector<double>& v) {
    size_t n = v.size();
    if (n < 2) return 0.0;
    double m = mean(v);
    double ss = 0.0;
    for (double x : v) ss += (x - m) * (x - m);
    return sqrt(ss / static_cast<double>(n - 1));
}

// Linear interpolation quantile
double quantile_linear(vector<double> x, double p) {
    if (x.empty()) throw runtime_error("Empty data in quantile");
    if (p <= 0.0) return *min_element(x.begin(), x.end());
    if (p >= 1.0) return *max_element(x.begin(), x.end());
    sort(x.begin(), x.end());
    double h = (x.size() - 1) * p;
    size_t lo = static_cast<size_t>(floor(h));
    size_t hi = static_cast<size_t>(ceil(h));
    double frac = h - lo;
    return (1.0 - frac) * x[lo] + frac * x[hi];
}

int main() {
    // Slightly skewed data (e.g., response times in ms)
    vector<double> data = {102, 110, 115, 118, 120, 122, 125, 130, 132, 135,
                           140, 145, 160, 200};

    const size_t n = data.size();
    mt19937_64 rng(2024);
    uniform_int_distribution<size_t> unif(0, n - 1);

    // Original estimates
    double theta_hat = mean(data);
    double se_hat = sample_sd(data) / sqrt((double)n);

    // Bootstrap-t: build distribution of t* = (mean* - mean_hat) / se*
    size_t B = 30000; // larger B stabilizes tail quantiles of t*
    vector<double> tstars; tstars.reserve(B);
    vector<double> res(n);

    for (size_t b = 0; b < B; ++b) {
        for (size_t i = 0; i < n; ++i) res[i] = data[unif(rng)];
        double m = mean(res);
        double se_star = sample_sd(res) / sqrt((double)n);
        if (se_star == 0.0) continue; // all-equal resample; skip to avoid division by zero
        double t = (m - theta_hat) / se_star;
        tstars.push_back(t);
    }

    // Compute quantiles of t*
    double alpha = 0.05;
    double q_low = quantile_linear(tstars, alpha / 2.0);
    double q_high = quantile_linear(tstars, 1.0 - alpha / 2.0);

    // Invert to get CI
    double lo = theta_hat - q_high * se_hat;
    double hi = theta_hat - q_low * se_hat;

    cout.setf(ios::fixed); cout << setprecision(3);
    cout << "Mean: " << theta_hat << "\n";
    cout << "SE (orig): " << se_hat << "\n";
    cout << "95% CI (bootstrap-t): [" << lo << ", " << hi << "]\n";

    return 0;
}

This program constructs a bootstrap-t confidence interval for the mean. Each bootstrap replicate computes both the resample mean and its own standard error (s*/√n), forming t* = (mean* − mean_hat)/se*. The interval is obtained by inverting the empirical t*-quantiles and scaling by the original SE. Bootstrap-t often improves coverage for skewed or heteroskedastic data compared to the plain percentile interval.

Time: O(B n) for generating resamples and computing mean and standard deviation per replicate; quantile computation adds O(B log B).
Space: O(B) for t* values plus O(n) for the resample buffer.
Pairs Bootstrap CI for Simple Linear Regression Slope
#include <bits/stdc++.h>
using namespace std;

struct Fit { double slope; double intercept; };

Fit fit_simple_linear(const vector<double>& x, const vector<double>& y) {
    size_t n = x.size();
    double mx = accumulate(x.begin(), x.end(), 0.0) / n;
    double my = accumulate(y.begin(), y.end(), 0.0) / n;
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; ++i) {
        num += (x[i] - mx) * (y[i] - my);
        den += (x[i] - mx) * (x[i] - mx);
    }
    double slope = (den == 0.0) ? 0.0 : num / den;
    double intercept = my - slope * mx;
    return {slope, intercept};
}

// Linear interpolation quantile
double quantile_linear(vector<double> x, double p) {
    if (x.empty()) throw runtime_error("Empty data in quantile");
    if (p <= 0.0) return *min_element(x.begin(), x.end());
    if (p >= 1.0) return *max_element(x.begin(), x.end());
    sort(x.begin(), x.end());
    double h = (x.size() - 1) * p;
    size_t lo = static_cast<size_t>(floor(h));
    size_t hi = static_cast<size_t>(ceil(h));
    double frac = h - lo;
    return (1.0 - frac) * x[lo] + frac * x[hi];
}

int main() {
    // Synthetic slightly noisy linear data: y = 2.0 * x + noise
    vector<double> x, y;
    {
        mt19937_64 rng(7);
        normal_distribution<double> noise(0.0, 1.0);
        for (int i = 0; i < 30; ++i) {
            double xi = i * 0.5; // spacing
            x.push_back(xi);
            y.push_back(2.0 * xi + noise(rng));
        }
    }

    size_t n = x.size();
    Fit fhat = fit_simple_linear(x, y);

    mt19937_64 rng(2025);
    uniform_int_distribution<size_t> unif(0, n - 1);

    // Pairs bootstrap: resample (x,y) rows
    size_t B = 20000;
    vector<double> slopes; slopes.reserve(B);
    vector<double> xb(n), yb(n);

    for (size_t b = 0; b < B; ++b) {
        for (size_t i = 0; i < n; ++i) {
            size_t j = unif(rng);
            xb[i] = x[j];
            yb[i] = y[j];
        }
        Fit fb = fit_simple_linear(xb, yb);
        slopes.push_back(fb.slope);
    }

    double alpha = 0.05;
    double lo = quantile_linear(slopes, alpha / 2.0);
    double hi = quantile_linear(slopes, 1.0 - alpha / 2.0);

    cout.setf(ios::fixed); cout << setprecision(4);
    cout << "Slope estimate: " << fhat.slope << "\n";
    cout << "95% CI (slope, percentile): [" << lo << ", " << hi << "]\n";

    return 0;
}

This example uses pairs bootstrap to build a percentile CI for the slope in simple linear regression. It resamples rows (x,y) to preserve the joint structure, refits the line on each resample, and takes the 2.5% and 97.5% quantiles of the bootstrap slope estimates. Pairs bootstrap is broadly applicable when residual assumptions are questionable.

Time: O(B n) for resampling and O(n) per fit; quantile computation adds O(B log B). Overall O(B n).
Space: O(B) to store slopes plus O(n) for resampled x and y.
Tags: bootstrap, resampling, confidence intervals, percentile bootstrap, basic bootstrap, bootstrap-t, BCa, empirical distribution, Monte Carlo, C++, random sampling, median, linear regression, sampling with replacement, variance estimation