Hypothesis Testing
Key Points
- Hypothesis testing is a decision-making process to evaluate claims about a population using sample data.
- You start with a null hypothesis (H0) that represents the status quo and an alternative hypothesis (H1) that represents a meaningful change or effect.
- A test statistic summarizes the evidence in your data, and a p-value measures how surprising your data would be if H0 were true.
- If the p-value is below a pre-chosen significance level α (like 0.05), you reject H0; otherwise, you fail to reject H0.
- Type I error (false positive) happens with probability α, while Type II error (false negative) has probability β; the test's power is 1 − β.
- Different tests fit different situations: z-test for means with known variance, t-test for unknown variance, binomial tests for proportions, and permutation tests when assumptions fail.
- Two-sided tests look for differences in either direction; one-sided tests look for a difference in a specific direction.
- In code, hypothesis tests are mainly O(n) for scanning data, with exact combinatorial tests or Monte Carlo resampling trading time for fewer assumptions.
Prerequisites
- Basic probability — Understanding events, independence, and distributions is necessary to interpret p-values and sampling distributions.
- Descriptive statistics — Means, variances, and standard errors are core ingredients of test statistics.
- Normal distribution and CLT — z-tests and many approximations rely on normality or the Central Limit Theorem.
- Combinatorics (binomial coefficients) — Exact tests for proportions use binomial probabilities and combinations.
- Numerical methods in C++ — Stable computation of CDFs and probabilities requires functions like lgamma, erfc, and careful floating-point handling.
Detailed Explanation
01 Overview
Hypothesis testing is a statistical framework for deciding whether observed data provide enough evidence to challenge a default belief about a population. The default belief is called the null hypothesis (H0), such as “the mean equals 10” or “two groups have the same average.” The competing claim is the alternative hypothesis (H1), which states that the parameter differs (two-sided) or is greater/less (one-sided). We compress our data into a test statistic—like a z-score or t-score—that measures how far the data are from what H0 predicts. Using the sampling distribution of that statistic under H0, we compute a p-value: the probability of seeing data as extreme as ours if H0 were true. If this probability is small (below a threshold α, often 0.05), we reject H0 in favor of H1; otherwise, we do not reject H0. This process helps control the frequency of false alarms (Type I errors) while aiming to detect real effects. Different tests apply in different scenarios, depending on assumptions like known variance, normality, independence, or sample size. When assumptions are doubtful, nonparametric or resampling-based tests (like permutation tests) provide robust alternatives. Hypothesis testing is widely used in A/B testing, quality control, medicine, and scientific research to make principled, quantifiable decisions from data.
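To make the procedure concrete, here is a small worked example with illustrative numbers (the sample size, mean, and σ are made up): testing H0: μ = 10 against H1: μ ≠ 10 with known σ = 2, n = 25, and observed sample mean 10.6.

```latex
% Known-sigma z-test: H0: mu = 10, sigma = 2, n = 25, observed mean 10.6
\begin{align*}
\mathrm{SE} &= \frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{25}} = 0.4 \\
z &= \frac{\bar{x} - \mu_0}{\mathrm{SE}} = \frac{10.6 - 10}{0.4} = 1.5 \\
p &= 2\bigl(1 - \Phi(1.5)\bigr) \approx 2\,(1 - 0.9332) \approx 0.134
\end{align*}
% Since p \approx 0.134 > alpha = 0.05, we fail to reject H0.
```

Even though the sample mean sits 1.5 standard errors above the hypothesized value, that deviation is not rare enough under H0 to clear the 0.05 bar.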
02 Intuition & Analogies
Imagine a courtroom. The defendant is presumed innocent (this is H0: no effect, no difference). The prosecutor presents evidence (your sample data). The jury asks: if the defendant were truly innocent (if H0 were true), how likely is it to see evidence this strong? That likelihood is the p-value. If it is very small, the jury concludes the evidence is too strong to be mere coincidence and returns a guilty verdict (reject H0). If it’s not small, the jury does not convict (fail to reject H0), though that doesn’t prove innocence—it only means the evidence isn’t strong enough. The significance level α is like the legal standard; a stricter standard (smaller α) makes convictions rarer but reduces false convictions (Type I errors). Conversely, there’s a risk of letting the guilty go free (Type II error) when real effects exist but the evidence appears weak, especially with small sample sizes.

In a lab or business setting, H0 could be “our current process has a defect rate of 2%,” and H1 might be “the defect rate changed.” You take a sample and measure the defect count. A test statistic converts your data into a standardized score comparing what you saw to what H0 predicts. For a mean with known variability, that’s a z-score showing how many standard errors away your sample mean is from the claimed mean. If it’s far, observing it under H0 is unlikely, and you may reject H0. If assumptions like normality don’t hold, you can still compare groups by repeatedly shuffling labels (a permutation test) to see how unusual your observed difference is compared to what random chance would yield. This keeps the spirit of the courtroom analogy while reducing reliance on fragile assumptions.
03 Formal Definition
Let θ be a population parameter and θ0 a hypothesized value. The null hypothesis is H0: θ = θ0; the alternative is H1: θ ≠ θ0 (two-sided) or H1: θ > θ0 / H1: θ < θ0 (one-sided). A test statistic T = T(X1, …, Xn) summarizes the sample, and its sampling distribution under H0 yields the p-value — for example, p = P(T ≥ t_obs | H0) for a right-tailed test. Given a significance level α ∈ (0, 1) chosen before seeing the data, reject H0 if p ≤ α; otherwise fail to reject H0. Under this rule, P(reject H0 | H0 true) ≤ α is the Type I error rate, β = P(fail to reject H0 | H1 true) is the Type II error rate, and the power of the test is 1 − β.
04 When to Use
Use hypothesis testing when you need a yes/no decision about a claim backed by quantified uncertainty. Typical cases include A/B testing in software (does variant B improve conversion?), manufacturing quality control (has the defect rate changed?), scientific experiments (is a treatment effective?), and cybersecurity (has the mean response time anomalously increased?). Choose the test based on data type and assumptions: use a one-sample z-test for means when σ is known and samples are independent and approximately normal or large (by the Central Limit Theorem); use a one-sample t-test when σ is unknown but normality is plausible; use binomial or proportion tests for counts of successes; use chi-square for categorical independence or goodness-of-fit; and use permutation tests when distributional assumptions are questionable or sample sizes are small. Prefer two-sided tests if any difference (higher or lower) matters; use one-sided only with a clear directional hypothesis specified before seeing the data. If running many tests (multiple metrics or many variants), adjust for multiplicity (e.g., Bonferroni or Holm) to control the overall false positive rate. When estimation matters as much as detection, report confidence intervals and effect sizes alongside p-values to convey practical significance.
⚠️Common Mistakes
1) Misinterpreting p-values: a p-value is not the probability H0 is true; it is the probability of observing data as extreme as yours if H0 were true.
2) Equating “fail to reject” with “accepting H0”: not rejecting simply means insufficient evidence, not proof of no effect.
3) Post-hoc switching between one- and two-sided tests: deciding tail direction after seeing the data inflates false positives.
4) Ignoring assumptions: using a z-test or t-test without checking independence, normality (for small n), or equal variances can yield misleading conclusions.
5) Multiple comparisons: testing many hypotheses at α = 0.05 virtually guarantees some false positives unless you adjust α.
6) Overreliance on statistical significance: a tiny p-value with a huge sample can correspond to a trivial effect size; always consider practical impact.
7) P-hacking and data snooping: repeated peeking, selective reporting, or trying many analyses until something “works” invalidates p-values.
8) Numerical pitfalls: computing exact binomial probabilities via naive factorials causes overflow/underflow; use log-gamma and log-sum-exp tricks.
9) Confusing confidence intervals and tests: for two-sided tests, rejecting at level α is equivalent to the null value lying outside the (1 − α) confidence interval; inconsistency often signals a mistake.
10) Using permutation tests incorrectly: ensure exchangeability under H0; maintain group sizes and use enough permutations or Monte Carlo replicates for stable p-values.
Key Formulas
One-sample z-statistic
z = (x̄ − μ0) / (σ / √n)
Explanation: This standardizes the distance between the sample mean and the hypothesized mean when the population standard deviation is known. Under H0 and normality (or large n), z follows a standard normal distribution.
One-sample t-statistic
t = (x̄ − μ0) / (s / √n)
Explanation: When the population standard deviation is unknown, replace it with the sample standard deviation s. Under H0 and normality, t follows a t-distribution with n − 1 degrees of freedom.
Chi-square statistic
χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ
Explanation: Compares observed counts Oᵢ to expected counts Eᵢ across categories. Under H0 and large-sample conditions, this statistic follows a chi-square distribution with k − 1 degrees of freedom (adjusted for estimated parameters).
Right-tailed p-value
p = P(T ≥ t_obs | H0)
Explanation: For a right-tailed test, the p-value is the probability under the null of observing a test statistic at least as large as the observed one. For left- or two-tailed tests, the tail definition changes accordingly.
Binomial PMF
P(X = k) = C(n, k) · pᵏ (1 − p)ⁿ⁻ᵏ, where C(n, k) = n! / (k! (n − k)!)
Explanation: Gives the probability of exactly k successes in n independent Bernoulli trials with success probability p. This underlies the exact binomial test for proportions.
Exact binomial two-sided p-value
p = Σ over { k : P(X = k | H0) ≤ P(X = k_obs | H0) } of P(X = k | H0)
Explanation: The conservative two-sided p-value sums probabilities of outcomes with probability at most that of the observed count under H0. This ensures the test does not exceed the nominal α level.
Normal CDF via erfc
Φ(z) = ½ · erfc(−z / √2)
Explanation: The standard normal cumulative distribution function can be computed from the complementary error function. This relation allows numerically stable p-value computation in C++ using std::erfc.
Two-sided z-test p-value
p = 2 (1 − Φ(|z|)) = erfc(|z| / √2)
Explanation: For a symmetric normal distribution, the two-sided p-value doubles the upper-tail probability beyond |z|. Using erfc avoids subtractive cancellation for large |z|.
Central Limit Theorem (standardized mean)
(x̄ − μ) / (σ / √n) → N(0, 1) in distribution as n → ∞
Explanation: As sample size grows, the standardized sample mean approaches a standard normal distribution. This justifies z-based inference even when underlying data are not exactly normal.
Bonferroni correction
α_per-test = α / m
Explanation: When testing m hypotheses, using α/m per test controls the family-wise error rate at α. It is simple and conservative, guarding against multiple-comparison inflation.
Complexity Analysis
- z-test and t-test: one O(n) pass to compute the sample mean (and sample variance for t), with O(1) extra space.
- Exact binomial test: sums n + 1 PMF terms, each O(1) via lgamma, for O(n) total time.
- Monte Carlo permutation test: B shuffles of n1 + n2 pooled values cost O(B · (n1 + n2)) time and O(n1 + n2) space; exact enumeration of all C(n1 + n2, n1) group assignments is feasible only for small samples.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute standard normal CDF using erfc (numerically stable)
// Phi(z) = 0.5 * erfc(-z / sqrt(2))
static inline double normal_cdf(double z) {
    return 0.5 * erfc(-z / M_SQRT2); // M_SQRT2 = sqrt(2)
}

struct ZTestResult {
    double mean;
    double z;
    double p_two_sided;
};

ZTestResult z_test_mean_known_sigma(const vector<double>& x, double mu0, double sigma) {
    if (x.empty()) throw invalid_argument("Sample must be non-empty");
    if (!(sigma > 0)) throw invalid_argument("Sigma must be positive");

    // Compute sample mean in O(n)
    double sum = accumulate(x.begin(), x.end(), 0.0);
    double mean = sum / x.size();

    // Standard error and z-statistic
    double se = sigma / sqrt((double)x.size());
    double z = (mean - mu0) / se;

    // Two-sided p-value: p = 2*(1 - Phi(|z|)) = erfc(|z|/sqrt(2))
    double p_two = erfc(fabs(z) / M_SQRT2);

    return {mean, z, p_two};
}

int main() {
    // Example: test H0: mu = 10 with known sigma = 2
    vector<double> data = {9.5, 10.4, 9.8, 10.1, 10.7, 9.9, 10.3, 9.6};
    double mu0 = 10.0;
    double sigma = 2.0;

    auto res = z_test_mean_known_sigma(data, mu0, sigma);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Sample mean = " << res.mean << "\n";
    cout << "z-statistic = " << res.z << "\n";
    cout << "Two-sided p-value = " << res.p_two_sided << "\n";

    double alpha = 0.05;
    cout << (res.p_two_sided < alpha ? "Reject H0 at alpha=0.05" : "Fail to reject H0 at alpha=0.05") << "\n";
    return 0;
}
```
Computes a two-sided z-test for the mean when the population standard deviation σ is known. It calculates the sample mean, standardizes it as a z-score, and uses the complementary error function to obtain a numerically stable p-value. This is appropriate for large samples or when σ is known from prior knowledge.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Log of binomial PMF using lgamma for stability
static inline long double log_binom_pmf(int n, int k, long double p) {
    if (k < 0 || k > n) return -numeric_limits<long double>::infinity();
    if (p == 0.0L) return (k == 0) ? 0.0L : -numeric_limits<long double>::infinity();
    if (p == 1.0L) return (k == n) ? 0.0L : -numeric_limits<long double>::infinity();
    long double lnC = lgammal(n + 1.0L) - lgammal(k + 1.0L) - lgammal((n - k) + 1.0L);
    return lnC + k * logl(p) + (n - k) * logl(1.0L - p);
}

struct BinomialTestResult {
    int n, k;
    long double p0;
    long double p_value_two_sided;
};

BinomialTestResult exact_binomial_test_two_sided(int n, int k_obs, long double p0) {
    if (n <= 0) throw invalid_argument("n must be positive");
    if (k_obs < 0 || k_obs > n) throw invalid_argument("k must be in [0,n]");
    if (!(p0 > 0.0L && p0 < 1.0L)) throw invalid_argument("p0 must be in (0,1)");

    long double logp_obs = log_binom_pmf(n, k_obs, p0);

    // Conservative two-sided p-value: sum probabilities with pmf <= pmf_observed
    long double p_sum = 0.0L;
    for (int k = 0; k <= n; ++k) {
        long double lp = log_binom_pmf(n, k, p0);
        if (lp <= logp_obs + 1e-18L) { // tiny tolerance
            p_sum += expl(lp);
        }
    }

    // Clamp due to numerical rounding
    p_sum = min<long double>(1.0L, max<long double>(0.0L, p_sum));
    return {n, k_obs, p0, p_sum};
}

int main() {
    // Example: 20 trials, 7 successes, test H0: p = 0.25
    int n = 20, k = 7; long double p0 = 0.25L;
    auto res = exact_binomial_test_two_sided(n, k, p0);

    cout.setf(ios::fixed); cout << setprecision(10);
    cout << "n = " << res.n << ", k = " << res.k << ", p0 = " << (double)res.p0 << "\n";
    cout << "Two-sided exact p-value = " << (double)res.p_value_two_sided << "\n";

    double alpha = 0.05;
    cout << (res.p_value_two_sided < alpha ? "Reject H0" : "Fail to reject H0") << " at alpha=0.05\n";
    return 0;
}
```
Implements the exact binomial test for a proportion by summing probabilities of outcomes at least as unlikely as the observed count under H0. It uses log-gamma for numerical stability, then exponentiates to sum in probability space. This is accurate for small and moderate n and does not rely on normal approximations.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct PermTestResult {
    double observed_abs_diff;
    double p_value_two_sided;
};

static double mean_of(const vector<double>& v) {
    return accumulate(v.begin(), v.end(), 0.0) / max<size_t>(1, v.size());
}

PermTestResult permutation_test_diff_means(const vector<double>& A, const vector<double>& B,
                                           int B_iters, uint64_t seed = 1234567ULL) {
    if (A.empty() || B.empty()) throw invalid_argument("Both groups must be non-empty");
    if (B_iters <= 0) throw invalid_argument("Number of permutations must be positive");

    // Observed statistic: absolute difference in means
    double obs = fabs(mean_of(A) - mean_of(B));

    // Combine data
    vector<double> all; all.reserve(A.size() + B.size());
    all.insert(all.end(), A.begin(), A.end());
    all.insert(all.end(), B.begin(), B.end());

    size_t n1 = A.size(); size_t n2 = B.size();

    // Random generator
    mt19937_64 rng(seed);

    int extreme = 0;
    vector<double> tmp = all; // will be shuffled in-place

    for (int it = 0; it < B_iters; ++it) {
        shuffle(tmp.begin(), tmp.end(), rng);

        // Compute means of the first n1 and remaining n2 elements
        double sum1 = 0.0; for (size_t i = 0; i < n1; ++i) sum1 += tmp[i];
        double sum2 = 0.0; for (size_t i = n1; i < n1 + n2; ++i) sum2 += tmp[i];
        double diff = fabs((sum1 / n1) - (sum2 / n2));

        if (diff >= obs - 1e-15) ++extreme; // count as or more extreme
    }

    // Monte Carlo p-value with +1 correction for stability
    double p = (extreme + 1.0) / (B_iters + 1.0);
    return {obs, p};
}

int main() {
    vector<double> A = {10.1, 9.9, 10.3, 10.0, 9.8};
    vector<double> B = {10.7, 10.5, 10.6, 10.4};

    int permutations = 20000; // increase for more precision
    auto res = permutation_test_diff_means(A, B, permutations, 42);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Observed |mean(A) - mean(B)| = " << res.observed_abs_diff << "\n";
    cout << "Two-sided permutation p-value (Monte Carlo) = " << res.p_value_two_sided << "\n";

    double alpha = 0.05;
    cout << (res.p_value_two_sided < alpha ? "Reject H0 (equal means)" : "Fail to reject H0 (equal means)") << " at alpha=0.05\n";
    return 0;
}
```
Performs a two-sided permutation test comparing means of two groups by shuffling labels and recomputing the difference in means. This approach requires only exchangeability under H0 and is robust when normality or equal-variance assumptions are doubtful. The Monte Carlo p-value converges to the true permutation p-value as the number of shuffles increases.