Math · Intermediate

Confidence Intervals & Prediction Intervals

Key Points

  • A confidence interval estimates a fixed but unknown parameter (like a population mean) with a range that would capture the true value in a long run of repeated samples.
  • A prediction interval estimates the range in which a single future observation will fall; it is wider because it includes both parameter uncertainty and natural randomness.
  • For means with unknown variance and normal data, confidence intervals use the t-distribution with n−1 degrees of freedom.
  • Prediction intervals for a single future observation add an extra 1 under the square root to account for observation noise.
  • In linear regression, you can compute a confidence interval for the mean response at x₀ and a wider prediction interval for a new y at x₀.
  • Frequentist confidence is about long-run coverage, not the probability that a specific computed interval contains the parameter.
  • Always check assumptions: independence, approximate normality (or large n for the CLT), and correct model specification.
  • C++ implementations require numerical routines for quantiles; an accurate inverse normal CDF plus a Cornish–Fisher expansion approximates t-quantiles well.

Prerequisites

  • Basic Probability and Distributions — Understanding normal and t-distributions, expectations, and variance is essential for constructing intervals.
  • Descriptive Statistics — You must compute means, variances, and standard deviations to plug into interval formulas.
  • Central Limit Theorem — Explains why normal approximations for the mean are valid as sample sizes grow.
  • Simple Linear Regression — Needed for understanding intervals for the mean response and prediction in a regression context.
  • Numerical Methods for Quantiles — Implementing critical values in C++ requires inverse CDF approximations or numerical solvers.

Detailed Explanation


01 Overview

Imagine you measure the heights of 30 plants to learn about the average height in the whole greenhouse. You know your sample average, but it is only an estimate. A confidence interval (CI) wraps that estimate with a margin of error to indicate a plausible range for the true average height. If you repeated the experiment many times, most of those computed intervals would cover the true parameter at the advertised rate (for example, 95%).

Now suppose you want to forecast the height of the next plant you will measure. A prediction interval (PI) answers that different question by providing a range in which a single future observation is likely to fall; it is wider because it must also reflect plant-to-plant variability. Conceptually, confidence intervals quantify uncertainty about a fixed but unknown parameter (a mean, a difference of means, a slope), while prediction intervals quantify uncertainty about new data points.

Mathematically, these intervals are built from sampling distributions. For means, we use the normal or t-distribution depending on whether the population variance is known. In regression, we use the fitted model, the residual variance, and the leverage at the prediction point. In practice, you build an interval by taking an estimate ± a critical value times a standard error; the key is to choose the right standard error and distribution so that the interval has the desired long-run coverage.

02 Intuition & Analogies

Think of throwing darts at a hidden bullseye. The bullseye is the true parameter (the real average height in the greenhouse), and your dart throws are sample estimates that jiggle around due to random sampling. A confidence interval is like drawing a circle around where your dart landed, sized so that, if you kept throwing darts and redrawing circles, a fixed percentage of those circles would cover the bullseye. Crucially, the bullseye does not move; your circles do. That is why we say a 95% confidence interval has 95% coverage in repeated sampling, not that the parameter is 95% probable to be in the particular circle you drew.

A prediction interval is more like forecasting where the next dart will land. Even if you knew the bullseye's location exactly, the next dart would still scatter due to randomness. So for prediction we combine two uncertainties: (1) we don't know the bullseye's location precisely (estimation uncertainty), and (2) throws are inherently variable (process noise). This is why prediction intervals are wider than confidence intervals based on the same data.

In linear regression, picture a best-fit line through a cloud of points. A CI for the mean response at some x₀ is a narrow band around the line, reflecting uncertainty about the line's position. A PI for a new point at x₀ is a wider band, reflecting both uncertainty about the line and the fact that new points deviate from it due to residual scatter. Farther from the center of your observed x-values, leverage increases and both bands widen, just as you'd expect when extrapolating.

03 Formal Definition

Let X₁, …, Xₙ be i.i.d. observations from a distribution parameterized by θ. A 100(1−α)% confidence interval (CI) for θ is a random interval C(X₁:ₙ) such that P_θ(θ ∈ C(X₁:ₙ)) ≥ 1−α for all θ, where the probability is over the sampling distribution of the data. For a normal mean with unknown variance, the classic CI is

$$\bar{X} \pm t_{1-\alpha/2,\,n-1} \cdot \frac{S}{\sqrt{n}},$$

where S is the sample standard deviation and $t_{1-\alpha/2,\,\nu}$ is the (1−α/2) quantile of Student's t with ν degrees of freedom. A 100(1−α)% prediction interval (PI) for a future observation X_new is a random interval P(X₁:ₙ) such that P_θ(X_new ∈ P(X₁:ₙ)) ≥ 1−α. Under normality with unknown variance, a PI for a single future observation is

$$\bar{X} \pm t_{1-\alpha/2,\,n-1} \cdot S\sqrt{1 + \frac{1}{n}}.$$

In simple linear regression Y = β₀ + β₁x + ε with ε ∼ N(0, σ²), the CI for the mean response at x₀ is

$$\hat{y}(x_0) \pm t_{1-\alpha/2,\,n-2} \cdot \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}},$$

while the PI for a new response at x₀ is

$$\hat{y}(x_0) \pm t_{1-\alpha/2,\,n-2} \cdot \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}.$$

These intervals achieve nominal coverage asymptotically under mild conditions (e.g., by the Central Limit Theorem) or exactly under normality. The essential building blocks are an estimator, its standard error, and a pivotal quantity whose distribution does not depend on unknown parameters.
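The extra 1 under the square root in the PI comes from a one-line variance calculation: the future observation X_new is independent of the sample, so

$$\operatorname{Var}(X_{\text{new}} - \bar{X}) = \operatorname{Var}(X_{\text{new}}) + \operatorname{Var}(\bar{X}) = \sigma^2 + \frac{\sigma^2}{n} = \sigma^2\left(1 + \frac{1}{n}\right),$$

and under normality $(X_{\text{new}} - \bar{X}) / \bigl(S\sqrt{1 + 1/n}\bigr)$ follows a t distribution with n−1 degrees of freedom, which yields the PI stated above.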

04 When to Use

Use a confidence interval when your goal is to quantify uncertainty about a fixed but unknown parameter: a population mean, a difference of means, a regression slope, or a proportion. For example, after running an A/B test, you may build a CI for the difference in conversion rates to decide whether the effect is practically significant. In scientific measurement, CIs report uncertainty around estimates of physical constants or treatment effects.

Use a prediction interval when your goal is to forecast a single future observation or a small number of future values: the next temperature reading, tomorrow's sales, or the next latency measurement in a system. In regression, select a CI if you care about the expected value E[Y | X = x₀] (e.g., average fuel efficiency at a speed), and a PI if you care about an actual new Y at x₀ (e.g., the next car's efficiency), which varies more around the mean.

If sample sizes are small and the population variance is unknown but data are approximately normal, prefer t-based intervals. For large samples, normal approximations generally suffice (by the CLT). In linear models, ensure the model is appropriate (linearity, homoscedasticity, independent errors) before trusting the intervals. When making multiple intervals, consider adjustments (e.g., Bonferroni) to control the family-wise error rate.

⚠️ Common Mistakes

Misinterpreting confidence: saying "there is a 95% probability the true mean lies in this computed interval" is incorrect in the frequentist framework; the correct statement is about long-run coverage across repeated samples. Another mistake is using a CI when you actually need a PI: forecasts for individual outcomes will be too narrow if you report a CI for the mean.

Ignoring assumptions is common. t-intervals assume approximate normality of the sampling distribution of the mean; with small n and heavy tails or strong skew, coverage can be poor. In regression, failing to check linearity, constant variance, or independence can lead to misleading intervals, and extrapolating far beyond the observed x-range inflates error due to high leverage. Using the wrong standard error or degrees of freedom is another trap: with unknown variance use S and df = n−1 (or n−p in regression), not the population σ.

When sample sizes are tiny, blindly applying asymptotic normal intervals is risky; consider exact or nonparametric methods. Finally, reporting just the interval without context (assumptions, sample size, and method) can mislead stakeholders; always accompany intervals with method notes and diagnostics.

Key Formulas

Sample Mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Explanation: The sample mean is the average of observed values and serves as an estimator for the population mean. It is the center point for many confidence and prediction intervals.

Sample Variance

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

Explanation: This is the unbiased estimator of the population variance. It measures variability of observations around the sample mean.

Z-interval (known variance)

$$\mathrm{CI}_{\mu}:\ \bar{X} \pm z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$$

Explanation: When the population variance is known and data are normal (or n is large), the confidence interval for the mean uses the standard normal critical value. The width shrinks as n increases.

t-interval (unknown variance)

$$\mathrm{CI}_{\mu}:\ \bar{X} \pm t_{1-\alpha/2,\,n-1} \cdot \frac{S}{\sqrt{n}}$$

Explanation: When variance is unknown, replace σ by S and use a t critical value with n−1 degrees of freedom. This yields exact coverage under normality.

One-sample prediction interval

$$\mathrm{PI}_{\text{one future}}:\ \bar{X} \pm t_{1-\alpha/2,\,n-1} \cdot S\sqrt{1 + \frac{1}{n}}$$

Explanation: A future observation deviates from the mean due to both estimation error and inherent noise. The extra 1 inside the square root accounts for observation noise.

OLS estimates (simple regression)

$$\hat{\beta}_1 = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

Explanation: These formulas compute the slope and intercept of the best-fit line in least squares. They are used to form predictions and intervals in regression.

Residual variance (simple regression)

$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$$

Explanation: This estimates the error variance in simple linear regression using n−2 degrees of freedom. It drives the width of regression intervals.

Regression CI for mean response

$$\mathrm{CI}_{E[Y \mid x_0]}:\ \hat{y}_0 \pm t_{1-\alpha/2,\,n-2} \cdot \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum (x_i-\bar{x})^2}}$$

Explanation: This interval quantifies uncertainty about the expected response at x0. It narrows with more data and widens farther from the center of x.

Regression prediction interval

$$\mathrm{PI}_{Y_{\text{new}} \mid x_0}:\ \hat{y}_0 \pm t_{1-\alpha/2,\,n-2} \cdot \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum (x_i-\bar{x})^2}}$$

Explanation: This covers a single new observation at x0 and is wider than the CI because it includes the observation noise term (the leading 1 under the root).

Coverage definition

$$P_{\theta}\left(\theta \in C(X_{1:n})\right) = 1 - \alpha$$

Explanation: By design, a 100(1−α) confidence procedure satisfies this coverage probability across repeated sampling. It is a property of the method, not of a specific realized interval.

Complexity Analysis

Computing confidence and prediction intervals in the one-sample mean setting requires simple summary statistics: the sample mean and sample variance. These can be obtained in a single pass over n data points, yielding O(n) time and O(1) auxiliary space. The critical-value lookup (z or t quantile) is effectively O(1), assuming a constant-time approximation or a small, bounded number of iterations in a numerical solver, so the overall cost is dominated by scanning the data.

In simple linear regression, computing the slope, intercept, and residual variance requires basic sums: Σxᵢ, Σyᵢ, Σxᵢ², Σxᵢyᵢ, and Σyᵢ². Each sum aggregates in a single pass, so fitting the model and then forming intervals for any fixed x₀ is O(n) time with O(1) extra space. Once the model is fit, computing an interval at each new x₀ is O(1), making batch predictions efficient after the initial O(n) fit; if you need intervals at many x₀ values, the amortized cost per interval remains O(1) as long as you reuse the precomputed statistics (x̄, Sxx, and σ̂).

From a numerical standpoint, the most delicate operation is evaluating distribution quantiles. A high-quality inverse normal CDF (e.g., Acklam's rational approximation) runs in constant time with near machine-precision accuracy, and t-quantiles can be approximated from normal quantiles via a Cornish–Fisher expansion whose error diminishes as the degrees of freedom grow; this also runs in constant time per query. For typical statistical workloads in C++, both CI and PI computations are therefore linear in the number of data points, with negligible constant overhead for quantile evaluation and arithmetic.

Code Examples

One-sample t Confidence Interval and One-step-ahead Prediction Interval
#include <iostream>
#include <vector>
#include <cmath>
#include <stdexcept>
#include <numeric>

// Inverse standard normal CDF (Acklam's approximation).
// Returns z such that Phi(z) = p for 0 < p < 1.
static double inv_norm_cdf(double p) {
    if (p <= 0.0 || p >= 1.0) throw std::invalid_argument("p must be in (0,1)");
    // Coefficients for Acklam's approximation
    const double a1 = -3.969683028665376e+01;
    const double a2 =  2.209460984245205e+02;
    const double a3 = -2.759285104469687e+02;
    const double a4 =  1.383577518672690e+02;
    const double a5 = -3.066479806614716e+01;
    const double a6 =  2.506628277459239e+00;

    const double b1 = -5.447609879822406e+01;
    const double b2 =  1.615858368580409e+02;
    const double b3 = -1.556989798598866e+02;
    const double b4 =  6.680131188771972e+01;
    const double b5 = -1.328068155288572e+01;

    const double c1 = -7.784894002430293e-03;
    const double c2 = -3.223964580411365e-01;
    const double c3 = -2.400758277161838e+00;
    const double c4 = -2.549732539343734e+00;
    const double c5 =  4.374664141464968e+00;
    const double c6 =  2.938163982698783e+00;

    const double d1 = 7.784695709041462e-03;
    const double d2 = 3.224671290700398e-01;
    const double d3 = 2.445134137142996e+00;
    const double d4 = 3.754408661907416e+00;

    const double plow  = 0.02425;
    const double phigh = 1.0 - plow;

    double q, r;
    if (p < plow) {
        // Rational approximation for the lower region
        q = std::sqrt(-2.0 * std::log(p));
        return (((((c1*q + c2)*q + c3)*q + c4)*q + c5)*q + c6) /
               ((((d1*q + d2)*q + d3)*q + d4)*q + 1.0);
    } else if (phigh < p) {
        // Rational approximation for the upper region
        q = std::sqrt(-2.0 * std::log(1.0 - p));
        return -(((((c1*q + c2)*q + c3)*q + c4)*q + c5)*q + c6) /
                ((((d1*q + d2)*q + d3)*q + d4)*q + 1.0);
    } else {
        // Rational approximation for the central region
        q = p - 0.5;
        r = q * q;
        return (((((a1*r + a2)*r + a3)*r + a4)*r + a5)*r + a6) * q /
               (((((b1*r + b2)*r + b3)*r + b4)*r + b5)*r + 1.0);
    }
}

// Approximate t-quantile using a Cornish-Fisher expansion around the normal.
// Returns t such that F_t(t; df) ~= p.
static double t_quantile(double p, double df) {
    if (df <= 1.5) throw std::invalid_argument("degrees of freedom must be > 1.5");
    double z  = inv_norm_cdf(p);
    double z3 = z*z*z;
    double z5 = z3*z*z;
    double z7 = z5*z*z;
    double a = (z3 + z) / (4.0*df);
    double b = (5.0*z5 + 16.0*z3 + 3.0*z) / (96.0*df*df);
    double c = (3.0*z7 + 19.0*z5 + 17.0*z3 - 15.0*z) / (384.0*df*df*df); // optional higher-order term
    return z + a + b + c; // accurate for df >= ~5; still reasonable for smaller df
}

struct OneSampleIntervals {
    double mean;
    double s;
    double ci_low;
    double ci_high;
    double pi_low;
    double pi_high;
};

OneSampleIntervals compute_one_sample_intervals(const std::vector<double>& x, double alpha = 0.05) {
    size_t n = x.size();
    if (n < 2) throw std::invalid_argument("Need at least 2 observations");

    // Sample mean
    double sum  = std::accumulate(x.begin(), x.end(), 0.0);
    double mean = sum / static_cast<double>(n);

    // Unbiased sample variance
    double s2 = 0.0;
    for (double xi : x) {
        double d = xi - mean;
        s2 += d * d;
    }
    s2 /= static_cast<double>(n - 1);
    double s = std::sqrt(s2);

    // t critical value for a two-sided (1 - alpha) interval
    double tcrit = t_quantile(1.0 - alpha/2.0, static_cast<double>(n - 1));

    // Confidence interval for the mean
    double se_mean = s / std::sqrt(static_cast<double>(n));
    double ci_low  = mean - tcrit * se_mean;
    double ci_high = mean + tcrit * se_mean;

    // Prediction interval for one future observation
    double se_pred = s * std::sqrt(1.0 + 1.0/static_cast<double>(n));
    double pi_low  = mean - tcrit * se_pred;
    double pi_high = mean + tcrit * se_pred;

    return {mean, s, ci_low, ci_high, pi_low, pi_high};
}

int main() {
    // Example data: measured response times (ms)
    std::vector<double> data = {102, 98, 105, 110, 95, 101, 99, 107, 103, 100};
    double alpha = 0.05; // 95% intervals

    try {
        auto res = compute_one_sample_intervals(data, alpha);
        std::cout << "n = " << data.size() << "\n";
        std::cout << "Sample mean = " << res.mean << ", sample s = " << res.s << "\n";
        std::cout << "95% CI for mean: [" << res.ci_low << ", " << res.ci_high << "]\n";
        std::cout << "95% PI for next observation: [" << res.pi_low << ", " << res.pi_high << "]\n";
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << "\n";
        return 1;
    }
    return 0;
}

This program computes a one-sample t confidence interval for the population mean and a prediction interval for a single future observation under normality with unknown variance. It first calculates the sample mean and unbiased sample variance, then uses an accurate inverse normal CDF and a Cornish–Fisher expansion to approximate the t critical value. The CI uses S/√n as the standard error, while the PI uses S·√(1 + 1/n), making the prediction interval wider to reflect observation noise.

Time: O(n) · Space: O(1)
Simple Linear Regression: CI for Mean Response and PI for New Observation at x0
#include <iostream>
#include <vector>
#include <cmath>
#include <stdexcept>
#include <numeric>

// Inverse standard normal CDF (Acklam's approximation); same routine as above.
static double inv_norm_cdf(double p) {
    if (p <= 0.0 || p >= 1.0) throw std::invalid_argument("p must be in (0,1)");
    const double a1 = -3.969683028665376e+01;
    const double a2 =  2.209460984245205e+02;
    const double a3 = -2.759285104469687e+02;
    const double a4 =  1.383577518672690e+02;
    const double a5 = -3.066479806614716e+01;
    const double a6 =  2.506628277459239e+00;

    const double b1 = -5.447609879822406e+01;
    const double b2 =  1.615858368580409e+02;
    const double b3 = -1.556989798598866e+02;
    const double b4 =  6.680131188771972e+01;
    const double b5 = -1.328068155288572e+01;

    const double c1 = -7.784894002430293e-03;
    const double c2 = -3.223964580411365e-01;
    const double c3 = -2.400758277161838e+00;
    const double c4 = -2.549732539343734e+00;
    const double c5 =  4.374664141464968e+00;
    const double c6 =  2.938163982698783e+00;

    const double d1 = 7.784695709041462e-03;
    const double d2 = 3.224671290700398e-01;
    const double d3 = 2.445134137142996e+00;
    const double d4 = 3.754408661907416e+00;

    const double plow  = 0.02425;
    const double phigh = 1.0 - plow;

    double q, r;
    if (p < plow) {
        q = std::sqrt(-2.0 * std::log(p));
        return (((((c1*q + c2)*q + c3)*q + c4)*q + c5)*q + c6) /
               ((((d1*q + d2)*q + d3)*q + d4)*q + 1.0);
    } else if (phigh < p) {
        q = std::sqrt(-2.0 * std::log(1.0 - p));
        return -(((((c1*q + c2)*q + c3)*q + c4)*q + c5)*q + c6) /
                ((((d1*q + d2)*q + d3)*q + d4)*q + 1.0);
    } else {
        q = p - 0.5;
        r = q * q;
        return (((((a1*r + a2)*r + a3)*r + a4)*r + a5)*r + a6) * q /
               (((((b1*r + b2)*r + b3)*r + b4)*r + b5)*r + 1.0);
    }
}

// Approximate t-quantile via Cornish-Fisher
static double t_quantile(double p, double df) {
    if (df <= 1.5) throw std::invalid_argument("degrees of freedom must be > 1.5");
    double z  = inv_norm_cdf(p);
    double z3 = z*z*z;
    double z5 = z3*z*z;
    double z7 = z5*z*z;
    double a = (z3 + z) / (4.0*df);
    double b = (5.0*z5 + 16.0*z3 + 3.0*z) / (96.0*df*df);
    double c = (3.0*z7 + 19.0*z5 + 17.0*z3 - 15.0*z) / (384.0*df*df*df);
    return z + a + b + c;
}

struct RegressionIntervals {
    double beta0;
    double beta1;
    double sigma;
    double mean_ci_low;
    double mean_ci_high;
    double pred_pi_low;
    double pred_pi_high;
};

RegressionIntervals compute_regression_intervals(const std::vector<double>& x,
                                                 const std::vector<double>& y,
                                                 double x0,
                                                 double alpha = 0.05) {
    size_t n = x.size();
    if (n != y.size() || n < 3) throw std::invalid_argument("Need n >= 3 and matching x,y sizes");

    double sx = std::accumulate(x.begin(), x.end(), 0.0);
    double sy = std::accumulate(y.begin(), y.end(), 0.0);
    double xbar = sx / n;
    double ybar = sy / n;

    // Centered sums for the slope
    double Sxx = 0.0, Sxy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        Sxx += (x[i] - xbar) * (x[i] - xbar);
        Sxy += (x[i] - xbar) * (y[i] - ybar);
    }
    if (Sxx == 0.0) throw std::runtime_error("All x values are identical; cannot fit slope");

    double beta1 = Sxy / Sxx;
    double beta0 = ybar - beta1 * xbar;

    // Residual variance estimate (df = n - 2)
    double rss = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double e = y[i] - (beta0 + beta1 * x[i]);
        rss += e * e;
    }
    double sigma = std::sqrt(rss / static_cast<double>(n - 2));

    // Prediction at x0
    double yhat0 = beta0 + beta1 * x0;

    // t critical value with df = n - 2
    double tcrit = t_quantile(1.0 - alpha/2.0, static_cast<double>(n - 2));

    // Standard errors for the mean response and for a new observation
    double se_mean = sigma * std::sqrt( (1.0/n) + ((x0 - xbar)*(x0 - xbar))/Sxx );
    double se_pred = sigma * std::sqrt( 1.0 + (1.0/n) + ((x0 - xbar)*(x0 - xbar))/Sxx );

    RegressionIntervals out;
    out.beta0 = beta0;
    out.beta1 = beta1;
    out.sigma = sigma;
    out.mean_ci_low  = yhat0 - tcrit * se_mean;
    out.mean_ci_high = yhat0 + tcrit * se_mean;
    out.pred_pi_low  = yhat0 - tcrit * se_pred;
    out.pred_pi_high = yhat0 + tcrit * se_pred;
    return out;
}

int main() {
    // Example data: x = engine size (liters), y = fuel consumption (L/100km)
    std::vector<double> x = {1.2, 1.6, 2.0, 2.4, 3.0, 3.2, 3.6};
    std::vector<double> y = {5.8, 6.2, 7.0, 7.5, 8.5, 9.0, 9.4};

    double x0 = 2.5;     // Predict at this engine size
    double alpha = 0.05; // 95% intervals

    try {
        auto res = compute_regression_intervals(x, y, x0, alpha);
        std::cout << "Fit: y = " << res.beta0 << " + " << res.beta1 << " x\n";
        std::cout << "Residual sigma = " << res.sigma << "\n";
        std::cout << "95% CI for mean at x0: [" << res.mean_ci_low << ", " << res.mean_ci_high << "]\n";
        std::cout << "95% PI for new y at x0: [" << res.pred_pi_low << ", " << res.pred_pi_high << "]\n";
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << "\n";
        return 1;
    }
    return 0;
}

This program fits a simple linear regression by computing slope and intercept from summary sums, then estimates the residual standard deviation. It uses a t critical value (via a Cornish–Fisher approximation) to construct a confidence interval for the mean response at x0 and a wider prediction interval for a new observation at x0. Farther from the average x, both intervals widen due to the leverage term (x0−x̄)^2/Sxx.

Time: O(n) · Space: O(1)
Tags: confidence interval, prediction interval, t distribution, standard error, linear regression, coverage probability, critical value, Cornish–Fisher, inverse normal CDF, central limit theorem, leverage, frequentist inference, residual variance, mean estimation, forecasting