Autoregressive Models
Key Points
- Autoregressive (AR) models represent a joint distribution by multiplying conditional probabilities in a fixed order, using the chain rule of probability.
- In time series, an AR(p) model predicts the current value as a weighted sum of the last p values plus noise.
- For discrete sequences (like text), n-gram models approximate p(x_i | x_{<i}) with a fixed-size context window and smoothing.
- Training typically maximizes the log-likelihood, which decomposes into a sum of conditional log-probabilities across positions.
- Gaussian AR(p) models can be fit with ordinary least squares, making them fast and interpretable.
- Sequence generation proceeds step by step: sample or predict the next item from p(x_i | x_{<i}), append it, and repeat.
- Choosing the right order p (or context size) balances bias and variance; tools like the PACF and AIC help.
- Always preserve causality: never use future information to predict the past (avoid data leakage).
Prerequisites
- Basic probability and conditional probability – Autoregression relies on the chain rule and conditional distributions.
- Linear algebra – Fitting Gaussian AR(p) via least squares requires solving linear systems.
- Time series stationarity and autocovariance – Model selection and stability constraints for AR(p) depend on stationarity assumptions.
- Hash maps and counting – Efficient n-gram models store context counts in dictionaries for O(1) average lookups.
- Maximum likelihood and log-likelihood – The training objective for autoregressive models is typically likelihood-based.
Detailed Explanation
01 Overview
Autoregressive models describe complex sequences by breaking them into simpler, step-by-step predictions. Instead of modeling a high-dimensional joint probability p(x_1, x_2, ..., x_n) directly, they factor it into a product of conditional probabilities p(x_i | x_{<i}). This is a principled application of the chain rule of probability and underpins many practical models in time series forecasting, natural language processing, speech, and image modeling. In time series, the classic AR(p) model assumes that the current value is a linear combination of the last p values plus random noise. In language modeling, n-gram models estimate the probability of the next token given a fixed-length context of previous tokens, often with smoothing to handle rare or unseen events.
Autoregression is powerful because it turns a hard global problem into many local ones: learn how to predict the next step from the past. Learning usually proceeds by maximizing the log-likelihood, which decomposes cleanly into a sum over positions. Generation is straightforward too: start from an initial context and repeatedly draw the next value from the learned conditional distribution. While modern deep networks also use autoregressive ideas (e.g., causal transformers), the core concept remains the same: carefully respect the order of information, and progressively build the sequence one step at a time.
02 Intuition & Analogies
Imagine telling a story one sentence at a time. Each sentence you write depends on what you've already said, not on the future. That is exactly what autoregression does: it treats a sequence like a story you create left to right. The next word (or value) is chosen based on the words (or values) already written, never peeking ahead. If you're forecasting weather, today's temperature depends on recent days; if you're composing text, the next character depends on the last few characters. This step-by-step dependency is the heart of autoregression.
A cooking analogy helps too. When making pancakes, each step depends on what you've mixed so far: if you've already added eggs and milk, the next step might be flour. You don't decide the third step based on a step you haven't taken yet. Autoregression formalizes this idea: given your current context (what's already in the bowl), it tells you the probability of each possible next ingredient.
Another way to think about it: predicting the next item is easier than modeling everything at once. If you tried to specify every possible complete text or time series, the space is enormous. But learning to guess "what comes next" from a short context is manageable and data efficient. This is why simple linear AR models often work surprisingly well in time series, and why n-gram models capture a lot of structure in language despite their simplicity. The trade-off is context length: too short and you miss important patterns; too long and you overfit or become computationally heavy.
03 Formal Definition
Given an ordered sequence x_1, ..., x_n, an autoregressive model factorizes the joint distribution as p(x_1, ..., x_n) = ∏_{i=1}^{n} p(x_i | x_{<i}), where x_{<i} = (x_1, ..., x_{i-1}). An order-p model additionally assumes a Markov property: p(x_i | x_{<i}) = p(x_i | x_{i-p}, ..., x_{i-1}), so only the last p elements matter. The Gaussian AR(p) model specializes this to x_t = c + Σ_{k=1}^{p} φ_k x_{t-k} + ε_t with ε_t ~ N(0, σ^2), while discrete n-gram models parameterize p(x_i | x_{i-p}, ..., x_{i-1}) as a categorical distribution over the vocabulary.
04 When to Use
Use autoregressive models when data are naturally ordered and the order conveys information: time series (finance, sensors, traffic), language (characters, words), audio waveforms, or raster-scanned images. Choose linear Gaussian AR(p) when relationships are approximately linear, noise is roughly Gaussian, and interpretability and speed matter. AR models are excellent baselines for forecasting, anomaly detection (via residuals), and control.
Use n-gram autoregressive models for small- to medium-sized text corpora, quick baselines, or when you need simple, explainable next-token probabilities. They are also useful pre-processing tools (e.g., for backoff language models) and as sanity checks before deploying complex neural models.
Autoregression is also a training principle for modern deep models (e.g., causal convolutions, RNNs, transformers) where you need to: (1) prevent information leakage from the future; (2) define a tractable likelihood; and (3) enable straightforward sampling. When you need calibrated probabilities, likelihood-based AR models are appealing. If long-range dependencies dominate and the data are plentiful, consider higher-order models or neural autoregressive architectures while preserving the autoregressive factorization.
⚠️ Common Mistakes
- Data leakage: Using future information to predict the present (e.g., using x_{t+1} as a feature for predicting x_t) breaks causality and leads to overly optimistic performance. Always ensure features are strictly from the past at prediction time.
- Mis-specified order p: Too small a p underfits (misses dependencies); too large a p overfits and inflates variance. Use diagnostics like the PACF, information criteria (AIC/BIC), and cross-validation to select p.
- Instability in AR(p): Coefficients must satisfy stationarity (characteristic polynomial roots outside the unit circle). Ignoring this can cause exploding forecasts. Check or constrain parameters during estimation.
- Objective mismatch: For probabilistic modeling, maximize log-likelihood (or minimize negative log-likelihood), not just MSE. While related under Gaussian assumptions, they diverge when variance matters or distributions are non-Gaussian.
- Poor handling of initial conditions: For the first p steps, define a consistent strategy (discard, pad with start tokens, or condition on observed history). In text n-grams, forgetting to include start/end tokens biases probabilities at sequence boundaries.
- No smoothing for discrete models: Zero counts imply zero probabilities, making the log-likelihood undefined and sampling brittle. Apply Laplace/additive or Kneser–Ney smoothing as appropriate.
- Evaluation confusion: Report average log-likelihood or perplexity for discrete models and out-of-sample forecast errors (and residual diagnostics) for time series. Do not evaluate on the training horizon beyond observed context without proper rolling-origin validation.
Key Formulas
Chain Rule Factorization
p(x_1, x_2, ..., x_n) = ∏_{i=1}^{n} p(x_i | x_{<i})
Explanation: The joint probability of an ordered sequence equals the product of conditionals that only depend on prior elements. This is exact and underlies all autoregressive models.
Log-Likelihood Decomposition
log p(x_1, ..., x_n) = Σ_{i=1}^{n} log p(x_i | x_{<i})
Explanation: Taking logs turns the product into a sum, making optimization and analysis easier. Training maximizes this sum over the dataset.
Gaussian AR(p) Model
x_t = c + Σ_{k=1}^{p} φ_k x_{t-k} + ε_t,  ε_t ~ N(0, σ^2)
Explanation: The current value is a linear combination of the last p values plus Gaussian noise. The conditional distribution is normal with mean c + Σ_{k=1}^{p} φ_k x_{t-k} and variance σ^2.
Gaussian Conditional Density
p(x_t | x_{t-1}, ..., x_{t-p}) = N(x_t; c + Σ_{k=1}^{p} φ_k x_{t-k}, σ^2)
Explanation: Given the past p values, x_t follows a normal distribution centered at the linear predictor with variance σ^2. This yields a tractable likelihood for estimation.
Yule–Walker Equations
γ(k) = Σ_{j=1}^{p} φ_j γ(k - j),  for k = 1, ..., p
Explanation: Under stationarity, AR(p) parameters satisfy a Toeplitz linear system built from autocovariances. Solving this system provides consistent estimates.
Categorical Probability Constraints
p(v | c) ≥ 0 for all v ∈ V,  and  Σ_{v ∈ V} p(v | c) = 1
Explanation: Discrete next-token distributions must be valid probabilities that sum to one and are non-negative. Any parameterization must respect these constraints.
Laplace (Additive) Smoothing
p(w | c) = (count(c, w) + α) / (count(c) + α V)
Explanation: Adds a constant α to each count to avoid zero probabilities and improve generalization. V is the vocabulary size and c is the context.
Perplexity
PPL = exp( -(1/T) Σ_{i=1}^{T} log p(x_i | x_{<i}) )
Explanation: Perplexity is the exponentiated average negative log-likelihood. Lower perplexity indicates better predictive performance on discrete sequences.
Akaike Information Criterion
AIC = 2k - 2 log L̂
Explanation: A model selection score balancing fit (the maximized log-likelihood log L̂) and complexity (the number of parameters k). Lower AIC suggests a better trade-off.
AR(p) Stationarity Condition
All roots z of 1 - φ_1 z - φ_2 z^2 - ... - φ_p z^p = 0 satisfy |z| > 1
Explanation: For stability, all roots of the characteristic polynomial must lie outside the unit circle. This prevents explosive behavior in forecasts.
Complexity Analysis
- Gaussian AR(p) via least squares: forming the normal equations X^T X costs O(n p^2) time and O(p^2) memory; solving the (p+1)-by-(p+1) system by Gaussian elimination costs O(p^3). Each one-step forecast is O(p).
- Character n-gram with context length p: training touches each character once with O(1) average hash-map lookups plus O(p) to shift the context string, for O(n p) total; memory is O(C · V) for C distinct contexts and vocabulary size V. In the implementation below, evaluating one position is O(V) (it re-sums the context's counts), and sampling one character is O(V^2) because each of the V candidates repeats that sum; caching per-context totals would reduce sampling to O(V).
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Solve A x = b with Gaussian elimination and partial pivoting
vector<double> solve_linear_system(vector<vector<double>> A, vector<double> b) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) A[i].push_back(b[i]); // augmented matrix
    // Forward elimination with partial pivoting
    for (int col = 0; col < n; ++col) {
        int pivot = col;
        for (int r = col + 1; r < n; ++r)
            if (fabs(A[r][col]) > fabs(A[pivot][col])) pivot = r;
        if (fabs(A[pivot][col]) < 1e-12) throw runtime_error("Singular matrix in solve_linear_system");
        if (pivot != col) swap(A[pivot], A[col]);
        double div = A[col][col];
        for (int c = col; c <= n; ++c) A[col][c] /= div; // normalize pivot row
        for (int r = 0; r < n; ++r) {
            if (r == col) continue;
            double factor = A[r][col];
            if (factor == 0.0) continue;
            for (int c = col; c <= n; ++c) A[r][c] -= factor * A[col][c];
        }
    }
    vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = A[i][n];
    return x;
}

struct ARModel {
    int p;              // order
    double c;           // intercept
    vector<double> phi; // coefficients of lags 1..p
    double sigma2;      // noise variance estimate
};

// Fit AR(p) with intercept using normal equations (least squares)
ARModel fit_ar_least_squares(const vector<double>& x, int p) {
    int n = (int)x.size();
    if (n <= p) throw runtime_error("Not enough data to fit AR(p)");
    int m = n - p;   // number of rows in design matrix
    int d = p + 1;   // [intercept, x_{t-1}, ..., x_{t-p}]

    // Compute G = X^T X (d x d) and h = X^T y (d)
    vector<vector<double>> G(d, vector<double>(d, 0.0));
    vector<double> h(d, 0.0);

    auto get_row = [&](int t) { // t from p..n-1 maps to row index t-p
        vector<double> row(d);
        row[0] = 1.0; // intercept
        for (int k = 1; k <= p; ++k) row[k] = x[t - k];
        return row;
    };

    for (int t = p; t < n; ++t) {
        vector<double> r = get_row(t);
        double y = x[t];
        for (int i = 0; i < d; ++i) {
            h[i] += r[i] * y;
            for (int j = 0; j < d; ++j) G[i][j] += r[i] * r[j];
        }
    }

    // Solve for beta = [c, phi_1..phi_p]
    vector<double> beta = solve_linear_system(G, h);

    ARModel model;
    model.p = p;
    model.c = beta[0];
    model.phi.assign(p, 0.0);
    for (int k = 0; k < p; ++k) model.phi[k] = beta[k + 1];

    // Estimate sigma^2 from residuals
    double sse = 0.0;
    for (int t = p; t < n; ++t) {
        double pred = model.c;
        for (int k = 1; k <= p; ++k) pred += model.phi[k - 1] * x[t - k];
        double e = x[t] - pred;
        sse += e * e;
    }
    model.sigma2 = sse / m; // MLE for Gaussian noise (using m observations)
    return model;
}

// One-step prediction given last p values (in order: x_{t-1},...,x_{t-p})
double predict_next(const ARModel& model, const deque<double>& last_p) {
    if ((int)last_p.size() != model.p) throw runtime_error("last_p must have size p");
    double pred = model.c;
    for (int k = 0; k < model.p; ++k) pred += model.phi[k] * last_p[k];
    return pred;
}

// k-step ahead forecast by iteratively feeding predictions back as inputs
vector<double> forecast_k(const ARModel& model, const vector<double>& history, int k) {
    if ((int)history.size() < model.p) throw runtime_error("Not enough history for forecasting");
    deque<double> ctx; // x_{t-1},...,x_{t-p}
    for (int i = 0; i < model.p; ++i) ctx.push_back(history[history.size() - 1 - i]);
    vector<double> out;
    for (int step = 0; step < k; ++step) {
        double y = predict_next(model, ctx);
        out.push_back(y);
        // Update context: push the new prediction at the front, drop the oldest
        ctx.push_front(y);
        if ((int)ctx.size() > model.p) ctx.pop_back();
    }
    return out;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate synthetic data from a stationary AR(2) process:
    // x_t = 0.5 + 1.2 x_{t-1} - 0.5 x_{t-2} + eps_t
    int n = 300;
    vector<double> x(n, 0.0);
    mt19937 rng(42);
    normal_distribution<double> noise(0.0, 0.5);
    x[0] = 0.0; x[1] = 0.3;
    for (int t = 2; t < n; ++t) {
        x[t] = 0.5 + 1.2 * x[t - 1] - 0.5 * x[t - 2] + noise(rng);
    }

    int p = 2;
    ARModel model = fit_ar_least_squares(x, p);

    cout << fixed << setprecision(4);
    cout << "Fitted AR(" << p << ") model:\n";
    cout << "  c = " << model.c << "\n";
    for (int k = 0; k < p; ++k) cout << "  phi_" << (k + 1) << " = " << model.phi[k] << "\n";
    cout << "  sigma^2 = " << model.sigma2 << "\n\n";

    // Compute Gaussian log-likelihood on training data
    double ll = 0.0;
    int m = n - p;
    double two_pi = acos(-1) * 2.0;
    for (int t = p; t < n; ++t) {
        double mu = model.c;
        for (int k2 = 1; k2 <= p; ++k2) mu += model.phi[k2 - 1] * x[t - k2];
        double e = x[t] - mu;
        ll += -0.5 * (log(two_pi * model.sigma2) + (e * e) / model.sigma2);
    }
    cout << "Training log-likelihood: " << ll << " (average per-step = " << (ll / m) << ")\n\n";

    // Forecast 5 steps ahead
    vector<double> fut = forecast_k(model, x, 5);
    cout << "5-step ahead forecasts:\n";
    for (double v : fut) cout << v << ' ';
    cout << "\n";
    return 0;
}
We fit a Gaussian AR(p) model with an intercept by forming the normal equations (X^T X)β = X^T y and solving them via Gaussian elimination with partial pivoting. The code estimates the coefficients, the residual variance σ^2, computes the Gaussian log-likelihood, and performs k-step forecasting by iteratively feeding predictions back as inputs. This demonstrates estimation, evaluation, and generation for linear autoregressive time series.
#include <bits/stdc++.h>
using namespace std;

struct NGram {
    int p;                                            // context length
    double alpha;                                     // Laplace smoothing constant
    vector<char> idx2ch;                              // vocabulary mapping index -> char
    unordered_map<char, int> ch2idx;                  // char -> index
    unordered_map<string, vector<long long>> counts;  // context -> counts over V
    vector<long long> unigram;                        // counts for backoff / OOV safety

    NGram(int p_, double alpha_) : p(p_), alpha(alpha_) {}

    // Build vocabulary from text (plus start '^' and end '$')
    void build_vocab(const string& text) {
        unordered_set<char> S(text.begin(), text.end());
        S.insert('^'); S.insert('$');
        idx2ch.assign(S.begin(), S.end());
        sort(idx2ch.begin(), idx2ch.end());
        ch2idx.clear();
        for (int i = 0; i < (int)idx2ch.size(); ++i) ch2idx[idx2ch[i]] = i;
        unigram.assign(idx2ch.size(), 0);
    }

    int V() const { return (int)idx2ch.size(); }

    // Train counts from one or more lines (each treated as a sequence ending with '$')
    void train(const vector<string>& lines) {
        // Ensure vocabulary is built before training
        if (idx2ch.empty()) {
            string all;
            for (auto& s : lines) all += s;
            build_vocab(all);
        }
        counts.clear();
        fill(unigram.begin(), unigram.end(), 0);
        for (const string& raw : lines) {
            // Build padded context with p start tokens
            string context(p, '^');
            for (char c : raw) {
                if (!ch2idx.count(c)) continue; // skip unknowns (shouldn't happen after build_vocab)
                int v = ch2idx[c];
                unigram[v]++;
                auto& vec = counts[context];
                if (vec.empty()) vec.assign(V(), 0);
                vec[v]++;
                // Advance the context window by one character
                context.erase(context.begin());
                context.push_back(c);
            }
            // Add end token '$'
            int v_end = ch2idx['$'];
            unigram[v_end]++;
            auto& vec = counts[context];
            if (vec.empty()) vec.assign(V(), 0);
            vec[v_end]++;
        }
    }

    // Smoothed probability p(next_char | context)
    double prob(const string& context, char nextc) const {
        auto it = counts.find(context);
        int v = ch2idx.at(nextc);
        if (it != counts.end()) {
            const auto& vec = it->second;
            long long sum = 0;
            for (long long cnt : vec) sum += cnt;
            return (vec[v] + alpha) / (sum + alpha * V());
        }
        // Unseen context: back off to unigram counts with Laplace smoothing
        long long sum = 0;
        for (auto cnt : unigram) sum += cnt;
        return (unigram[v] + alpha) / (sum + alpha * V());
    }

    // Sample next char from p(. | context)
    char sample_next(const string& context, mt19937& rng) const {
        vector<double> probs; probs.reserve(V());
        double sum = 0.0;
        for (char c : idx2ch) {
            double pr = prob(context, c);
            probs.push_back(pr);
            sum += pr;
        }
        // Normalize (should already be ~1; guard against rounding error)
        for (double& v : probs) v /= sum;
        uniform_real_distribution<double> unif(0.0, 1.0);
        double r = unif(rng);
        double acc = 0.0;
        for (int i = 0; i < (int)idx2ch.size(); ++i) {
            acc += probs[i];
            if (r <= acc) return idx2ch[i];
        }
        return idx2ch.back(); // fallback for rounding at the boundary
    }

    // Compute log-likelihood and perplexity of a line
    pair<double, double> evaluate(const string& raw) const {
        string context(p, '^');
        double ll = 0.0;
        int T = 0;
        for (char ch : raw) {
            double pr = prob(context, ch);
            ll += log(max(pr, 1e-15));
            ++T;
            context.erase(context.begin());
            context.push_back(ch);
        }
        // Include the end token
        double pr_end = prob(context, '$');
        ll += log(max(pr_end, 1e-15));
        ++T;
        double ppl = exp(-(ll / T));
        return {ll, ppl};
    }

    // Generate a sequence up to max_len characters or until '$'
    string generate(int max_len, mt19937& rng) const {
        string context(p, '^');
        string out;
        for (int t = 0; t < max_len; ++t) {
            char c = sample_next(context, rng);
            if (c == '$') break;
            out.push_back(c);
            context.erase(context.begin());
            context.push_back(c);
        }
        return out;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Small corpus: each string is a separate training sequence
    vector<string> corpus = {
        "hello world",
        "hello there",
        "hi world",
        "hey you",
        "hola mundo"
    };

    int p = 3;          // context of 3 previous characters (i.e., a 4-gram model)
    double alpha = 0.5; // Laplace smoothing strength
    NGram lm(p, alpha);

    // Build vocabulary and train
    string all;
    for (auto& s : corpus) all += s;
    lm.build_vocab(all);
    lm.train(corpus);

    // Evaluate on a held-out example
    string test = "hello you";
    auto [ll, ppl] = lm.evaluate(test);
    cout << fixed << setprecision(4);
    cout << "Test: '" << test << "'\n";
    cout << "  Log-likelihood = " << ll << "\n";
    cout << "  Perplexity     = " << ppl << "\n\n";

    // Generate samples
    mt19937 rng(123);
    for (int i = 0; i < 3; ++i) {
        string s = lm.generate(30, rng);
        cout << "Sample " << (i + 1) << ": " << s << "\n";
    }

    // Show a conditional probability example
    string ctx = string(p, '^'); // start-of-sequence context
    cout << "\nNext-char probabilities from start context (top 5 by prob):\n";
    vector<pair<double, char>> probs;
    for (char c : lm.idx2ch) {
        if (c == '^') continue; // skip start token in the display
        probs.push_back({lm.prob(ctx, c), c});
    }
    sort(probs.begin(), probs.end(), greater<>());
    for (int i = 0; i < (int)min<size_t>(5, probs.size()); ++i) {
        cout << "  '" << probs[i].second << "' : " << probs[i].first << "\n";
    }
    return 0;
}
This example trains a character-level n-gram autoregressive model (context length p) with Laplace smoothing. It constructs p-length contexts, accumulates counts, converts them into smoothed conditional probabilities, evaluates log-likelihood and perplexity on a test string, and generates new samples by iteratively sampling the next character from p(x_i | x_{i-p}, ..., x_{i-1}). It demonstrates the autoregressive factorization for discrete sequences and how to avoid zero probabilities via smoothing.