How I Study AI · Theory · Intermediate

Autoregressive Models

Key Points

  • Autoregressive (AR) models represent a joint distribution by multiplying conditional probabilities in a fixed order, using the chain rule of probability.
  • In time series, an AR(p) model predicts the current value as a weighted sum of the last p values plus noise.
  • For discrete sequences (like text), n-gram models approximate p(x_i | x_{<i}) with a fixed-size context window and smoothing.
  • Training typically maximizes the log-likelihood, which becomes a sum of conditional log-probabilities across positions.
  • Gaussian AR(p) models can be fit with ordinary least squares, making them fast and interpretable.
  • Sequence generation is done step by step: sample or predict the next item from p(x_i | x_{<i}), append it, and repeat.
  • Choosing the right order p (or context size) balances bias and variance; tools like the PACF and AIC help.
  • Always preserve causality: never use future information to predict the past (avoid data leakage).

Prerequisites

  • Basic probability and conditional probability — Autoregression relies on the chain rule and conditional distributions.
  • Linear algebra — Fitting Gaussian AR(p) via least squares requires solving linear systems.
  • Time series stationarity and autocovariance — Model selection and stability constraints for AR(p) depend on stationarity assumptions.
  • Hash maps and counting — Efficient n-gram models store context counts in dictionaries for O(1) average lookups.
  • Maximum likelihood and log-likelihood — The training objective for autoregressive models is typically likelihood-based.

Detailed Explanation


01 Overview

Autoregressive models describe complex sequences by breaking them into simpler, step-by-step predictions. Instead of modeling a high-dimensional joint probability p(x_1, x_2, ..., x_n) directly, they factor it into a product of conditional probabilities p(x_i | x_{<i}). This is a principled application of the chain rule of probability and underpins many practical models in time series forecasting, natural language processing, speech, and image modeling. In time series, the classic AR(p) model assumes that the current value is a linear combination of the last p values plus random noise. In language modeling, n-gram models estimate the probability of the next token given a fixed-length context of previous tokens, often with smoothing to handle rare or unseen events.

Autoregression is powerful because it turns a hard global problem into many local ones: learn how to predict the next step from the past. Learning usually proceeds by maximizing the log-likelihood, which decomposes cleanly into a sum over positions. Generation is straightforward too: start from an initial context and repeatedly draw the next value from the learned conditional distribution. While modern deep networks also use autoregressive ideas (e.g., causal transformers), the core concept remains the same: carefully respect the order of information, and progressively build the sequence one step at a time.

02 Intuition & Analogies

Imagine telling a story one sentence at a time. Each sentence you write depends on what you’ve already said, not on the future. That is exactly what autoregression does: it treats a sequence like a story you create left to right. The next word (or value) is chosen based on the words (or values) already written, never peeking ahead. If you’re forecasting weather, today’s temperature depends on recent days; if you’re composing text, the next character depends on the last few characters. This step-by-step dependency is the heart of autoregression.

A cooking analogy helps too. When making pancakes, each step depends on what you’ve mixed so far—if you’ve already added eggs and milk, the next step might be flour. You don’t decide the third step based on a step you haven’t taken yet. Autoregression formalizes this idea: given your current context (what’s already in the bowl), it tells you the probability of each possible next ingredient.

Another way to think about it: predicting the next item is easier than modeling everything at once. If you tried to specify every possible complete text or time series, the space is enormous. But learning to guess “what comes next” from a short context is manageable and data efficient. This is why simple linear AR models often work surprisingly well in time series, and why n-gram models capture a lot of structure in language despite their simplicity. The trade-off is context length: too short and you miss important patterns; too long and you overfit or become computationally heavy.

03 Formal Definition

Given an ordered sequence x_{1:n} = (x_1, x_2, …, x_n), the chain rule of probability yields the exact factorization

p(x_{1:n}) = ∏_{i=1}^{n} p(x_i | x_{1:i−1}).

An autoregressive model chooses and parameterizes each conditional p(x_i | x_{1:i−1}) so that it depends only on past values. A common approximation is the order-p Markov property: p(x_i | x_{1:i−1}) = p(x_i | x_{i−p:i−1}), meaning only the last p items matter. For continuous time series, the Gaussian AR(p) model specifies

x_t = c + ∑_{k=1}^{p} φ_k x_{t−k} + ε_t,   ε_t ∼ N(0, σ²),

so that the conditional distribution is normal with mean c + ∑_k φ_k x_{t−k} and variance σ². For discrete sequences over a vocabulary of size V, an n-gram model with context size p estimates

p(x_i | x_{i−p:i−1}) = Categorical(π(x_{i−p:i−1})),

where π is a probability vector over the V tokens, typically computed from counts with smoothing (e.g., a Laplace/Dirichlet prior). Training usually maximizes the log-likelihood ∑_i log p(x_i | x_{<i}), and generation proceeds by sampling sequentially from these conditionals, respecting causality.

04 When to Use

Use autoregressive models when data are naturally ordered and the order conveys information: time series (finance, sensors, traffic), language (characters, words), audio waveforms, or raster-scanned images. Choose linear Gaussian AR(p) when relationships are approximately linear, noise is roughly Gaussian, and interpretability and speed matter. AR models are excellent baselines for forecasting, anomaly detection (via residuals), and control.

Use n-gram autoregressive models for small- to medium-sized text corpora, quick baselines, or when you need simple, explainable next-token probabilities. They are also useful pre-processing tools (e.g., for backoff language models) and as sanity checks before deploying complex neural models.

Autoregression is also a training principle for modern deep models (e.g., causal convolutions, RNNs, transformers) where you need to: (1) prevent information leakage from the future; (2) define a tractable likelihood; and (3) enable straightforward sampling. When you need calibrated probabilities, likelihood-based AR models are appealing. If long-range dependencies dominate and the data are plentiful, consider higher-order models or neural autoregressive architectures while preserving the autoregressive factorization.

⚠️ Common Mistakes

• Data leakage: Using future information to predict the present (e.g., using x_{t+1} as a feature for predicting x_t) breaks causality and leads to overly optimistic performance. Always ensure features are strictly from the past at prediction time.
• Mis-specified order p: Too small a p underfits (misses dependencies); too large a p overfits and inflates variance. Use diagnostics like the PACF, information criteria (AIC/BIC), and cross-validation to select p.
• Instability in AR(p): Coefficients must satisfy stationarity (characteristic-polynomial roots outside the unit circle). Ignoring this can cause exploding forecasts. Check or constrain parameters during estimation.
• Objective mismatch: For probabilistic modeling, maximize log-likelihood (or minimize negative log-likelihood), not just MSE. The two are related under Gaussian assumptions, but they diverge when variance matters or distributions are non-Gaussian.
• Poor handling of initial conditions: For the first p steps, define a consistent strategy (discard, pad with start tokens, or condition on observed history). In text n-grams, forgetting start/end tokens biases probabilities at sequence boundaries.
• No smoothing for discrete models: Zero counts imply zero probabilities, making the log-likelihood undefined and sampling brittle. Apply Laplace/additive or Kneser–Ney smoothing as appropriate.
• Evaluation confusion: Report average log-likelihood or perplexity for discrete models, and out-of-sample forecast errors (plus residual diagnostics) for time series. Do not evaluate beyond the observed context without proper rolling-origin validation.

Key Formulas

Chain Rule Factorization

p(x_{1:n}) = ∏_{i=1}^{n} p(x_i | x_{1:i−1})

Explanation: The joint probability of an ordered sequence equals the product of conditionals that only depend on prior elements. This is exact and underlies all autoregressive models.

Log-Likelihood Decomposition

log p(x_{1:n}) = ∑_{i=1}^{n} log p(x_i | x_{<i})

Explanation: Taking logs turns the product into a sum, making optimization and analysis easier. Training maximizes this sum over the dataset.

Gaussian AR(p) Model

x_t = c + ∑_{k=1}^{p} φ_k x_{t−k} + ε_t,   ε_t ∼ N(0, σ²)

Explanation: The current value is a linear combination of the last p values plus Gaussian noise. The conditional distribution is normal with mean c + ∑_k φ_k x_{t−k} and variance σ².

Gaussian Conditional Density

p(x_t | x_{t−1:t−p}) = (1/√(2πσ²)) · exp(−(x_t − c − ∑_{k=1}^{p} φ_k x_{t−k})² / (2σ²))

Explanation: Given the past p values, x_t follows a normal distribution centered at the linear predictor with variance σ². This yields a tractable likelihood for estimation.

Yule–Walker Equations

R φ = r,  where R_{ij} = γ_{|i−j|} and r_i = γ_i

Explanation: Under stationarity, AR(p) parameters satisfy a Toeplitz linear system built from autocovariances. Solving this system provides consistent estimates.

Categorical Probability Constraints

∑_{v=1}^{V} π_v = 1,   π_v ≥ 0

Explanation: Discrete next-token distributions must be valid probabilities that sum to one and are non-negative. Any parameterization must respect these constraints.

Laplace (Additive) Smoothing

p_α(x_i | c) = (count(c, x_i) + α) / (count(c) + αV)

Explanation: Adds a constant α to each count to avoid zero probabilities and improve generalization. V is the vocabulary size and c is the context.

Perplexity

PP = exp(−(1/n) ∑_{i=1}^{n} log p(x_i | x_{<i}))

Explanation: Perplexity is the exponentiated average negative log-likelihood. Lower perplexity indicates better predictive performance on discrete sequences.

Akaike Information Criterion

AIC = 2k − 2 log L̂

Explanation: A model selection score balancing fit (maximized log-likelihood log L̂) and complexity (number of parameters k). Lower AIC suggests a better trade-off.

AR(p) Stationarity Condition

|z_j| > 1 for all roots z_j of 1 − ∑_{k=1}^{p} φ_k z^k = 0

Explanation: For stability, all roots of the characteristic polynomial must lie outside the unit circle. This prevents explosive behavior in forecasts.

Complexity Analysis

For a Gaussian AR(p) model fit by ordinary least squares with an intercept, we build the normal equations (XᵀX)β = Xᵀy, where X is an (n − p) × (p + 1) design matrix and y holds the targets. Forming XᵀX naively costs O((n − p)(p + 1)²) time and O((p + 1)²) space. Solving the (p + 1) × (p + 1) system by Gaussian elimination with partial pivoting costs O(p³) time and O(p²) space. Since p is typically small, the overall complexity is dominated by O(n p²). Forecasting k steps ahead takes O(k p) time and O(p) space, since each step needs only the last p values.

For an n-gram autoregressive model over a vocabulary of size V and context length p, single-pass counting over a corpus of length N runs in O(N) time. Memory depends on the number of distinct contexts; in the worst case it is O(min(N, Vᵖ) · V) if we store a full V-length distribution per observed context. Computing a smoothed probability for a single next token is O(1) given precomputed counts, while sampling from the full distribution is O(V) due to normalization and cumulative sampling. Generating a sequence of length L thus costs O(L V) time and O(p + V) space per step, assuming hash-map lookups for contexts are O(1) on average.

In practice, choosing a moderate p keeps both AR(p) and n-gram models lightweight: the AR(p) fit scales linearly with data length and cubically with (small) p, and n-grams scale linearly in corpus size with a memory–accuracy trade-off controlled by p and smoothing.

Code Examples

Gaussian AR(p) time series: fit by least squares and multi-step forecasting
#include <bits/stdc++.h>
using namespace std;

// Solve A x = b with Gaussian elimination and partial pivoting
vector<double> solve_linear_system(vector<vector<double>> A, vector<double> b) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) A[i].push_back(b[i]);
    // Forward elimination with partial pivoting
    for (int col = 0; col < n; ++col) {
        int pivot = col;
        for (int r = col + 1; r < n; ++r)
            if (fabs(A[r][col]) > fabs(A[pivot][col])) pivot = r;
        if (fabs(A[pivot][col]) < 1e-12) throw runtime_error("Singular matrix in solve_linear_system");
        if (pivot != col) swap(A[pivot], A[col]);
        double div = A[col][col];
        for (int c = col; c <= n; ++c) A[col][c] /= div;  // normalize pivot row
        for (int r = 0; r < n; ++r) if (r != col) {
            double factor = A[r][col];
            if (factor == 0.0) continue;
            for (int c = col; c <= n; ++c) A[r][c] -= factor * A[col][c];
        }
    }
    vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = A[i][n];
    return x;
}

struct ARModel {
    int p;               // order
    double c;            // intercept
    vector<double> phi;  // coefficients of lags 1..p
    double sigma2;       // noise variance estimate
};

// Fit AR(p) with intercept using normal equations (least squares)
ARModel fit_ar_least_squares(const vector<double>& x, int p) {
    int n = (int)x.size();
    if (n <= p) throw runtime_error("Not enough data to fit AR(p)");
    int m = n - p;  // number of rows in design matrix
    int d = p + 1;  // [intercept, x_{t-1}, ..., x_{t-p}]

    // Compute G = X^T X (d x d) and h = X^T y (d)
    vector<vector<double>> G(d, vector<double>(d, 0.0));
    vector<double> h(d, 0.0);

    auto get_row = [&](int t) {  // t from p..n-1 maps to row index t-p
        vector<double> row(d);
        row[0] = 1.0;  // intercept
        for (int k = 1; k <= p; ++k) row[k] = x[t - k];
        return row;
    };

    for (int t = p; t < n; ++t) {
        vector<double> r = get_row(t);
        double y = x[t];
        for (int i = 0; i < d; ++i) {
            h[i] += r[i] * y;
            for (int j = 0; j < d; ++j) G[i][j] += r[i] * r[j];
        }
    }

    // Solve for beta = [c, phi_1..phi_p]
    vector<double> beta = solve_linear_system(G, h);

    ARModel model;
    model.p = p;
    model.c = beta[0];
    model.phi.assign(p, 0.0);
    for (int k = 0; k < p; ++k) model.phi[k] = beta[k + 1];

    // Estimate sigma^2 from residuals
    double sse = 0.0;
    for (int t = p; t < n; ++t) {
        double pred = model.c;
        for (int k = 1; k <= p; ++k) pred += model.phi[k - 1] * x[t - k];
        double e = x[t] - pred;
        sse += e * e;
    }
    model.sigma2 = sse / m;  // MLE for Gaussian noise (using m observations)
    return model;
}

// One-step prediction given last p values (in order: x_{t-1},...,x_{t-p})
double predict_next(const ARModel& model, const deque<double>& last_p) {
    if ((int)last_p.size() != model.p) throw runtime_error("last_p must have size p");
    double pred = model.c;
    for (int k = 0; k < model.p; ++k) pred += model.phi[k] * last_p[k];
    return pred;
}

// k-step ahead forecast by iteratively feeding predictions back in
vector<double> forecast_k(const ARModel& model, const vector<double>& history, int k) {
    if ((int)history.size() < model.p) throw runtime_error("Not enough history for forecasting");
    deque<double> ctx;  // x_{t-1},...,x_{t-p}
    for (int i = 0; i < model.p; ++i) ctx.push_back(history[history.size() - 1 - i]);
    vector<double> out;
    for (int step = 0; step < k; ++step) {
        double y = predict_next(model, ctx);
        out.push_back(y);
        // Update context: push new y at the front, drop the oldest value
        ctx.push_front(y);
        if ((int)ctx.size() > model.p) ctx.pop_back();
    }
    return out;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate synthetic AR(2) data with Gaussian noise
    int n = 300;
    vector<double> x(n, 0.0);
    mt19937 rng(42);
    normal_distribution<double> noise(0.0, 0.5);
    // True process: x_t = 0.5 + 1.2 x_{t-1} - 0.5 x_{t-2} + eps_t
    x[0] = 0.0; x[1] = 0.3;
    for (int t = 2; t < n; ++t) {
        x[t] = 0.5 + 1.2 * x[t - 1] - 0.5 * x[t - 2] + noise(rng);
    }

    int p = 2;
    ARModel model = fit_ar_least_squares(x, p);

    cout << fixed << setprecision(4);
    cout << "Fitted AR(" << p << ") model:\n";
    cout << "  c = " << model.c << "\n";
    for (int k = 0; k < p; ++k) cout << "  phi_" << (k + 1) << " = " << model.phi[k] << "\n";
    cout << "  sigma^2 = " << model.sigma2 << "\n\n";

    // Compute Gaussian log-likelihood on training data
    double ll = 0.0;
    int m = n - p;
    double two_pi = acos(-1) * 2.0;
    for (int t = p; t < n; ++t) {
        double mu = model.c;
        for (int k2 = 1; k2 <= p; ++k2) mu += model.phi[k2 - 1] * x[t - k2];
        double e = x[t] - mu;
        ll += -0.5 * (log(two_pi * model.sigma2) + (e * e) / model.sigma2);
    }
    cout << "Training log-likelihood: " << ll << " (average per-step = " << (ll / m) << ")\n\n";

    // Forecast 5 steps ahead
    vector<double> fut = forecast_k(model, x, 5);
    cout << "5-step ahead forecasts:\n";
    for (double v : fut) cout << v << ' ';
    cout << "\n";
    return 0;
}

We fit a Gaussian AR(p) model with an intercept by forming normal equations (X^T X)β = X^T y and solving them via Gaussian elimination with partial pivoting. The code estimates coefficients, residual variance σ^2, computes the Gaussian log-likelihood, and performs k-step forecasting by iteratively feeding predictions back as inputs. This demonstrates estimation, evaluation, and generation for linear autoregressive time series.

Time: O(n p² + p³) for fitting; O(k p) for k-step forecasting
Space: O(p²) for fitting; O(p) during forecasting
Character-level n-gram autoregressive model with Laplace smoothing and sampling
#include <bits/stdc++.h>
using namespace std;

struct NGram {
    int p;                                            // context length
    double alpha;                                     // Laplace smoothing strength
    vector<char> idx2ch;                              // vocabulary mapping index -> char
    unordered_map<char, int> ch2idx;                  // char -> index
    unordered_map<string, vector<long long>> counts;  // context -> counts over V
    vector<long long> unigram;                        // counts for backoff on unseen contexts

    NGram(int p_, double alpha_) : p(p_), alpha(alpha_) {}

    // Build vocabulary from text (plus start '^' and end '$')
    void build_vocab(const string& text) {
        unordered_set<char> S(text.begin(), text.end());
        S.insert('^'); S.insert('$');
        idx2ch.assign(S.begin(), S.end());
        sort(idx2ch.begin(), idx2ch.end());
        ch2idx.clear();
        for (int i = 0; i < (int)idx2ch.size(); ++i) ch2idx[idx2ch[i]] = i;
        unigram.assign(idx2ch.size(), 0);
    }

    int V() const { return (int)idx2ch.size(); }

    // Train counts from one or more lines (each treated as a sequence ending with '$')
    void train(const vector<string>& lines) {
        // Ensure vocabulary is built before training
        if (idx2ch.empty()) {
            string all;
            for (auto& s : lines) all += s;
            build_vocab(all);
        }
        counts.clear();
        fill(unigram.begin(), unigram.end(), 0);
        for (const string& raw : lines) {
            // Build padded context with p start tokens
            string context(p, '^');
            for (char c : raw) {
                if (!ch2idx.count(c)) continue;  // skip unknowns (shouldn't happen after build_vocab)
                int v = ch2idx[c];
                unigram[v]++;
                auto& vec = counts[context];
                if (vec.empty()) vec.assign(V(), 0);
                vec[v]++;
                // advance context
                context.erase(context.begin());
                context.push_back(c);
            }
            // Add end token '$'
            int v_end = ch2idx['$'];
            unigram[v_end]++;
            auto& vec = counts[context];
            if (vec.empty()) vec.assign(V(), 0);
            vec[v_end]++;
        }
    }

    // Smoothed probability p(c_next | context)
    double prob(const string& context, char nextc) const {
        int v = ch2idx.at(nextc);
        auto it = counts.find(context);
        if (it != counts.end()) {
            const auto& vec = it->second;
            long long sum = 0;
            for (long long cnt : vec) sum += cnt;
            return (vec[v] + alpha) / (sum + alpha * V());
        }
        // Unseen context: back off to unigram counts with Laplace smoothing
        long long sum = 0;
        for (long long cnt : unigram) sum += cnt;
        return (unigram[v] + alpha) / (sum + alpha * V());
    }

    // Sample next char from p(. | context)
    char sample_next(const string& context, mt19937& rng) const {
        vector<double> probs; probs.reserve(V());
        double sum = 0.0;
        for (char c : idx2ch) {
            double pr = prob(context, c);
            probs.push_back(pr);
            sum += pr;
        }
        // Normalize (should be ~1 already, but ensure numerical stability)
        for (double& v : probs) v /= sum;
        uniform_real_distribution<double> unif(0.0, 1.0);
        double r = unif(rng);
        double acc = 0.0;
        for (int i = 0; i < (int)idx2ch.size(); ++i) {
            acc += probs[i];
            if (r <= acc) return idx2ch[i];
        }
        return idx2ch.back();  // fallback for floating-point round-off
    }

    // Compute log-likelihood and perplexity of a line
    pair<double, double> evaluate(const string& raw) const {
        string context(p, '^');
        double ll = 0.0; int T = 0;
        for (char ch : raw) {
            double pr = prob(context, ch);
            ll += log(max(pr, 1e-15));
            ++T;
            context.erase(context.begin());
            context.push_back(ch);
        }
        // Include the end token
        double pr_end = prob(context, '$');
        ll += log(max(pr_end, 1e-15));
        ++T;
        double ppl = exp(-(ll / T));
        return {ll, ppl};
    }

    // Generate a sequence up to max_len or until '$'
    string generate(int max_len, mt19937& rng) const {
        string context(p, '^');
        string out;
        for (int t = 0; t < max_len; ++t) {
            char c = sample_next(context, rng);
            if (c == '$') break;
            out.push_back(c);
            context.erase(context.begin());
            context.push_back(c);
        }
        return out;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Small corpus: each string is a separate training sequence
    vector<string> corpus = {
        "hello world",
        "hello there",
        "hi world",
        "hey you",
        "hola mundo"
    };

    int p = 3;           // context of 3 previous characters
    double alpha = 0.5;  // Laplace smoothing strength
    NGram lm(p, alpha);

    // Build vocabulary and train
    string all; for (auto& s : corpus) all += s;
    lm.build_vocab(all);
    lm.train(corpus);

    // Evaluate on a held-out example
    string test = "hello you";
    auto [ll, ppl] = lm.evaluate(test);
    cout << fixed << setprecision(4);
    cout << "Test: '" << test << "'\n";
    cout << "  Log-likelihood = " << ll << "\n";
    cout << "  Perplexity = " << ppl << "\n\n";

    // Generate samples
    mt19937 rng(123);
    for (int i = 0; i < 3; ++i) {
        string s = lm.generate(30, rng);
        cout << "Sample " << (i + 1) << ": " << s << "\n";
    }

    // Show a conditional probability example
    string ctx = string(p, '^');  // start-of-sequence context
    cout << "\nNext-char probabilities from start context (top 5 by prob):\n";
    vector<pair<double, char>> probs;
    for (char c : lm.idx2ch) {
        if (c == '^') continue;  // skip start token display
        probs.push_back({ lm.prob(ctx, c), c });
    }
    sort(probs.begin(), probs.end(), greater<>());
    for (int i = 0; i < (int)min<size_t>(5, probs.size()); ++i) {
        cout << "  '" << probs[i].second << "' : " << probs[i].first << "\n";
    }
    return 0;
}

This example trains a character-level p-gram (n-gram) autoregressive model with Laplace smoothing. It constructs p-length contexts, accumulates counts, converts them into smoothed conditional probabilities, evaluates log-likelihood and perplexity on a test string, and generates new samples by iteratively sampling the next character from p(x_i | x_{i-p:i-1}). It demonstrates the autoregressive factorization for discrete sequences and how to avoid zero probabilities via smoothing.

Time: O(N) to count over corpus length N; O(V) per sampling step due to normalization
Space: O(U · V), where U is the number of distinct contexts observed; O(V) for vocabulary storage
#autoregressive#ar model#n-gram#chain rule#time series#language modeling#maximum likelihood#laplace smoothing#perplexity#stationarity#yule-walker#forecasting#causal modeling#markov order