How I Study AI · Theory · Intermediate

Autoregressive Models

Key Points

  • Autoregressive (AR) models represent a joint distribution by multiplying conditional probabilities in a fixed order, using the chain rule of probability.
  • In time series, an AR(p) model predicts the current value as a weighted sum of the last p values plus noise.
  • For discrete sequences (like text), n-gram models approximate p(x_i | x_{<i}) with a fixed-size context window and smoothing.
  • Training typically maximizes the log-likelihood, which becomes a sum of conditional log-probabilities across positions.
  • Gaussian AR(p) models can be fit with ordinary least squares, making them fast and interpretable.
  • Sequence generation is done step by step: sample or predict the next item from p(x_i | x_{<i}), append it, and repeat.
  • Choosing the right order p (or context size) balances bias and variance; tools like the PACF and AIC help.
  • Always preserve causality: never use future information to predict the past (avoid data leakage).

Prerequisites

  • Basic probability and conditional probability — Autoregression relies on the chain rule and conditional distributions.
  • Linear algebra — Fitting Gaussian AR(p) via least squares requires solving linear systems.
  • Time series stationarity and autocovariance — Model selection and stability constraints for AR(p) depend on stationarity assumptions.
  • Hash maps and counting — Efficient n-gram models store context counts in dictionaries for O(1) average lookups.
  • Maximum likelihood and log-likelihood — The training objective for autoregressive models is typically likelihood-based.

Detailed Explanation


01 Overview

Autoregressive models describe complex sequences by breaking them into simpler, step-by-step predictions. Instead of modeling a high-dimensional joint probability p(x_1, x_2, ..., x_n) directly, they factor it into a product of conditional probabilities p(x_i | x_{<i}). This is a principled application of the chain rule of probability and underpins many practical models in time series forecasting, natural language processing, speech, and image modeling. In time series, the classic AR(p) model assumes that the current value is a linear combination of the last p values plus random noise. In language modeling, n-gram models estimate the probability of the next token given a fixed-length context of previous tokens, often with smoothing to handle rare or unseen events.

Autoregression is powerful because it turns a hard global problem into many local ones: learn how to predict the next step from the past. Learning usually proceeds by maximizing the log-likelihood, which decomposes cleanly into a sum over positions. Generation is straightforward too: start from an initial context and repeatedly draw the next value from the learned conditional distribution. While modern deep networks also use autoregressive ideas (e.g., causal transformers), the core concept remains the same: carefully respect the order of information, and progressively build the sequence one step at a time.

02 Intuition & Analogies

Imagine telling a story one sentence at a time. Each sentence you write depends on what you’ve already said, not on the future. That is exactly what autoregression does: it treats a sequence like a story you create left to right. The next word (or value) is chosen based on the words (or values) already written, never peeking ahead. If you’re forecasting weather, today’s temperature depends on recent days; if you’re composing text, the next character depends on the last few characters. This step-by-step dependency is the heart of autoregression.

A cooking analogy helps too. When making pancakes, each step depends on what you’ve mixed so far—if you’ve already added eggs and milk, the next step might be flour. You don’t decide the third step based on a step you haven’t taken yet. Autoregression formalizes this idea: given your current context (what’s already in the bowl), it tells you the probability of each possible next ingredient.

Another way to think about it: predicting the next item is easier than modeling everything at once. If you tried to specify every possible complete text or time series, the space is enormous. But learning to guess “what comes next” from a short context is manageable and data efficient. This is why simple linear AR models often work surprisingly well in time series, and why n-gram models capture a lot of structure in language despite their simplicity. The trade-off is context length: too short and you miss important patterns; too long and you overfit or become computationally heavy.

03 Formal Definition

Given an ordered sequence x_{1:n} = (x_1, x_2, …, x_n), the chain rule of probability yields the exact factorization

p(x_{1:n}) = ∏_{i=1}^{n} p(x_i | x_{1:i−1}).

An autoregressive model chooses and parameterizes each conditional p(x_i | x_{1:i−1}) so that it depends only on past values. A common approximation is the order-p Markov property: p(x_i | x_{1:i−1}) = p(x_i | x_{i−p:i−1}), meaning only the last p items matter. For continuous time series, the Gaussian AR(p) model specifies

x_t = c + ∑_{k=1}^{p} φ_k x_{t−k} + ε_t,   ε_t ∼ N(0, σ²),

so that the conditional distribution is normal with mean c + ∑_k φ_k x_{t−k} and variance σ². For discrete sequences over a vocabulary of size V, an n-gram model with context size p estimates

p(x_i | x_{i−p:i−1}) = Categorical(π(x_{i−p:i−1})),

where π is a probability vector over the V tokens, typically computed from counts with smoothing (e.g., a Laplace/Dirichlet prior). Training usually maximizes the log-likelihood ∑_i log p(x_i | x_{<i}), and generation proceeds by sampling sequentially from these conditionals, respecting causality.

04 When to Use

Use autoregressive models when data are naturally ordered and the order conveys information: time series (finance, sensors, traffic), language (characters, words), audio waveforms, or raster-scanned images. Choose linear Gaussian AR(p) when relationships are approximately linear, noise is roughly Gaussian, and interpretability and speed matter. AR models are excellent baselines for forecasting, anomaly detection (via residuals), and control.

Use n-gram autoregressive models for small- to medium-sized text corpora, quick baselines, or when you need simple, explainable next-token probabilities. They are also useful pre-processing tools (e.g., for backoff language models) and as sanity checks before deploying complex neural models.

Autoregression is also a training principle for modern deep models (e.g., causal convolutions, RNNs, transformers) where you need to: (1) prevent information leakage from the future; (2) define a tractable likelihood; and (3) enable straightforward sampling. When you need calibrated probabilities, likelihood-based AR models are appealing. If long-range dependencies dominate and the data are plentiful, consider higher-order models or neural autoregressive architectures while preserving the autoregressive factorization.

⚠️ Common Mistakes

• Data leakage: Using future information to predict the present (e.g., using x_{t+1} as a feature for predicting x_t) breaks causality and leads to overly optimistic performance. Always ensure features are strictly from the past at prediction time.
• Mis-specified order p: Too small a p underfits (misses dependencies); too large a p overfits and inflates variance. Use diagnostics like the PACF, information criteria (AIC/BIC), and cross-validation to select p.
• Instability in AR(p): Coefficients must satisfy stationarity (characteristic-polynomial roots outside the unit circle). Ignoring this can cause exploding forecasts. Check or constrain parameters during estimation.
• Objective mismatch: For probabilistic modeling, maximize log-likelihood (or minimize negative log-likelihood), not just MSE. The two are related under Gaussian assumptions, but they diverge when variance matters or distributions are non-Gaussian.
• Poor handling of initial conditions: For the first p steps, define a consistent strategy (discard, pad with start tokens, or condition on observed history). In text n-grams, forgetting start/end tokens biases probabilities at sequence boundaries.
• No smoothing for discrete models: Zero counts imply zero probabilities, making the log-likelihood undefined and sampling brittle. Apply Laplace/additive or Kneser–Ney smoothing as appropriate.
• Evaluation confusion: Report average log-likelihood or perplexity for discrete models, and out-of-sample forecast errors (plus residual diagnostics) for time series. Do not evaluate beyond the observed context without proper rolling-origin validation.

Key Formulas

Chain Rule Factorization

p(x_{1:n}) = ∏_{i=1}^{n} p(x_i | x_{1:i−1})

Explanation: The joint probability of an ordered sequence equals the product of conditionals that only depend on prior elements. This is exact and underlies all autoregressive models.

Log-Likelihood Decomposition

log p(x_{1:n}) = ∑_{i=1}^{n} log p(x_i | x_{<i})

Explanation: Taking logs turns the product into a sum, making optimization and analysis easier. Training maximizes this sum over the dataset.

Gaussian AR(p) Model

x_t = c + ∑_{k=1}^{p} φ_k x_{t−k} + ε_t,   ε_t ∼ N(0, σ²)

Explanation: The current value is a linear combination of the last p values plus Gaussian noise. The conditional distribution is normal with mean c + ∑_k φ_k x_{t−k} and variance σ².

Gaussian Conditional Density

p(x_t | x_{t−1:t−p}) = (1/√(2πσ²)) · exp(−(x_t − c − ∑_{k=1}^{p} φ_k x_{t−k})² / (2σ²))

Explanation: Given the past p values, x_t follows a normal distribution centered at the linear predictor with variance σ². This yields a tractable likelihood for estimation.

Yule–Walker Equations

R φ = r,  where R_{ij} = γ_{|i−j|} and r_i = γ_i

Explanation: Under stationarity, AR(p) parameters satisfy a Toeplitz linear system built from autocovariances. Solving this system provides consistent estimates.

Categorical Probability Constraints

∑_{v=1}^{V} π_v = 1,   π_v ≥ 0

Explanation: Discrete next-token distributions must be valid probabilities that sum to one and are non-negative. Any parameterization must respect these constraints.

Laplace (Additive) Smoothing

p_α(x_i | c) = (count(c, x_i) + α) / (count(c) + αV)

Explanation: Adds a constant α to each count to avoid zero probabilities and improve generalization. V is the vocabulary size and c is the context.

Perplexity

PP = exp(−(1/n) ∑_{i=1}^{n} log p(x_i | x_{<i}))

Explanation: Perplexity is the exponentiated average negative log-likelihood. Lower perplexity indicates better predictive performance on discrete sequences.

Akaike Information Criterion

AIC = 2k − 2 log L̂

Explanation: A model selection score balancing fit (maximized log-likelihood log L̂) and complexity (number of parameters k). Lower AIC suggests a better trade-off.

AR(p) Stationarity Condition

|z_j| > 1 for all roots z_j of 1 − ∑_{k=1}^{p} φ_k z^k = 0

Explanation: For stability, all roots of the characteristic polynomial must lie outside the unit circle. This prevents explosive behavior in forecasts.

Complexity Analysis

For a Gaussian AR(p) model fit by ordinary least squares with an intercept, we build the normal equations (XᵀX)β = Xᵀy, where X is an (n − p) × (p + 1) design matrix and y holds the targets. Forming XᵀX naively costs O((n − p)(p + 1)²) time and O((p + 1)²) space. Solving the (p + 1) × (p + 1) system by Gaussian elimination with partial pivoting costs O(p³) time and O(p²) space. Since p is typically small, the overall complexity is dominated by O(n p²). Forecasting k steps ahead takes O(k p) time and O(p) space, since each step needs only the last p values.

For an n-gram autoregressive model over a vocabulary of size V and context length p, single-pass counting over a corpus of length N runs in O(N) time. Memory depends on the number of distinct contexts; in the worst case it is O(min(N, Vᵖ) · V) if we store a full V-length distribution per observed context. Computing a smoothed probability for a single next token is O(1) given precomputed counts, while sampling from the full distribution is O(V) due to normalization and cumulative sampling. Generating a sequence of length L thus costs O(L V) time and O(p + V) space per step, assuming hash-map lookups for contexts are O(1) on average.

In practice, choosing a moderate p keeps both AR(p) and n-gram models lightweight: the AR(p) fit scales linearly with data length and cubically with (small) p, and n-grams scale linearly in corpus size with a memory–accuracy trade-off controlled by p and smoothing.

Code Examples

Gaussian AR(p) time series: fit by least squares and multi-step forecasting
#include <bits/stdc++.h>
using namespace std;

// Solve A x = b with Gaussian elimination and partial pivoting
vector<double> solve_linear_system(vector<vector<double>> A, vector<double> b) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) A[i].push_back(b[i]);
    // Forward elimination with partial pivoting
    for (int col = 0; col < n; ++col) {
        int pivot = col;
        for (int r = col + 1; r < n; ++r)
            if (fabs(A[r][col]) > fabs(A[pivot][col])) pivot = r;
        if (fabs(A[pivot][col]) < 1e-12) throw runtime_error("Singular matrix in solve_linear_system");
        if (pivot != col) swap(A[pivot], A[col]);
        double div = A[col][col];
        for (int c = col; c <= n; ++c) A[col][c] /= div;  // normalize pivot row
        for (int r = 0; r < n; ++r) if (r != col) {
            double factor = A[r][col];
            if (factor == 0.0) continue;
            for (int c = col; c <= n; ++c) A[r][c] -= factor * A[col][c];
        }
    }
    vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = A[i][n];
    return x;
}

struct ARModel {
    int p;               // order
    double c;            // intercept
    vector<double> phi;  // coefficients of lags 1..p
    double sigma2;       // noise variance estimate
};

// Fit AR(p) with intercept using normal equations (least squares)
ARModel fit_ar_least_squares(const vector<double>& x, int p) {
    int n = (int)x.size();
    if (n <= p) throw runtime_error("Not enough data to fit AR(p)");
    int m = n - p;  // number of rows in design matrix
    int d = p + 1;  // [intercept, x_{t-1}, ..., x_{t-p}]

    // Compute G = X^T X (d x d) and h = X^T y (d)
    vector<vector<double>> G(d, vector<double>(d, 0.0));
    vector<double> h(d, 0.0);

    auto get_row = [&](int t) {  // t from p..n-1 maps to row index t-p
        vector<double> row(d);
        row[0] = 1.0;  // intercept
        for (int k = 1; k <= p; ++k) row[k] = x[t - k];
        return row;
    };

    for (int t = p; t < n; ++t) {
        vector<double> r = get_row(t);
        double y = x[t];
        for (int i = 0; i < d; ++i) {
            h[i] += r[i] * y;
            for (int j = 0; j < d; ++j) G[i][j] += r[i] * r[j];
        }
    }

    // Solve for beta = [c, phi_1..phi_p]
    vector<double> beta = solve_linear_system(G, h);

    ARModel model;
    model.p = p;
    model.c = beta[0];
    model.phi.assign(p, 0.0);
    for (int k = 0; k < p; ++k) model.phi[k] = beta[k + 1];

    // Estimate sigma^2 from residuals
    double sse = 0.0;
    for (int t = p; t < n; ++t) {
        double pred = model.c;
        for (int k = 1; k <= p; ++k) pred += model.phi[k - 1] * x[t - k];
        double e = x[t] - pred;
        sse += e * e;
    }
    model.sigma2 = sse / m;  // MLE for Gaussian noise (using m observations)
    return model;
}

// One-step prediction given last p values (in order: x_{t-1},...,x_{t-p})
double predict_next(const ARModel& model, const deque<double>& last_p) {
    if ((int)last_p.size() != model.p) throw runtime_error("last_p must have size p");
    double pred = model.c;
    for (int k = 0; k < model.p; ++k) pred += model.phi[k] * last_p[k];
    return pred;
}

// k-step ahead forecast by iteratively feeding predictions back in
vector<double> forecast_k(const ARModel& model, const vector<double>& history, int k) {
    if ((int)history.size() < model.p) throw runtime_error("Not enough history for forecasting");
    deque<double> ctx;  // x_{t-1},...,x_{t-p}
    for (int i = 0; i < model.p; ++i) ctx.push_back(history[history.size() - 1 - i]);
    vector<double> out;
    for (int step = 0; step < k; ++step) {
        double y = predict_next(model, ctx);
        out.push_back(y);
        // Update context: push new y at the front, drop the oldest value
        ctx.push_front(y);
        if ((int)ctx.size() > model.p) ctx.pop_back();
    }
    return out;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate synthetic AR(2) data with Gaussian noise
    int n = 300;
    vector<double> x(n, 0.0);
    mt19937 rng(42);
    normal_distribution<double> noise(0.0, 0.5);
    // True process: x_t = 0.5 + 1.2 x_{t-1} - 0.5 x_{t-2} + eps_t
    x[0] = 0.0; x[1] = 0.3;
    for (int t = 2; t < n; ++t) {
        x[t] = 0.5 + 1.2 * x[t - 1] - 0.5 * x[t - 2] + noise(rng);
    }

    int p = 2;
    ARModel model = fit_ar_least_squares(x, p);

    cout << fixed << setprecision(4);
    cout << "Fitted AR(" << p << ") model:\n";
    cout << "  c = " << model.c << "\n";
    for (int k = 0; k < p; ++k) cout << "  phi_" << (k + 1) << " = " << model.phi[k] << "\n";
    cout << "  sigma^2 = " << model.sigma2 << "\n\n";

    // Compute Gaussian log-likelihood on training data
    double ll = 0.0;
    int m = n - p;
    double two_pi = acos(-1) * 2.0;
    for (int t = p; t < n; ++t) {
        double mu = model.c;
        for (int k2 = 1; k2 <= p; ++k2) mu += model.phi[k2 - 1] * x[t - k2];
        double e = x[t] - mu;
        ll += -0.5 * (log(two_pi * model.sigma2) + (e * e) / model.sigma2);
    }
    cout << "Training log-likelihood: " << ll << " (average per-step = " << (ll / m) << ")\n\n";

    // Forecast 5 steps ahead
    vector<double> fut = forecast_k(model, x, 5);
    cout << "5-step ahead forecasts:\n";
    for (double v : fut) cout << v << ' ';
    cout << "\n";
    return 0;
}

We fit a Gaussian AR(p) model with an intercept by forming normal equations (X^T X)β = X^T y and solving them via Gaussian elimination with partial pivoting. The code estimates coefficients, residual variance σ^2, computes the Gaussian log-likelihood, and performs k-step forecasting by iteratively feeding predictions back as inputs. This demonstrates estimation, evaluation, and generation for linear autoregressive time series.

Time: O(n p² + p³) for fitting; O(k p) for k-step forecasting
Space: O(p²) for fitting; O(p) during forecasting
Character-level n-gram autoregressive model with Laplace smoothing and sampling
#include <bits/stdc++.h>
using namespace std;

struct NGram {
    int p;                                            // context length
    double alpha;                                     // Laplace smoothing strength
    vector<char> idx2ch;                              // vocabulary mapping index -> char
    unordered_map<char, int> ch2idx;                  // char -> index
    unordered_map<string, vector<long long>> counts;  // context -> counts over V
    vector<long long> unigram;                        // counts for backoff on unseen contexts

    NGram(int p_, double alpha_) : p(p_), alpha(alpha_) {}

    // Build vocabulary from text (plus start '^' and end '$')
    void build_vocab(const string& text) {
        unordered_set<char> S(text.begin(), text.end());
        S.insert('^'); S.insert('$');
        idx2ch.assign(S.begin(), S.end());
        sort(idx2ch.begin(), idx2ch.end());
        ch2idx.clear();
        for (int i = 0; i < (int)idx2ch.size(); ++i) ch2idx[idx2ch[i]] = i;
        unigram.assign(idx2ch.size(), 0);
    }

    int V() const { return (int)idx2ch.size(); }

    // Train counts from one or more lines (each treated as a sequence ending with '$')
    void train(const vector<string>& lines) {
        // Ensure vocabulary is built before training
        if (idx2ch.empty()) {
            string all;
            for (auto& s : lines) all += s;
            build_vocab(all);
        }
        counts.clear();
        fill(unigram.begin(), unigram.end(), 0);
        for (const string& raw : lines) {
            // Build padded context with p start tokens
            string context(p, '^');
            for (char c : raw) {
                if (!ch2idx.count(c)) continue;  // skip unknowns (shouldn't happen after build_vocab)
                int v = ch2idx[c];
                unigram[v]++;
                auto& vec = counts[context];
                if (vec.empty()) vec.assign(V(), 0);
                vec[v]++;
                // advance context
                context.erase(context.begin());
                context.push_back(c);
            }
            // Add end token '$'
            int v_end = ch2idx['$'];
            unigram[v_end]++;
            auto& vec = counts[context];
            if (vec.empty()) vec.assign(V(), 0);
            vec[v_end]++;
        }
    }

    // Smoothed probability p(c_next | context)
    double prob(const string& context, char nextc) const {
        int v = ch2idx.at(nextc);
        auto it = counts.find(context);
        if (it != counts.end()) {
            const auto& vec = it->second;
            long long sum = 0;
            for (long long cnt : vec) sum += cnt;
            return (vec[v] + alpha) / (sum + alpha * V());
        }
        // Unseen context: back off to unigram counts with Laplace smoothing
        long long sum = 0;
        for (long long cnt : unigram) sum += cnt;
        return (unigram[v] + alpha) / (sum + alpha * V());
    }

    // Sample next char from p(. | context)
    char sample_next(const string& context, mt19937& rng) const {
        vector<double> probs; probs.reserve(V());
        double sum = 0.0;
        for (char c : idx2ch) {
            double pr = prob(context, c);
            probs.push_back(pr);
            sum += pr;
        }
        // Normalize (should be ~1 already, but ensure numerical stability)
        for (double& v : probs) v /= sum;
        uniform_real_distribution<double> unif(0.0, 1.0);
        double r = unif(rng);
        double acc = 0.0;
        for (int i = 0; i < (int)idx2ch.size(); ++i) {
            acc += probs[i];
            if (r <= acc) return idx2ch[i];
        }
        return idx2ch.back();  // fallback for floating-point round-off
    }

    // Compute log-likelihood and perplexity of a line
    pair<double, double> evaluate(const string& raw) const {
        string context(p, '^');
        double ll = 0.0; int T = 0;
        for (char ch : raw) {
            double pr = prob(context, ch);
            ll += log(max(pr, 1e-15));
            ++T;
            context.erase(context.begin());
            context.push_back(ch);
        }
        // Include the end token
        double pr_end = prob(context, '$');
        ll += log(max(pr_end, 1e-15));
        ++T;
        double ppl = exp(-(ll / T));
        return {ll, ppl};
    }

    // Generate a sequence up to max_len or until '$'
    string generate(int max_len, mt19937& rng) const {
        string context(p, '^');
        string out;
        for (int t = 0; t < max_len; ++t) {
            char c = sample_next(context, rng);
            if (c == '$') break;
            out.push_back(c);
            context.erase(context.begin());
            context.push_back(c);
        }
        return out;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Small corpus: each string is a separate training sequence
    vector<string> corpus = {
        "hello world",
        "hello there",
        "hi world",
        "hey you",
        "hola mundo"
    };

    int p = 3;           // context of 3 previous characters
    double alpha = 0.5;  // Laplace smoothing strength
    NGram lm(p, alpha);

    // Build vocabulary and train
    string all; for (auto& s : corpus) all += s;
    lm.build_vocab(all);
    lm.train(corpus);

    // Evaluate on a held-out example
    string test = "hello you";
    auto [ll, ppl] = lm.evaluate(test);
    cout << fixed << setprecision(4);
    cout << "Test: '" << test << "'\n";
    cout << "  Log-likelihood = " << ll << "\n";
    cout << "  Perplexity = " << ppl << "\n\n";

    // Generate samples
    mt19937 rng(123);
    for (int i = 0; i < 3; ++i) {
        string s = lm.generate(30, rng);
        cout << "Sample " << (i + 1) << ": " << s << "\n";
    }

    // Show a conditional probability example
    string ctx = string(p, '^');  // start-of-sequence context
    cout << "\nNext-char probabilities from start context (top 5 by prob):\n";
    vector<pair<double, char>> probs;
    for (char c : lm.idx2ch) {
        if (c == '^') continue;  // skip start token display
        probs.push_back({ lm.prob(ctx, c), c });
    }
    sort(probs.begin(), probs.end(), greater<>());
    for (int i = 0; i < (int)min<size_t>(5, probs.size()); ++i) {
        cout << "  '" << probs[i].second << "' : " << probs[i].first << "\n";
    }
    return 0;
}

This example trains a character-level p-gram (n-gram) autoregressive model with Laplace smoothing. It constructs p-length contexts, accumulates counts, converts them into smoothed conditional probabilities, evaluates log-likelihood and perplexity on a test string, and generates new samples by iteratively sampling the next character from p(x_i | x_{i-p:i-1}). It demonstrates the autoregressive factorization for discrete sequences and how to avoid zero probabilities via smoothing.

Time: O(N) to count over corpus length N; O(V) per sampling step due to normalization
Space: O(U · V), where U is the number of distinct contexts observed; O(V) for vocabulary storage
#autoregressive#ar model#n-gram#chain rule#time series#language modeling#maximum likelihood#laplace smoothing#perplexity#stationarity#yule-walker#forecasting#causal modeling#markov order