
Algorithmic Information Theory

Key Points

  • Algorithmic Information Theory studies information content via the shortest programs that generate data, rather than via average-case probabilities.
  • Kolmogorov complexity K(x) is the length of the shortest program that outputs x on a fixed universal computer; it is incomputable but can be upper-bounded by compression.
  • Most strings are incompressible: nearly all n-bit strings require about n bits to describe, which formalizes the idea of randomness.
  • Algorithmic probability assigns higher prior weight to strings that have many short programs, leading to Solomonoff induction as a theoretically optimal predictor.
  • The Minimum Description Length (MDL) principle performs model selection by minimizing total code length: model cost plus data cost given the model.
  • Normalized Compression Distance (NCD) transforms any compressor into a task-agnostic similarity measure for clustering and retrieval.
  • AIT connects to AI/ML as a foundation for compression-based learning, generalization via Occam’s razor, and nonparametric priors.
  • Although exact K(x) is uncomputable, practical approximations using real compressors and MDL are powerful and widely applicable.

Prerequisites

  • Turing machines and computability: Kolmogorov complexity is defined with respect to a universal Turing machine and is incomputable.
  • Prefix codes and Kraft inequality: Algorithmic probability relies on prefix-freeness to justify probability mass assignments.
  • Shannon information theory: Contrasting entropy (average-case) with Kolmogorov complexity (individual-case) clarifies the concepts.
  • Probability and Bayesian inference: Solomonoff induction and MDL connect priors, likelihoods, and description lengths.
  • Data compression (e.g., LZ77/LZ78): Compressors provide practical upper bounds on K(x) and power NCD.
  • Asymptotic notation: Properties like O(1) machine constants and O(log n) overheads appear throughout AIT.
  • Markov models: Used in MDL examples and common compression/prediction schemes.
  • Logarithms and exponentials: Description lengths are measured in bits as −log2 probabilities.

Detailed Explanation


01 Overview

Algorithmic Information Theory (AIT) reimagines information as the length of the shortest computer program that can produce a given object, such as a string. Instead of asking how likely data is under a statistical model, AIT asks how simply it can be generated. The key quantity is Kolmogorov complexity, K(x), which measures the minimal description length of x with respect to a universal Turing machine. While this exact measure is mathematically robust and machine-invariant up to an additive constant, it is provably incomputable. Nonetheless, it provides a gold standard that guides practical principles. From AIT flow several cornerstone ideas: algorithmic probability, which forms a universal prior giving more mass to simpler explanations; Solomonoff induction, an idealized predictor that uses this prior; MDL, a pragmatic version of Occam’s razor that selects models minimizing total description length; and compression-based distances, like NCD, which repurpose compressors as similarity or clustering tools. In machine learning, these ideas justify why simpler models can generalize better, inspire universal coding schemes, and motivate compression-driven approaches to representation learning. While we cannot compute K(x), we can upper-bound it with real compressors and design algorithms that approximately follow the AIT playbook.

02 Intuition & Analogies

Imagine you want to explain a long sequence of characters to a friend. If the sequence is “ABABAB…AB” with “AB” repeated fifty times, you might say, “Write AB fifty times.” That’s a very short explanation for a long string—your friend can reconstruct the exact data from a concise recipe. But if the sequence looks like random gibberish, your best explanation might be to read it out letter by letter. That informal shortest-explanation length is the essence of Kolmogorov complexity. Think of programs as recipes and a universal computer as a perfect chef who follows any recipe you give. The complexity K(x) is the length of the shortest recipe that makes the chef produce x. Short recipes mean simple, regular data; long recipes mean complex, irregular data. Most long strings look random because there isn’t a trick or shortcut—no recipe much shorter than listing them verbatim. Now flip the story: if there are many short recipes that can make x, then before seeing any data, x was already quite plausible. That is algorithmic probability—the chance that a chef who randomly picks a recipe (short recipes more likely than long ones) will cook up x. Solomonoff induction is like hiring such a chef as your forecaster: at each step, it favors continuations that admit shorter recipes. In practice we can’t enumerate all recipes, but we can imitate this behavior with compression. Compressors are machines that search for short descriptions; if one string helps compress another, they’re probably related—this is the idea behind NCD. And MDL is what your inner explainer does: pick the model plus encoding of the data that needs the fewest bits overall—no more, no less.

03 Formal Definition

Fix a universal prefix Turing machine U. The (prefix) Kolmogorov complexity of a binary string x is defined by \[ K_U(x) = \min\{\, \ell(p) : U(p) = x \,\}, \] where ℓ(p) is the length of program p in bits. By the invariance theorem, for any two universal machines U and V there exists a constant c_{U,V} such that for all x, \[ |K_U(x) - K_V(x)| \le c_{U,V}. \] Thus we usually write K(x), ignoring O(1) machine-dependent constants. Conditional complexity K(x | y) is the length of the shortest p such that U(p, y) = x. Fundamental properties include upper bounds like K(x) ≤ ℓ(x) + O(1) (print x literally), and subadditivity forms such as K(xy) ≤ K(x) + K(y | x) + O(log(ℓ(x) + ℓ(y))). Incompressibility theorems state that for any n and k, at least 2^n - 2^{n-k} strings of length n satisfy K(x) ≥ n - k. Algorithmic probability (Solomonoff’s prior) is \[ m(x) = \sum_{p\,:\,U(p) = x} 2^{-\ell(p)}, \] a semimeasure whose total mass is at most 1 by the Kraft inequality. The Levin coding theorem links probability and complexity: \[ K(x) = -\log_2 m(x) + O(1). \] The Solomonoff predictor uses the corresponding prefix probabilities to predict sequences. In practice, MDL minimizes a two-part code L(D, M) = L(M) + L(D | M), while the Normalized Compression Distance is \[ \mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}, \] with C(·) the compressed size under a fixed compressor.
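The incompressibility bound follows from a counting argument: programs shorter than n − k bits number at most \[ \sum_{i=0}^{n-k-1} 2^{i} = 2^{n-k} - 1, \] and each program outputs at most one string, so fewer than 2^{n-k} of the 2^n strings of length n can satisfy K(x) < n − k; all the rest must have K(x) ≥ n − k.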

04 When to Use

Use Kolmogorov-style reasoning when you care about absolute descriptive complexity rather than average-case entropy—for instance, to argue about overfitting and generalization: simpler hypotheses (shorter descriptions) are preferred. In practice:

  • Model selection: Apply MDL when comparing models with different numbers of parameters or different structures; it balances fit (data code length) against model complexity (model code length).
  • Clustering and retrieval: Use NCD when you want a task-agnostic similarity that leverages any off-the-shelf compressor; this helps in domains where feature engineering is hard (e.g., biosequences, source code, logs).
  • Anomaly detection: Treat unusually poor compression as a signal that an item doesn’t conform to the learned regularities of a corpus.
  • Universal coding and prediction: Employ prequential/online codes (e.g., Krichevsky–Trofimov) as MDL-friendly predictors for discrete sequences.
  • Theoretical bounds: Reason about the limits of learning and induction, or justify Occam’s razor via description length.

Avoid relying on exact K(x) inside algorithms, since K is incomputable. Instead, substitute a realistic compressor (for upper bounds) or use principled codes (e.g., KT/NML) inside MDL frameworks; a compressor-based sketch follows below.
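As a minimal sketch of the compressor-substitution route, the following computes NCD with an off-the-shelf compressor. It assumes zlib is installed on the system (compile with -lz); the helper names and example strings are illustrative only, not part of any particular library or dataset.

#include <algorithm>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>
#include <zlib.h>

// Compressed size in bytes under zlib at its highest compression level.
static size_t zlibCompressedBytes(const std::string &s) {
    uLongf destLen = compressBound(static_cast<uLong>(s.size()));
    std::vector<Bytef> out(destLen);
    if (compress2(out.data(), &destLen,
                  reinterpret_cast<const Bytef *>(s.data()),
                  static_cast<uLong>(s.size()), Z_BEST_COMPRESSION) != Z_OK) {
        throw std::runtime_error("zlib compression failed");
    }
    return static_cast<size_t>(destLen);
}

// NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), clamped at 0.
static double ncdZlib(const std::string &x, const std::string &y) {
    size_t cx = zlibCompressedBytes(x);
    size_t cy = zlibCompressedBytes(y);
    size_t cxy = zlibCompressedBytes(x + y);
    double hi = static_cast<double>(std::max(cx, cy));
    double lo = static_cast<double>(std::min(cx, cy));
    if (hi == 0.0) return 0.0; // defensive: avoid division by zero
    double num = static_cast<double>(cxy) - lo;
    return (num < 0.0 ? 0.0 : num) / hi;
}

int main() {
    std::string a(2000, 'A');      // highly regular
    std::string b = a;
    b[1000] = 'B';                 // nearly identical to a
    std::string c;                 // irregular-looking filler via a simple LCG
    unsigned seed = 42u;
    for (int i = 0; i < 2000; ++i) {
        seed = seed * 1664525u + 1013904223u;
        c.push_back(static_cast<char>('A' + (seed >> 24) % 26));
    }
    std::cout << "NCD(a,b) = " << ncdZlib(a, b) << "\n"; // expect a value near 0 (similar)
    std::cout << "NCD(a,c) = " << ncdZlib(a, c) << "\n"; // expect a value near 1 (dissimilar)
    return 0;
}

Any general-purpose compressor can be swapped in for zlib; as noted under Common Mistakes, keep the compressor fixed across comparisons and avoid very short inputs, where header overhead dominates.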

⚠️ Common Mistakes

  • Treating K(x) as computable: There is no algorithm that returns exact Kolmogorov complexity for all x. Use compressors to get upper bounds or use MDL with explicit coding schemes.
  • Ignoring the O(1) machine constant: For very short strings, the constant overhead can dominate; complexity comparisons only make sense asymptotically or with the same reference machine/codec.
  • Confusing entropy with complexity: Shannon entropy is an average over a distribution; Kolmogorov complexity is a property of an individual string. A string with balanced symbol frequencies can still have low K(x) (e.g., 0101…), and a string drawn from a high-entropy source can happen to be highly compressible.
  • Misusing NCD: Different compressors produce different distances; NCD is only a metric up to compressor ideality assumptions. Always fix one compressor and be wary of very short inputs.
  • Overfitting in MDL: If L(M) (model cost) is omitted or understated, MDL collapses to maximum likelihood and can overfit. Include an appropriate penalty (e.g., 0.5 k log2 n bits for k parameters) or a rigorous universal code.
  • Double-dipping with training/test: Using the same data to tune compressor settings and to evaluate compression-based similarity can inflate performance. Separate calibration from evaluation.
  • Ignoring data representation: Byte-level compressors are sensitive to encoding (e.g., UTF-8 vs UTF-16) and preprocessing. Keep representation consistent.

Key Formulas

Kolmogorov Complexity

Explanation: Defines the shortest prefix-free program that outputs x on universal machine U. It formalizes the absolute description length of x.
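In the notation of the Formal Definition section: \[ K_U(x) = \min\{\, \ell(p) : U(p) = x \,\}. \]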

Invariance Theorem

Explanation: Kolmogorov complexity depends on the reference machine only up to an additive constant. Comparisons are meaningful up to this constant.
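For any two universal machines U and V there is a constant c_{U,V} with \[ |K_U(x) - K_V(x)| \le c_{U,V} \quad \text{for all } x. \]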

Trivial Upper Bound

Explanation: A program can hardcode x and print it, so its description length is at most the literal length plus a fixed overhead.
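With ℓ(x) the literal length of x in bits: \[ K(x) \le \ell(x) + O(1). \]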

Subadditivity (prefix version)

Explanation: To describe xy, describe x, then describe y given x, plus small overhead. This extends the chain rule to algorithmic complexity.
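One common form, with logarithmic overhead for delimiting the parts: \[ K(xy) \le K(x) + K(y \mid x) + O(\log(\ell(x) + \ell(y))). \]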

Incompressibility Counting

Explanation: At most 2^{n-k}-1 strings of length n can be compressed by at least k bits. Therefore most n-bit strings are incompressible.
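Counting strings of length n that can be described in fewer than n − k bits: \[ \bigl|\{\, x \in \{0,1\}^n : K(x) < n - k \,\}\bigr| \le 2^{n-k} - 1. \]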

Algorithmic Probability

Explanation: The Solomonoff prior assigns probability to x by summing over all programs that produce x, weighting shorter programs exponentially more.
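Summing over all programs that output x: \[ m(x) = \sum_{p\,:\,U(p) = x} 2^{-\ell(p)}. \]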

Levin Coding Theorem

Explanation: Complexity equals the negative log of algorithmic probability up to a constant. High-probability strings have low complexity.
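Relating the universal prior to prefix complexity: \[ K(x) = -\log_2 m(x) + O(1). \]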

Solomonoff Mixture

Explanation: Defines the universal semimeasure over prefixes. It dominates any computable measure, yielding theoretically optimal sequence prediction.
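One common way to write the universal semimeasure, where U(p) = x* means the output of p begins with x: \[ M(x) = \sum_{p\,:\,U(p) = x*} 2^{-\ell(p)}. \]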

MDL Two-Part Code

Explanation: Total description length is the code for the model plus the code for data given the model. Choose M that minimizes L(D,M).
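Choosing the model that minimizes total code length: \[ L(D, M) = L(M) + L(D \mid M), \qquad M^{*} = \arg\min_{M} L(D, M). \]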

Normalized Compression Distance

Explanation: Compares similarity by how well x helps compress y (and vice versa) relative to their own compressed sizes. Values near 0 imply high similarity.
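With C(·) the compressed size under a fixed compressor: \[ \mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}. \]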

Kraft Inequality

Explanation: For any prefix-free set of programs (codes), the sum of 2^{-length} over all codewords is at most 1. This underpins the probabilistic interpretation of program lengths.
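For any prefix-free set of programs P: \[ \sum_{p \in P} 2^{-\ell(p)} \le 1. \]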

Incomputability of K

Explanation: No algorithm can compute exact Kolmogorov complexity for all inputs. Any procedure can at best provide upper bounds.
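Stated formally as a non-existence claim: \[ \nexists \text{ total computable } f \text{ such that } f(x) = K(x) \text{ for all } x. \]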

Levin (Time-Bounded) Complexity

Explanation: Trades program length and running time. It favors short and fast programs, making search feasible in theory.
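Levin’s time-bounded variant charges the logarithm of running time in addition to program length: \[ Kt(x) = \min_{p}\bigl\{\, \ell(p) + \log_2 t(p) : U(p) = x \text{ within } t(p) \text{ steps} \,\bigr\}. \]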

Complexity Analysis

Exact Kolmogorov complexity is incomputable, so algorithmic analyses focus on approximations: compressors and explicit coding schemes. When upper-bounding K(x) by a compressor C, the computational cost is that of C. Many practical dictionary/block compressors (e.g., LZ77/LZ78) run in roughly O(n) to O(n log n) time on input length n, with memory between O(1) and O(n) depending on window and data structure choices. Normalized Compression Distance requires three compressions—C(x), C(y), and C(xy)—so its time is about three times the compressor cost; space is that of a single compression run. For MDL with prequential codes like the Krichevsky–Trofimov (KT) estimator on binary sequences, computing the data code length is O(n) time and O(1) space by maintaining symbol counts incrementally and summing −log probabilities. For a first-order Markov KT code, we maintain separate counts per context (two contexts for binary), still O(n) time and O(1) space. The model cost component (e.g., 0.5 k log n bits for k parameters) is O(1) to compute given n and k. Educational compressors implemented from scratch (e.g., naïve LZ78 with string keys) may incur higher constants and, due to string concatenations and hash-table operations, can degrade toward O(n^2) in worst cases. Optimized implementations using integer dictionaries and rolling hashes bring this back toward O(n). In all cases, ensure consistent encoding units (bits vs bytes) and handle very short strings carefully, where additive constants dominate and numerical stability (e.g., log of small probabilities) matters.

Code Examples

Estimating Kolmogorov Complexity and Computing NCD with a Simple LZ78 Compressor
#include <iostream>
#include <string>
#include <unordered_map>
#include <algorithm>
#include <cmath>
#include <iomanip>

// Estimate compressed size (in bits) using a simple educational LZ78 scheme.
// This is not an optimized compressor; it is self-contained for teaching.
// Encoding model: emit pairs (index_of_phrase, next_char). Index uses
// ceil(log2(dict_size)) bits (at least 1) to reference any existing dictionary
// entry, and next_char uses 8 bits. We also emit a final pair if a phrase remains.

static size_t lz78CompressedBits(const std::string &s) {
    std::unordered_map<std::string, int> dict; // phrase -> index
    dict.reserve(s.size() * 2);
    dict.max_load_factor(0.7f);

    // Index 0 is the empty phrase ""
    dict[""] = 0;
    std::string w; // current phrase
    size_t bits = 0;

    auto indexBitWidth = [&](int dictSize) -> int {
        // Need to address indices in [0..dictSize-1]; use at least 1 bit
        if (dictSize <= 1) return 1; // only empty phrase exists
        return (int)std::ceil(std::log2((double)dictSize));
    };

    for (char c : s) {
        std::string wc = w;
        wc.push_back(c);
        if (dict.find(wc) != dict.end()) {
            // Extend current phrase
            w.swap(wc); // w = wc
        } else {
            // Emit (index(w), c); only the bit cost matters for size estimation
            int idxBits = indexBitWidth((int)dict.size());
            bits += (size_t)idxBits + 8; // index + next_char
            // Add new phrase wc to dictionary with the next free index
            int newIndex = (int)dict.size();
            dict[wc] = newIndex;
            // Reset current phrase
            w.clear();
        }
    }

    // If there's a remaining phrase w, emit (index(w), special_end_char)
    if (!w.empty()) {
        int idxBits = indexBitWidth((int)dict.size());
        bits += (size_t)idxBits + 8; // pay a char-sized marker cost
    }

    return bits;
}

static double ncd(const std::string &x, const std::string &y) {
    size_t Cx = lz78CompressedBits(x);
    size_t Cy = lz78CompressedBits(y);
    std::string xy = x + y;
    size_t Cxy = lz78CompressedBits(xy);

    size_t maxC = std::max(Cx, Cy);
    size_t minC = std::min(Cx, Cy);
    if (maxC == 0) return 0.0; // both empty strings
    return (double)(Cxy - minC) / (double)maxC;
}

int main() {
    std::string a = std::string(10, 'A') + std::string(10, 'B'); // AAAAAAAAAABBBBBBBBBB
    std::string b;
    for (int i = 0; i < 20; ++i) b += (i % 2 == 0 ? 'A' : 'B'); // ABABAB... (20 chars)
    std::string c = "QWERTYUIOPASDFGHJKLZ"; // looks irregular

    size_t Ca = lz78CompressedBits(a);
    size_t Cb = lz78CompressedBits(b);
    size_t Cc = lz78CompressedBits(c);

    std::cout << "Estimated LZ78 bits (upper bounds on K up to constants):\n";
    std::cout << "C(a) = " << Ca << " bits\n";
    std::cout << "C(b) = " << Cb << " bits\n";
    std::cout << "C(c) = " << Cc << " bits\n\n";

    std::cout << std::fixed << std::setprecision(4);
    std::cout << "NCD(a,b) = " << ncd(a, b) << "\n";
    std::cout << "NCD(a,c) = " << ncd(a, c) << "\n";
    std::cout << "NCD(b,c) = " << ncd(b, c) << "\n";

    return 0;
}

We implement a compact LZ78-style compressor that emits (index, next_char) pairs and use its bitcount as a practical upper bound on Kolmogorov complexity. Repetitive strings compress much better than irregular ones. With the same compressor, we compute the Normalized Compression Distance (NCD), which compares how much extra description is needed for the concatenation versus the individual compressions.

Time: O(n) average with hashing, but may degrade toward O(n^2) from string concatenations and hash collisions in this educational version.
Space: O(n) for the dictionary in the worst case.
MDL Model Selection: i.i.d. Bernoulli vs First-Order Markov Using KT Codes
#include <iostream>
#include <string>
#include <cmath>
#include <random>
#include <iomanip>

// Compute prequential KT code length (in bits) for a binary string under i.i.d. Bernoulli.
// Probability of next 1 at time t is (n1 + 1/2) / (t + 1); similarly for 0.
static double ktBernoulliCodeLenBits(const std::string &s) {
    int n0 = 0, n1 = 0; // counts so far
    double L = 0.0;
    for (size_t t = 0; t < s.size(); ++t) {
        int bit = (s[t] == '1') ? 1 : 0;
        double p = (bit == 1) ? (n1 + 0.5) / (t + 1.0) : (n0 + 0.5) / (t + 1.0);
        L += -std::log2(p);
        if (bit) ++n1; else ++n0;
    }
    return L;
}

// Compute prequential KT code length (in bits) for a binary string under a 1st-order Markov model.
// We maintain two KT predictors, one for context 0 and one for context 1.
static double ktMarkov1CodeLenBits(const std::string &s) {
    if (s.empty()) return 0.0;
    // Cost of first symbol with KT Bernoulli from scratch: p = 1/2 => 1 bit
    double L = 1.0;
    int c[2][2] = {{0,0},{0,0}}; // c[prev][next]
    int tot[2] = {0,0};          // totals per prev

    for (size_t i = 1; i < s.size(); ++i) {
        int prev = (s[i-1] == '1') ? 1 : 0;
        int bit = (s[i] == '1') ? 1 : 0;
        double p = (c[prev][bit] + 0.5) / (tot[prev] + 1.0);
        L += -std::log2(p);
        c[prev][bit]++; tot[prev]++;
    }
    return L;
}

struct MDLResult {
    std::string chosen;
    double L_bernoulli;
    double L_markov1;
};

static MDLResult mdlSelect(const std::string &s) {
    const int k_bern = 1; // number of parameters (asymptotically)
    const int k_mark = 2; // p(1|0), p(1|1)
    int n = (int)s.size();
    double L_D_bern = ktBernoulliCodeLenBits(s);
    double L_D_mark = ktMarkov1CodeLenBits(s);
    // Two-part MDL: L(M) ~ 0.5 * k * log2(n) + const; add 1 bit to identify the model
    double L_M_bern = 1.0 + 0.5 * k_bern * (n > 0 ? std::log2((double)n) : 0.0);
    double L_M_mark = 1.0 + 0.5 * k_mark * (n > 0 ? std::log2((double)n) : 0.0);

    double L_bern = L_M_bern + L_D_bern;
    double L_mark = L_M_mark + L_D_mark;

    MDLResult r;
    r.L_bernoulli = L_bern;
    r.L_markov1 = L_mark;
    r.chosen = (L_bern <= L_mark) ? "Bernoulli(iid)" : "Markov(1st-order)";
    return r;
}

int main() {
    // Example 1: Alternating pattern should favor Markov(1)
    std::string alt;
    for (int i = 0; i < 200; ++i) alt.push_back((i % 2) ? '1' : '0');
    // Example 2: Random-looking pattern should favor Bernoulli
    std::mt19937 rng(12345);
    std::bernoulli_distribution coin(0.5);
    std::string rnd;
    for (int i = 0; i < 200; ++i) rnd.push_back(coin(rng) ? '1' : '0');

    auto R1 = mdlSelect(alt);
    auto R2 = mdlSelect(rnd);

    std::cout << std::fixed << std::setprecision(3);
    std::cout << "ALT chosen: " << R1.chosen
              << ", L_bern=" << R1.L_bernoulli
              << ", L_mark=" << R1.L_markov1 << "\n";
    std::cout << "RND chosen: " << R2.chosen
              << ", L_bern=" << R2.L_bernoulli
              << ", L_mark=" << R2.L_markov1 << "\n";

    return 0;
}

We implement MDL model selection between two binary sequence models. The data code length is computed with prequential KT coding: for Bernoulli, a single KT predictor over the whole sequence; for first-order Markov, two KT predictors conditioned on the previous bit. The model cost penalizes parameters by roughly 0.5 k log2 n bits and adds 1 bit to identify the model. The chosen model minimizes total description length.

Time: O(n) for each model’s code length; overall O(n).
Space: O(1) extra space for counters.