Algorithmic Information Theory
Key Points
- •Algorithmic Information Theory studies information content via the shortest programs that generate data, rather than via average-case probabilities.
- •Kolmogorov complexity K(x) is the length of the shortest program that outputs x on a fixed universal computer; it is incomputable but can be upper-bounded by compression.
- •Most strings are incompressible: nearly all n-bit strings require about n bits to describe, which formalizes the idea of randomness.
- •Algorithmic probability assigns higher prior weight to strings that have many short programs, leading to Solomonoff induction as a theoretically optimal predictor.
- •The Minimum Description Length (MDL) principle performs model selection by minimizing total code length: model cost plus data cost given the model.
- •Normalized Compression Distance (NCD) transforms any compressor into a task-agnostic similarity measure for clustering and retrieval.
- •AIT connects to AI/ML as a foundation for compression-based learning, generalization via Occam’s razor, and nonparametric priors.
- •Although exact K(x) is uncomputable, practical approximations using real compressors and MDL are powerful and widely applicable.
Prerequisites
- →Turing machines and computability — Kolmogorov complexity is defined with respect to a universal Turing machine and is incomputable.
- →Prefix codes and Kraft inequality — Algorithmic probability relies on prefix-freeness to justify probability mass assignments.
- →Shannon information theory — Contrasting entropy (average-case) with Kolmogorov complexity (individual-case) clarifies the concepts.
- →Probability and Bayesian inference — Solomonoff induction and MDL connect priors, likelihoods, and description lengths.
- →Data compression (e.g., LZ77/LZ78) — Compressors provide practical upper bounds on K(x) and power NCD.
- →Asymptotic notation — Properties like O(1) machine constants and O(log n) overheads appear throughout AIT.
- →Markov models — Used in MDL examples and common compression/prediction schemes.
- →Logarithms and exponentials — Description lengths are measured in bits as −log2 probabilities.
Detailed Explanation
01Overview
Algorithmic Information Theory (AIT) reimagines information as the length of the shortest computer program that can produce a given object, such as a string. Instead of asking how likely data is under a statistical model, AIT asks how simply it can be generated. The key quantity is Kolmogorov complexity, K(x), which measures the minimal description length of x with respect to a universal Turing machine. While this exact measure is mathematically robust and machine-invariant up to an additive constant, it is provably incomputable. Nonetheless, it provides a gold standard that guides practical principles. From AIT flow several cornerstone ideas: algorithmic probability, which forms a universal prior giving more mass to simpler explanations; Solomonoff induction, an idealized predictor that uses this prior; MDL, a pragmatic version of Occam’s razor that selects models minimizing total description length; and compression-based distances, like NCD, which repurpose compressors as similarity or clustering tools. In machine learning, these ideas justify why simpler models can generalize better, inspire universal coding schemes, and motivate compression-driven approaches to representation learning. While we cannot compute K(x), we can upper-bound it with real compressors and design algorithms that approximately follow the AIT playbook.
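To make the "upper-bound K(x) with real compressors" point concrete, here is a small illustrative sketch of our own (not part of the text). It assumes zlib is available (link with -lz), and the helper name zlibUpperBoundBits is ours. The compressed size of a string, plus the fixed size of the decompressor, is a legitimate upper bound on its Kolmogorov complexity.

#include <zlib.h>

#include <iostream>
#include <random>
#include <string>
#include <vector>

// DEFLATE-compressed size of s in bits: an upper bound on K(s) up to the
// constant cost of a fixed decompressor (here, zlib's inflate).
static size_t zlibUpperBoundBits(const std::string &s) {
    uLongf destLen = compressBound(s.size());
    std::vector<Bytef> out(destLen);
    int rc = compress2(out.data(), &destLen,
                       reinterpret_cast<const Bytef *>(s.data()),
                       s.size(), Z_BEST_COMPRESSION);
    if (rc != Z_OK) return 8 * s.size();  // fall back to the literal length
    return 8 * static_cast<size_t>(destLen);
}

int main() {
    std::string regular;
    for (int i = 0; i < 500; ++i) regular += "AB";  // 1000 chars, very regular

    std::mt19937 rng(42);
    std::uniform_int_distribution<int> letter(0, 25);
    std::string irregular;
    for (int i = 0; i < 1000; ++i) irregular += char('A' + letter(rng));  // pseudo-random letters

    std::cout << "literal size          : " << 8 * regular.size() << " bits each\n";
    std::cout << "upper bound, regular  : " << zlibUpperBoundBits(regular) << " bits\n";
    std::cout << "upper bound, irregular: " << zlibUpperBoundBits(irregular) << " bits\n";
    return 0;
}

Any general-purpose compressor would do here; the regular string compresses to a small fraction of its literal size, while the pseudo-random one does not.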
02Intuition & Analogies
Imagine you want to explain a long sequence of characters to a friend. If the sequence is “ABABAB…AB” repeating 100 times, you might say, “Write AB fifty times.” That’s a very short explanation for a long string—your friend can reconstruct the exact data from a concise recipe. But if the sequence looks like random gibberish, your best explanation might be to read it out letter by letter. That informal shortest-explanation length is the essence of Kolmogorov complexity. Think of programs as recipes and a universal computer as a perfect chef who follows any recipe you give. The complexity K(x) is the length of the shortest recipe that makes the chef produce x. Short recipes mean simple, regular data; long recipes mean complex, irregular data. Most long strings look random because there isn’t a trick or shortcut—no recipe much shorter than listing them verbatim. Now flip the story: if there are many short recipes that can make x, then before seeing any data, x was already quite plausible. That is algorithmic probability—the chance that a chef who randomly picks a recipe (short recipes more likely than long ones) will cook up x. Solomonoff induction is like hiring such a chef as your forecaster: at each step, it favors continuations that admit shorter recipes. In practice we can’t enumerate all recipes, but we can imitate this behavior with compression. Compressors are machines that search for short descriptions; if one string helps compress another, they’re probably related—this is the idea behind NCD. And MDL is what your inner explainer does: pick the model plus encoding of the data that needs the fewest bits overall—no more, no less.
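The "short recipe" idea can be made literal. Below is a tiny, self-contained C++ program (an illustration of ours, not from the text) whose source is only about a hundred bytes yet reproduces a 200-character string exactly; the program itself is a short description of that string.

// A ~100-byte "recipe" that regenerates a 200-character string exactly.
// Its own source length bounds the description length of the string it prints,
// up to the fixed cost of the compiler/runtime.
#include <iostream>
int main() {
    for (int i = 0; i < 100; ++i) std::cout << "AB";  // prints "ABAB...AB" (200 chars)
    std::cout << '\n';
    return 0;
}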
03Formal Definition
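For reference, a minimal LaTeX sketch of the standard formalism this subsection refers to, consistent with the Key Formulas listed later:

\[
K_U(x) \;=\; \min\{\, |p| \;:\; U(p) = x \,\},
\]
where $U$ is a fixed prefix-free universal Turing machine and $|p|$ is the length of program $p$ in bits. By the invariance theorem, for any two universal machines $U$ and $V$ there is a constant $c_{U,V}$ with $|K_U(x) - K_V(x)| \le c_{U,V}$ for all $x$, so the subscript is usually dropped. The corresponding algorithmic probability is
\[
m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|} \;\le\; 1,
\]
which is well defined because prefix-free program lengths satisfy the Kraft inequality.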
04When to Use
Use Kolmogorov-style reasoning when you care about absolute descriptive complexity rather than average-case entropy—for instance, to argue about overfitting and generalization: simpler hypotheses (shorter descriptions) are preferred. In practice:
- Model selection: Apply MDL when comparing models with different numbers of parameters or different structures; it balances fit (data code length) against model complexity (model code length).
- Clustering and retrieval: Use NCD when you want a task-agnostic similarity that leverages any off-the-shelf compressor; this helps in domains where feature engineering is hard (e.g., biosequences, source code, logs).
- Anomaly detection: Treat unusually poor compression as a signal that an item doesn’t conform to the learned regularities of a corpus (see the sketch after this list).
- Universal coding and prediction: Employ prequential/online codes (e.g., Krichevsky–Trofimov) as MDL-friendly predictors for discrete sequences.
- Theoretical bounds: Reason about the limits of learning and induction, or to justify Occam’s razor via description length.
Avoid relying on exact K(x) for algorithms—K is incomputable. Instead, substitute a realistic compressor (for upper bounds) or use principled codes (e.g., KT/NML) inside MDL frameworks.
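The anomaly-detection bullet above can be sketched in a few lines (our own illustration, assuming zlib is available and linked with -lz): the score is how many extra compressed bits an item adds on top of a reference corpus, so items that break the corpus's regularities score higher.

// Sketch of compression-based anomaly scoring (illustrative, not from the text).
// score(item) ~ C(corpus + item) - C(corpus): the extra bits needed to describe
// the item given the regularities the compressor found in the corpus.
#include <zlib.h>

#include <iostream>
#include <string>
#include <vector>

static long long deflateBits(const std::string &s) {
    uLongf len = compressBound(s.size());
    std::vector<Bytef> out(len);
    if (compress2(out.data(), &len, reinterpret_cast<const Bytef *>(s.data()),
                  s.size(), Z_BEST_COMPRESSION) != Z_OK)
        return 8LL * (long long)s.size();  // fall back to literal length
    return 8LL * (long long)len;
}

int main() {
    std::string corpus;
    for (int i = 0; i < 200; ++i) corpus += "GET /index.html 200\n";  // very regular "log"
    long long base = deflateBits(corpus);

    std::string normal = "GET /index.html 200\n";          // conforms to the corpus
    std::string odd    = "zq91#@!kfw-unseen-bytes-xxxx\n";  // does not conform

    std::cout << "score(normal) = " << deflateBits(corpus + normal) - base << " bits\n";
    std::cout << "score(odd)    = " << deflateBits(corpus + odd)    - base << " bits\n";
    return 0;
}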
⚠️Common Mistakes
- Treating K(x) as computable: There is no algorithm that returns exact Kolmogorov complexity for all x. Use compressors to get upper bounds or use MDL with explicit coding schemes.
- Ignoring the O(1) machine constant: For very short strings, the constant overhead can dominate; complexity comparisons only make sense asymptotically or with the same reference machine/codec.
- Confusing entropy with complexity: Shannon entropy is an average over a distribution; Kolmogorov complexity applies to individual strings. The expected complexity of samples roughly tracks the source entropy, but any single outcome can deviate: a fair-coin source (maximal entropy) can, by chance, emit an all-zeros string with very low K(x).
- Misusing NCD: Different compressors produce different distances; NCD is only a metric up to compressor ideality assumptions. Always fix one compressor and be wary of very short inputs.
- Overfitting in MDL: If L(M) (model cost) is omitted or understated, MDL collapses to maximum likelihood and can overfit. Include an appropriate penalty (e.g., (k/2) log2 n bits for k parameters, as in the worked example after this list) or a rigorous universal code.
- Double-dipping with training/test: Using the same data to tune compressor settings and to evaluate compression-based similarity can inflate performance. Separate calibration from evaluation.
- Ignoring data representation: Byte-level compressors are sensitive to encoding (e.g., UTF-8 vs UTF-16) and preprocessing. Keep representation consistent.
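A small worked instance of the parametric penalty mentioned in the overfitting bullet (our arithmetic, using the standard (k/2) log2 n form and the n = 200 sample size from the second code example below):

\[
L(M_{\text{Bernoulli}}) \approx \tfrac{1}{2}\cdot 1 \cdot \log_2 200 \approx 3.8 \text{ bits},
\qquad
L(M_{\text{Markov-1}}) \approx \tfrac{1}{2}\cdot 2 \cdot \log_2 200 \approx 7.6 \text{ bits}.
\]

The first-order Markov model is selected only if it shortens the data part $L(D \mid M)$ by more than the extra $\approx 3.8$ bits; with the penalty omitted, the richer model would win whenever it fits even slightly better.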
Key Formulas
Kolmogorov Complexity
K(x) = min{ |p| : U(p) = x }
Explanation: The length of the shortest prefix-free program that outputs x on the universal machine U. It formalizes the absolute description length of x.
Invariance Theorem
|K_U(x) − K_V(x)| ≤ c_{U,V} for all x
Explanation: Kolmogorov complexity depends on the reference machine only up to an additive constant. Comparisons are meaningful up to this constant.
Trivial Upper Bound
K(x) ≤ |x| + O(log |x|)   (plain complexity: C(x) ≤ |x| + O(1))
Explanation: A program can hardcode x and print it, so its description length is at most the literal length plus a small overhead (a constant for plain complexity; logarithmic for prefix-free programs, which must also delimit the literal).
Subadditivity (prefix version)
K(xy) ≤ K(x) + K(y | x) + O(1)
Explanation: To describe xy, describe x, then describe y given x, plus small overhead. This extends the chain rule to algorithmic complexity.
Incompressibility Counting
#{ x ∈ {0,1}^n : K(x) < n − k } ≤ 2^{n−k} − 1
Explanation: There are at most 2^{n−k} − 1 programs shorter than n − k bits, so fewer than 2^{n−k} strings of length n can be compressed by more than k bits. Therefore most n-bit strings are incompressible.
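A quick consequence of this counting bound (our arithmetic, stated in LaTeX):

\[
\frac{\#\{x \in \{0,1\}^n : K(x) < n-k\}}{2^n} \;<\; \frac{2^{n-k}}{2^n} \;=\; 2^{-k},
\]

so, for example, fewer than one n-bit string in $2^{10} = 1024$ can be compressed by more than 10 bits, regardless of n.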
Algorithmic Probability
m(x) = Σ_{p : U(p) = x} 2^{−|p|}
Explanation: The Solomonoff prior assigns probability to x by summing over all programs that produce x, weighting shorter programs exponentially more.
Levin Coding Theorem
K(x) = −log2 m(x) + O(1)
Explanation: Complexity equals the negative log of algorithmic probability up to a constant. High-probability strings have low complexity.
Solomonoff Mixture
M(x) = Σ_{p : U(p) starts with x} 2^{−|p|}, with M(x) ≥ c_μ · μ(x) for every computable measure μ
Explanation: Defines the universal semimeasure over prefixes. It dominates any computable measure, yielding theoretically optimal sequence prediction.
MDL Two-Part Code
L(D, M) = L(M) + L(D | M); choose M* = argmin_M [ L(M) + L(D | M) ]
Explanation: Total description length is the code for the model plus the code for data given the model. Choose the M that minimizes L(D, M).
Normalized Compression Distance
NCD(x, y) = ( C(xy) − min{C(x), C(y)} ) / max{C(x), C(y)}
Explanation: Compares similarity by how well x helps compress y (and vice versa) relative to their own compressed sizes. Values near 0 imply high similarity.
Kraft Inequality
Σ_p 2^{−|p|} ≤ 1 over any prefix-free set of programs p
Explanation: For any prefix-free set of programs (codes), the sum of 2^{−length} is at most 1. This underpins the probabilistic interpretation of program lengths.
Incomputability of K
There is no total computable function f with f(x) = K(x) for all x
Explanation: No algorithm can compute exact Kolmogorov complexity for all inputs. Any procedure can at best provide upper bounds.
Levin (Time-Bounded) Complexity
Kt(x) = min_p { |p| + log2 t(p) : U(p) = x within t(p) steps }
Explanation: Trades program length against running time. It favors short and fast programs, making search feasible in theory.
Complexity Analysis
Code Examples
#include <iostream>
#include <string>
#include <unordered_map>
#include <cmath>
#include <iomanip>

// Estimate compressed size (in bits) using a simple educational LZ78 scheme.
// This is not an optimized compressor; it is self-contained for teaching.
// Encoding model: emit pairs (index_of_phrase, next_char). Index uses
// ceil(log2(dict_size)) bits to reference any existing dictionary entry,
// and next_char uses 8 bits. We also emit a final pair if a phrase remains.

static size_t lz78CompressedBits(const std::string &s) {
    std::unordered_map<std::string, int> dict; // phrase -> index
    dict.reserve(s.size() * 2);
    dict.max_load_factor(0.7f);

    // Index 0 is the empty phrase ""
    dict[""] = 0;
    std::string w; // current phrase
    size_t bits = 0;

    auto indexBitWidth = [&](int dictSize) -> int {
        // Need to address indices in [0..dictSize-1]; use at least 1 bit
        if (dictSize <= 1) return 1; // only empty phrase exists
        return (int)std::ceil(std::log2((double)dictSize));
    };

    for (char c : s) {
        std::string wc = w;
        wc.push_back(c);
        if (dict.find(wc) != dict.end()) {
            // Extend current phrase
            w.swap(wc); // w = wc
        } else {
            // Emit (index(w), c); only the bit count matters for this estimate
            int idxBits = indexBitWidth((int)dict.size());
            bits += (size_t)idxBits + 8; // index + next_char
            // Add new phrase wc to dictionary with next index
            dict[wc] = (int)dict.size();
            // Reset current phrase
            w.clear();
        }
    }

    // If there's a remaining phrase w, emit (index(w), special_end_char)
    if (!w.empty()) {
        int idxBits = indexBitWidth((int)dict.size());
        bits += (size_t)idxBits + 8; // pay a char-sized marker cost
    }

    return bits;
}

static double ncd(const std::string &x, const std::string &y) {
    size_t Cx = lz78CompressedBits(x);
    size_t Cy = lz78CompressedBits(y);
    std::string xy = x + y;
    size_t Cxy = lz78CompressedBits(xy);

    size_t maxC = std::max(Cx, Cy);
    size_t minC = std::min(Cx, Cy);
    if (maxC == 0) return 0.0; // both empty strings
    return (double)(Cxy - minC) / (double)maxC;
}

int main() {
    std::string a = std::string(10, 'A') + std::string(10, 'B'); // AAAAAAAAAABBBBBBBBBB
    std::string b;
    for (int i = 0; i < 20; ++i) b += (i % 2 == 0 ? 'A' : 'B'); // ABABAB... (20)
    std::string c = "QWERTYUIOPASDFGHJKLZ"; // looks irregular

    size_t Ca = lz78CompressedBits(a);
    size_t Cb = lz78CompressedBits(b);
    size_t Cc = lz78CompressedBits(c);

    std::cout << "Estimated LZ78 bits (upper bounds on K up to constants):\n";
    std::cout << "C(a) = " << Ca << " bits\n";
    std::cout << "C(b) = " << Cb << " bits\n";
    std::cout << "C(c) = " << Cc << " bits\n\n";

    std::cout << std::fixed << std::setprecision(4);
    std::cout << "NCD(a,b) = " << ncd(a, b) << "\n";
    std::cout << "NCD(a,c) = " << ncd(a, c) << "\n";
    std::cout << "NCD(b,c) = " << ncd(b, c) << "\n";

    return 0;
}
We implement a compact LZ78-style compressor that emits (index, next_char) pairs and use its bitcount as a practical upper bound on Kolmogorov complexity. Repetitive strings compress much better than irregular ones. With the same compressor, we compute the Normalized Compression Distance (NCD), which compares how much extra description is needed for the concatenation versus the individual compressions.
#include <iostream>
#include <string>
#include <cmath>
#include <random>
#include <iomanip>

// Compute prequential KT code length (in bits) for a binary string under i.i.d. Bernoulli.
// Probability of next 1 at time t is (n1 + 1/2) / (t + 1); similarly for 0.
static double ktBernoulliCodeLenBits(const std::string &s) {
    int n0 = 0, n1 = 0; // counts so far
    double L = 0.0;
    for (size_t t = 0; t < s.size(); ++t) {
        int bit = (s[t] == '1') ? 1 : 0;
        double p = (bit == 1) ? (n1 + 0.5) / (t + 1.0) : (n0 + 0.5) / (t + 1.0);
        L += -std::log2(p);
        if (bit) ++n1; else ++n0;
    }
    return L;
}

// Compute prequential KT code length (in bits) for a binary string under a 1st-order Markov model.
// We maintain two KT predictors, one for context 0 and one for context 1.
static double ktMarkov1CodeLenBits(const std::string &s) {
    if (s.empty()) return 0.0;
    // Cost of first symbol with KT Bernoulli from scratch: p = 1/2 => 1 bit
    double L = 1.0;
    int c[2][2] = {{0,0},{0,0}}; // c[prev][next]
    int tot[2] = {0,0};          // totals per prev

    for (size_t i = 1; i < s.size(); ++i) {
        int prev = (s[i-1] == '1') ? 1 : 0;
        int bit = (s[i] == '1') ? 1 : 0;
        double p = (c[prev][bit] + 0.5) / (tot[prev] + 1.0);
        L += -std::log2(p);
        c[prev][bit]++; tot[prev]++;
    }
    return L;
}

struct MDLResult {
    std::string chosen;
    double L_bernoulli;
    double L_markov1;
};

static MDLResult mdlSelect(const std::string &s) {
    const int k_bern = 1; // number of parameters (asymptotically)
    const int k_mark = 2; // p(1|0), p(1|1)
    int n = (int)s.size();
    double L_D_bern = ktBernoulliCodeLenBits(s);
    double L_D_mark = ktMarkov1CodeLenBits(s);
    // Two-part MDL: L(M) ~ 0.5 * k * log2(n) + const; add 1 bit to identify the model
    double L_M_bern = 1.0 + 0.5 * k_bern * (n > 0 ? std::log2((double)n) : 0.0);
    double L_M_mark = 1.0 + 0.5 * k_mark * (n > 0 ? std::log2((double)n) : 0.0);

    double L_bern = L_M_bern + L_D_bern;
    double L_mark = L_M_mark + L_D_mark;

    MDLResult r;
    r.L_bernoulli = L_bern;
    r.L_markov1 = L_mark;
    r.chosen = (L_bern <= L_mark) ? "Bernoulli(iid)" : "Markov(1st-order)";
    return r;
}

int main() {
    // Example 1: Alternating pattern should favor Markov(1)
    std::string alt;
    for (int i = 0; i < 200; ++i) alt.push_back((i % 2) ? '1' : '0');
    // Example 2: Random-looking pattern should favor Bernoulli
    std::mt19937 rng(12345);
    std::bernoulli_distribution coin(0.5);
    std::string rnd;
    for (int i = 0; i < 200; ++i) rnd.push_back(coin(rng) ? '1' : '0');

    auto R1 = mdlSelect(alt);
    auto R2 = mdlSelect(rnd);

    std::cout << std::fixed << std::setprecision(3);
    std::cout << "ALT chosen: " << R1.chosen
              << ", L_bern=" << R1.L_bernoulli
              << ", L_mark=" << R1.L_markov1 << "\n";
    std::cout << "RND chosen: " << R2.chosen
              << ", L_bern=" << R2.L_bernoulli
              << ", L_mark=" << R2.L_markov1 << "\n";

    return 0;
}
We implement MDL model selection between two binary sequence models. The data code length is computed with prequential KT coding: for Bernoulli, a single KT predictor over the whole sequence; for first-order Markov, two KT predictors conditioned on the previous bit. The model cost penalizes parameters by roughly 0.5 k log2 n bits and adds 1 bit to identify the model. The chosen model minimizes total description length.
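As background on why the prequential KT code is a reasonable stand-in for $L(D \mid M)$ (a standard redundancy bound, quoted here from memory rather than from the text): for a binary sequence of length $n$ with empirical frequency $\hat\theta$ of ones,

\[
L_{\text{KT}}(x_{1:n}) \;\le\; n\,H(\hat\theta) + \tfrac{1}{2}\log_2 n + O(1),
\]

where $H$ is the binary entropy function. The KT code therefore tracks the best i.i.d. model in hindsight to within roughly $\tfrac{1}{2}\log_2 n$ bits, which is exactly the per-parameter cost charged by the MDL penalty used above.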