
Minimum Description Length (MDL)

Key Points

  • Minimum Description Length (MDL) picks the model that compresses the data best by minimizing L(M) + L(D|M).
  • L(M) is the number of bits to describe the model, and L(D|M) is the number of bits to describe the data once the model is known.
  • In practice, MDL often reduces to maximizing likelihood with an explicit penalty for model complexity, like the BIC term (k/2) log n.
  • MDL prevents overfitting by charging for every parameter you add, so extra model flexibility must pay for itself in shorter residual codes.
  • You can implement MDL with two-part codes (send parameters, then residuals) or one-part codes (universal codes like NML).
  • For Gaussian noise, MDL uses the residual sum of squares with a (k/2) log n penalty, making it easy to compare polynomial degrees.
  • For histograms, MDL balances the data entropy against the ((k − 1)/2) log n penalty for k bins to avoid too many bins.
  • Using natural logs yields code lengths in nats; divide by ln 2 to convert to bits, and be consistent across all terms.

Prerequisites

  • Basic probability and distributions — MDL uses likelihoods and probabilities to translate data fit into code lengths.
  • Linear algebra — Polynomial regression fitting requires solving linear systems from normal equations.
  • Information theory (entropy, prefix codes) — MDL is grounded in coding theory; understanding entropy and code lengths is essential.
  • Asymptotic statistics and MLE — The MDL/BIC penalty (k/2) log n relies on asymptotic properties of maximum likelihood estimators.
  • Numerical stability in linear solvers — Fitting models (e.g., high-degree polynomials) can be ill-conditioned without stable solvers.

Detailed Explanation


01 Overview

Minimum Description Length (MDL) is a general principle for model selection rooted in data compression. It says the best explanation for data is the one that minimizes the total description length: the cost to describe the model plus the cost to describe the data given that model. Instead of only maximizing fit (likelihood), MDL adds a precise penalty for complexity, measured in bits. This turns model choice into a coding problem: which model lets us transmit the dataset using the fewest bits overall? MDL is closely related to Occam’s razor, Kolmogorov complexity, and Bayesian model selection. In practice, MDL often uses approximations that are easy to compute, such as the Bayesian Information Criterion (BIC), which adds a (k/2) log n term to the negative log-likelihood for k parameters and n data points. MDL applies widely: picking polynomial degree in regression, selecting histogram bin counts, choosing the number of clusters, or deciding the order of a Markov model. The principle is robust: it guards against overfitting by requiring that any extra parameter must reduce the residual code length enough to offset its own description cost. Because description lengths depend on coding assumptions, practical MDL requires you to specify how parameters and residuals are encoded, but common choices lead to simple, effective criteria.

02 Intuition & Analogies

Imagine you must email a large dataset to a friend who knows nothing about it. You can either send the raw data, or first send a short program (the model) that can regenerate most of it and then send the small differences (residuals). If the model captures real structure, the residuals will be short, and the total message is smaller. If the model is too simple, residuals are long because it misses structure. If the model is too complex, the program itself is long, and although residuals shrink, the total message can grow. MDL formalizes this trade-off: total length equals "program/model" length plus "residual" length. Think of modeling a noisy curve with polynomials. A straight line (degree 1) is a tiny program, but might leave big residuals; a 20th-degree polynomial overfits, making the program long while residuals are tiny. MDL asks: which degree gives the shortest combined message? Another analogy is packing for a trip with luggage fees. More outfits (parameters) give flexibility (lower residual costs) but increase your luggage fee (model cost). You optimize total expense, not just style. For histograms, more bins fit the data's wiggles (lower within-bin uncertainty), but each bin costs because you must specify its probability or count. MDL picks the bin count that yields the shortest encoded dataset when both the bin definitions and the bin counts are considered.

03 Formal Definition

Given a model class \(\mathcal{M}\) and data \(D\), MDL chooses the model \(M \in \mathcal{M}\) minimizing total code length \(L(M) + L(D \mid M)\). A code length function \(L(\cdot)\) must be prefix-free, satisfying Kraft’s inequality \(\sum_x 2^{-L(x)} \le 1\). In two-part MDL for parametric models \(M = (F, \theta)\), we encode the model index and parameters (with some precision), then encode the data assuming distribution \(p_\theta\): \(L(M, D) = L(M) - \log p_\theta(D)\), up to the choice of log base. One-part MDL (stochastic complexity) uses a universal code like normalized maximum likelihood (NML): \(L(D, F) = -\log p_{\hat\theta(D)}(D) + \log \sum_{D'} p_{\hat\theta(D')}(D')\), where \(\hat\theta(D)\) denotes the MLE for a dataset. For regular models with large \(n\), both lead to the same asymptotic form: \(L(D, M) = -\log p_{\hat\theta}(D) + \frac{k}{2} \log n + O(1)\), where \(k\) is the number of free parameters. This is the BIC/MDL penalty. In Gaussian regression with MLE variance \(\hat\sigma^2 = \mathrm{RSS}/n\), we have \(-\log p_{\hat\theta}(D) = \frac{n}{2}\big[1 + \log(2\pi \hat\sigma^2)\big]\). All logs must use the same base: natural logs give lengths in nats; divide by \(\ln 2\) to convert to bits. MDL thus operationalizes Occam’s razor: choose the model minimizing code length.

04 When to Use

Use MDL whenever you must choose among models or hyperparameters and you want a principled, data-driven balance between fit and complexity. Examples include: selecting polynomial degree in regression; determining the number of histogram bins; picking the number of clusters (with a probabilistic clustering model); choosing the order of a Markov chain for sequences; selecting features or regularization strength when these can be interpreted as affecting code length. MDL is especially attractive when likelihoods are computable and when models are nested or comparable by parameter count, making the penalty term easy to evaluate. It is robust in small samples compared to validation splits when data are scarce, although cross-validation remains a strong alternative. MDL is also useful for change-point detection and piecewise models via penalties per segment or per parameter. Avoid MDL when coding assumptions are unclear or misspecified (e.g., heavy tails but using Gaussian code), since the computed code lengths may not reflect true compressibility. Also, if models are non-regular or parameter counting is ambiguous, the standard \((k/2)\log n\) penalty may fail; consider refined MDL or NML if feasible. Finally, if computing maximum likelihood is intractable, the practicality of MDL diminishes unless good approximations are available.

⚠️Common Mistakes

• Mixing units: using base-2 (bits) for one term and natural logs (nats) for another. Always use a single log base; convert with a factor of \(1/\ln 2\) if needed.
• Double-penalizing parameters: adding both a hand-crafted regularizer (like L2) and an MDL/BIC penalty without accounting for their combined effect. Either interpret the regularizer as part of the code or compare models consistently without extra ad hoc penalties.
• Ignoring the parameter precision in two-part codes: encoding real-valued parameters requires specifying precision; omitting this underestimates \(L(M)\). Use asymptotic penalties or well-defined priors/quantization rules.
• Comparing non-comparable models: MDL comparisons must use the same data coding assumptions. Don’t compare code lengths from different likelihood families or different log bases.
• Over-trusting asymptotics in tiny samples: the \((k/2)\log n\) penalty is asymptotic; for very small \(n\), constants matter. Consider cross-validation or exact universal codes if possible.
• Counting parameters incorrectly: in constrained models (e.g., multinomial with probabilities summing to 1), the effective parameter count is \(k-1\), not \(k\). Similarly, include variance parameters when applicable.
• Using raw RSS only: MDL requires the full negative log-likelihood (including the \(\log \hat{\sigma}^2\) term), not just RSS, to be comparable across models.

Key Formulas

Total Description Length

\(L(M, D) = L(M) + L(D \mid M)\)

Explanation: The total number of bits to transmit the model and then the data given that model. MDL selects the model minimizing this sum.

Asymptotic MDL/BIC

\(L(D, M) = -\log p(D \mid \hat{\theta}_M) + \frac{k}{2} \log n + O(1)\)

Explanation: For regular models with k parameters and n samples, the optimal description length equals the negative log-likelihood at the MLE plus a complexity penalty. This is the common practical MDL approximation.

Gaussian Negative Log-Likelihood at MLE

\(-\log p(D \mid \hat{\theta}) = \frac{n}{2}\left(1 + \log(2\pi \hat{\sigma}^2)\right)\)

Explanation: In Gaussian regression, with variance estimated by \(\hat{\sigma}^2 = \mathrm{RSS}/n\), the negative log-likelihood reduces to this closed form. It provides the data code length under Gaussian noise.

MLE Variance Estimate

\(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{\mathrm{RSS}}{n}\)

Explanation: The maximum likelihood estimate of the noise variance equals the mean squared residual. Plugging this into the Gaussian likelihood yields the code length used by MDL.

BIC (base-e)

\(\mathrm{BIC} = -2 \log \hat{L} + k \log n\)

Explanation: The Bayesian Information Criterion is equivalent to twice the MDL objective up to constants. Minimizing BIC is equivalent to minimizing MDL with natural logs.

Empirical Entropy of Histogram

\(H(\hat{p}) = -\sum_{j=1}^{k} \frac{c_j}{n} \log \frac{c_j}{n}\)

Explanation: For histogram bin counts \(c_j\), the empirical entropy gives the per-sample code length when encoding samples with the MLE bin probabilities. Multiply by \(n\) for the total data code length.

MDL for Histogram (BIC form)

\(L_{\mathrm{hist}}(D, k) = -\sum_{j=1}^{k} c_j \log \frac{c_j}{n} + \frac{k-1}{2} \log n\)

Explanation: The total description length for a k-bin multinomial model equals the negative log-likelihood plus a penalty for the k−1 free parameters. This balances fit against over-partitioning.

Kraft's Inequality

\(\sum_x 2^{-L(x)} \le 1\)

Explanation: Any decodable prefix code must satisfy this inequality. It underpins the interpretation of −log probabilities as code lengths.

Nats-to-Bits Conversion

\(L_{\mathrm{bits}} = \frac{L_{\mathrm{nats}}}{\ln 2}\)

Explanation: If you compute lengths with natural logs (nats), divide by \(\ln 2\) to convert to bits. Keep units consistent when comparing models.

Complexity Analysis

Computationally, MDL-based selection typically requires: (1) fitting each candidate model (to compute the maximum likelihood or residual-related terms), and (2) evaluating a closed-form penalty (e.g., (k/2) log n). The primary cost is fitting models across a grid of hyperparameters. For polynomial regression, using normal equations costs O(n d^2 + d^3) per degree d (building moment matrices and solving a k×k system, k = d + 1). Summing over all degrees up to d_max gives O(n d_max^3 + d_max^4) for a naive scan. Memory is O(d_max^2) for the normal matrix. More numerically stable solvers (QR) have similar asymptotic cost but larger constants. For histogram bin selection with k_max bins, you can compute counts and code lengths for each k in O(n) by a single pass per k, so total time is O(n k_max) and memory is O(k_max). In many MDL applications, models are nested; you can often reuse intermediate computations (e.g., polynomial sums of powers) to reduce the cost of building the statistics from O(n d_max^2) to O(n d_max) by caching sufficient statistics. In probabilistic clustering or Markov models, each candidate k or order r requires iterative optimization (e.g., EM), making the MDL search cost O(T_fit(n, k)) per candidate; pruning or warm starts can cut runtime. The penalty evaluation itself is negligible (O(1)). In summary, MDL’s computational burden is dominated by the underlying model fitting; the MDL component adds only lightweight calculations, so careful algorithmic engineering (caching, incremental updates, stable solvers) determines overall performance.

Code Examples

Select polynomial degree with MDL (Gaussian noise, BIC penalty)
#include <bits/stdc++.h>
using namespace std;

// Solve linear system A x = b via Gauss-Jordan elimination with partial pivoting
vector<double> gaussianSolve(vector<vector<double>> A, vector<double> b) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) A[i].push_back(b[i]); // augment
    for (int col = 0; col < n; ++col) {
        // Pivot: pick the row with the largest entry in this column
        int piv = col;
        for (int r = col + 1; r < n; ++r)
            if (fabs(A[r][col]) > fabs(A[piv][col])) piv = r;
        if (fabs(A[piv][col]) < 1e-12) {
            // Singular; return zeros as a fallback
            return vector<double>(n, 0.0);
        }
        swap(A[piv], A[col]);
        // Normalize the pivot row
        double diag = A[col][col];
        for (int j = col; j <= n; ++j) A[col][j] /= diag;
        // Eliminate the pivot column from all other rows
        for (int r = 0; r < n; ++r) if (r != col) {
            double factor = A[r][col];
            if (fabs(factor) < 1e-18) continue;
            for (int j = col; j <= n; ++j) A[r][j] -= factor * A[col][j];
        }
    }
    vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = A[i][n];
    return x;
}

// Fit polynomial of degree d to (x, y) via normal equations; return coefficients c[0..d]
vector<double> fitPolynomial(const vector<double>& x, const vector<double>& y, int d) {
    int n = (int)x.size();
    int k = d + 1;
    // Precompute power sums S_m = sum_i x_i^m for m = 0..2d
    vector<double> S(2*d + 1, 0.0);
    for (int i = 0; i < n; ++i) {
        double xp = 1.0;
        for (int m = 0; m <= 2*d; ++m) {
            S[m] += xp;
            xp *= x[i];
        }
    }
    // Build normal matrix G and right-hand side h
    vector<vector<double>> G(k, vector<double>(k, 0.0));
    vector<double> h(k, 0.0);
    for (int r = 0; r < k; ++r) {
        for (int c = 0; c < k; ++c) {
            G[r][c] = S[r + c]; // sum x^(r+c)
        }
        for (int i = 0; i < n; ++i) {
            double xp = 1.0;
            for (int t = 0; t < r; ++t) xp *= x[i];
            h[r] += xp * y[i];
        }
    }
    return gaussianSolve(G, h);
}

// Compute MDL (in nats) for polynomial of degree d using Gaussian likelihood and BIC penalty:
// MDL = -log L(theta_hat) + (k/2) log n. The noise variance is profiled out via
// sigma^2_hat = RSS/n, so the penalty below counts only the d + 1 mean coefficients;
// some treatments add one more parameter for the variance.
double mdlPolynomial(const vector<double>& x, const vector<double>& y, int d) {
    int n = (int)x.size();
    vector<double> c = fitPolynomial(x, y, d);
    // Compute predictions and the residual sum of squares
    double RSS = 0.0;
    for (int i = 0; i < n; ++i) {
        double xp = 1.0, yi_hat = 0.0;
        for (int j = 0; j <= d; ++j) {
            yi_hat += c[j] * xp;
            xp *= x[i];
        }
        double r = y[i] - yi_hat;
        RSS += r * r;
    }
    double sigma2_hat = RSS / max(1, n);
    if (sigma2_hat <= 1e-30) sigma2_hat = 1e-30; // numerical guard
    // Negative log-likelihood at the MLE for Gaussian noise
    double nll = 0.5 * n * (1.0 + log(2.0 * M_PI * sigma2_hat));
    int k_params = d + 1; // free parameters of the mean model
    double penalty = 0.5 * k_params * log((double)max(1, n));
    return nll + penalty;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate synthetic data: y = 2 - 0.5 x + 0.3 x^2 + noise
    int n = 200;
    vector<double> x(n), y(n);
    mt19937 rng(42);
    normal_distribution<double> noise(0.0, 0.5);
    for (int i = 0; i < n; ++i) {
        x[i] = -3.0 + 6.0 * (double)i / (n - 1);
        double yi = 2.0 - 0.5 * x[i] + 0.3 * x[i] * x[i];
        y[i] = yi + noise(rng);
    }

    int best_d = 0;
    double best_mdl = numeric_limits<double>::infinity();
    int d_max = 10;
    for (int d = 0; d <= d_max; ++d) {
        double mdl = mdlPolynomial(x, y, d);
        cout << "degree=" << d << " MDL(nats)=" << mdl << "\n";
        if (mdl < best_mdl) { best_mdl = mdl; best_d = d; }
    }
    cout << "Best degree by MDL: " << best_d << "\n";
    cout << "(To convert nats to bits, divide by log(2).)\n";
    return 0;
}

This program selects the polynomial degree that minimizes the MDL objective using the Gaussian likelihood and the BIC penalty. It fits each degree via normal equations, computes the residual sum of squares to get the MLE variance, evaluates the negative log-likelihood, and adds (k/2) log n with k = d + 1. The degree with the smallest MDL (in nats) is chosen. The code uses cached power sums to build the normal matrix efficiently and a simple Gaussian elimination solver.

Time: O(n d_max^3 + d_max^4) over all degrees (O(n d^2 + d^3) per degree); Space: O(d_max^2)
Select histogram bin count with MDL (multinomial likelihood + BIC penalty)
#include <bits/stdc++.h>
using namespace std;

// Compute MDL in nats for a k-bin histogram over data in [minV, maxV]:
// MDL = -sum_j c_j log(c_j / n) + ((k-1)/2) log n
// Assumes equal-width bins and MLE probabilities p_j = c_j / n.
double mdlHistogram(const vector<double>& data, int k, double minV, double maxV) {
    int n = (int)data.size();
    if (k <= 0) return numeric_limits<double>::infinity();
    vector<int> cnt(k, 0);
    double width = (maxV - minV) / k;
    if (width <= 0) return numeric_limits<double>::infinity();
    for (double v : data) {
        int b = (int)floor((v - minV) / width);
        if (b < 0) b = 0;
        if (b >= k) b = k - 1; // include right endpoint
        cnt[b]++;
    }
    // Negative log-likelihood at the MLE for a multinomial: -sum_j c_j log(c_j / n)
    double nll = 0.0;
    for (int c : cnt) {
        if (c > 0) nll += -(double)c * (log((double)c) - log((double)n)); // -c log(c/n)
    }
    // Penalty for the (k-1) free bin probabilities
    double penalty = 0.5 * max(0, k - 1) * log((double)max(1, n));
    return nll + penalty;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate data: mixture of two Gaussians on [0, 1]
    int n = 5000;
    mt19937 rng(123);
    normal_distribution<double> g1(0.3, 0.05), g2(0.7, 0.05);
    uniform_real_distribution<double> U(0.0, 1.0);
    vector<double> data; data.reserve(n);
    for (int i = 0; i < n; ++i) {
        double v = (U(rng) < 0.5) ? g1(rng) : g2(rng);
        v = min(1.0, max(0.0, v));
        data.push_back(v);
    }

    double minV = 0.0, maxV = 1.0;
    int k_best = 1; double best_mdl = numeric_limits<double>::infinity();
    int k_max = 100;
    for (int k = 1; k <= k_max; ++k) {
        double mdl = mdlHistogram(data, k, minV, maxV);
        if (mdl < best_mdl) { best_mdl = mdl; k_best = k; }
        if (k % 10 == 0) {
            cout << "k=" << k << " MDL(nats)=" << mdl << "\n";
        }
    }
    cout << "Best bin count by MDL: " << k_best << "\n";
    cout << "(Equal-width bins; convert to bits by dividing by log(2).)\n";
    return 0;
}

This program chooses the number of equal-width histogram bins by minimizing the MDL objective for a multinomial model. For each k, it computes counts, evaluates the negative log-likelihood at the MLE p_j = c_j/n, and adds the (k−1)/2 log n penalty. The selected k balances fidelity (lower empirical entropy) and complexity (more parameters).

Time: O(n k_max); Space: O(k_max)
#minimum description length#mdl#bic#model selection#information theory#compression#negative log-likelihood#universal coding#histogram binning#polynomial regression#overfitting#entropy#kolmogorov complexity#stochastic complexity#normalized maximum likelihood