Maximum Likelihood Estimation (MLE)
Key Points
- Maximum Likelihood Estimation (MLE) chooses parameters that make the observed data most probable under a chosen model.
- Because multiplying many small probabilities can underflow, we usually maximize the log-likelihood, which turns products into sums.
- For common models, MLE has closed-form solutions (e.g., Bernoulli p equals the sample mean; Gaussian mean and variance are the sample mean and average squared deviation).
- When no closed form exists (e.g., logistic regression), we use numerical optimization like gradient ascent or Newton’s method.
- Under mild conditions, MLEs are consistent and asymptotically normal with variance given by the inverse Fisher information.
- Independence assumptions matter: the likelihood usually factors as a product only when data are i.i.d.
- MLE is not the same as Bayesian MAP; MLE ignores priors and uses only the likelihood.
- In practice, use the log-likelihood, check gradients/Hessians, and watch for constraints (e.g., probabilities in [0,1], variances > 0).
Prerequisites
- Basic probability (PMF/PDF, independence) — MLE is defined via likelihoods p(D | θ); understanding independence is key to factorization.
- Logarithms and exponentials — Log-likelihood uses log rules to turn products into sums and to avoid underflow.
- Differential calculus — Setting gradients to zero and using Hessians are central to solving MLE problems.
- Linear algebra — Vector/matrix notation for multivariate parameters and second-order methods uses linear algebra.
- Numerical optimization — Many MLE problems require iterative techniques like gradient ascent or Newton's method.
- Statistics fundamentals — Concepts like bias, variance, and consistency help interpret MLE properties.
- Programming with C++ — Implementing MLE (closed-form or iterative) needs arrays, loops, and numerical care.
Detailed Explanation
01 Overview
Maximum Likelihood Estimation (MLE) is a general strategy for fitting the parameters of a statistical model to data. The idea is simple: choose the parameter values that make the observed data most probable according to your model. Formally, given data D and a parametric model p(D | θ), the MLE θ̂ maximizes the likelihood function L(θ; D) = p(D | θ). Because likelihoods for independent observations multiply, it is numerically convenient to work with the log-likelihood, which turns products into sums and avoids floating-point underflow.
MLE is widely used across fields: in coin-flipping experiments (Bernoulli), in modeling measurement noise (Gaussian), in classification problems (logistic regression), and in complex latent-variable models with algorithms like EM. For many common distributions, MLE has neat closed-form solutions; for more complex models, we rely on iterative numerical optimization methods. MLE enjoys powerful theoretical guarantees: under regularity conditions, it is consistent (converges to the true parameter as sample size grows) and asymptotically efficient (achieves the Cramér–Rao lower bound).
02 Intuition & Analogies
Imagine you are a detective evaluating different stories about how your data were generated. Each story is a set of parameter values θ for your chosen model. You ask: if that story were true, how likely is it that I would have seen exactly this data? The best story, by MLE, is the one that makes your data the least surprising.
A real-world analogy: suppose you’re tuning a virtual coin in a game so that it behaves like a coin you observed in real life. You try different bias settings p, simulate flips, and ask which p makes your observed sequence of heads and tails most plausible. The p that best explains your observations—without invoking any prior beliefs about p—is the MLE.
Why take logs? If you keep multiplying many small probabilities, the number quickly underflows to zero on a computer. Taking logs converts multiplication into addition, which is stable and easier to differentiate for optimization. This also reveals structure: the log-likelihood for independent data is a sum over data points, so each observation contributes additively. For simple models, setting the derivative (the score) to zero gives closed-form solutions. For more complex models like logistic regression, the log-likelihood is concave but not solvable in closed form; we climb the hill using gradient-based methods until the increase in log-likelihood stalls.
03 Formal Definition
Let D = {x_1, …, x_n} be observed data and {p(· | θ) : θ ∈ Θ} a parametric family of distributions. The likelihood is L(θ; D) = p(D | θ), viewed as a function of θ with the data fixed, and the maximum likelihood estimator is
θ̂ = argmax_{θ ∈ Θ} L(θ; D) = argmax_{θ ∈ Θ} log L(θ; D).
For i.i.d. observations the likelihood factors as L(θ; D) = ∏ p(x_i | θ), so the log-likelihood is the sum ℓ(θ) = Σ log p(x_i | θ). Because the logarithm is strictly increasing, maximizing ℓ(θ) is equivalent to maximizing L(θ; D).
04 When to Use
Use MLE whenever you have a parametric probabilistic model for data and you want parameter estimates grounded purely in how well the model explains the observed data, without priors. It is ideal when:
- You have i.i.d. data and a well-specified likelihood (e.g., Bernoulli for binary outcomes, Gaussian for measurement errors, Poisson for counts).
- You need point estimates that are statistically efficient with large samples, leveraging asymptotic normality and Fisher information for uncertainty quantification.
- You are fitting generalized linear models (GLMs) like logistic or Poisson regression, where the log-likelihood is concave and can be optimized reliably.
- You can exploit closed-form solutions (fast and exact) or use robust numerical optimization (gradient ascent/Newton) for models without closed forms.
Avoid or be cautious when the model is severely misspecified, data are dependent in complex ways (likelihood does not factor), the likelihood is multimodal (risk of local maxima), or when sample sizes are small and finite-sample bias matters. In such cases, consider penalized likelihoods, Bayesian methods with priors, or resampling techniques to assess uncertainty.
⚠️ Common Mistakes
- Confusing MLE with MAP: MLE maximizes p(D | θ) only; MAP maximizes p(θ | D) ∝ p(D | θ)p(θ). If you include a prior or penalty, it is no longer pure MLE.
- Ignoring the log: Working with raw likelihoods causes numerical underflow and unstable optimization. Always maximize the log-likelihood.
- Misapplying independence: Writing L(θ) = ∏ p(x_i | θ) assumes independence given θ. If data are dependent (time series, spatial), the correct likelihood must encode dependence.
- Wrong variance formula: The MLE for Gaussian variance uses 1/n, not 1/(n−1). The latter is the unbiased sample variance, not the MLE.
- Violating constraints: Parameters like probabilities (0 ≤ p ≤ 1) and variances (σ^2 > 0) must respect boundaries. Optimization should enforce them or reparameterize (e.g., log σ^2).
- Stopping too early or using poor step sizes: In numerical MLE, an aggressive learning rate can diverge; one that is too small wastes time. Use line search or adaptive schedules and monitor the log-likelihood increase.
- Overfitting and separability: In logistic regression with perfectly separable data, the MLE does not exist (coefficients diverge). Use regularization or modify the model.
- Misinterpreting asymptotics: Confidence intervals from Fisher information are large-sample approximations; with small n or model misspecification, they can be misleading.
Key Formulas
MLE Definition
θ̂ = argmax_θ L(θ; D) = argmax_θ p(D | θ)
Explanation: The maximum likelihood estimator is the parameter value that maximizes the probability (or density) of the observed data under the model. It depends on the chosen parametric family.
Likelihood and Log-likelihood for i.i.d. Data
L(θ) = ∏ p(x_i | θ),  ℓ(θ) = log L(θ) = Σ log p(x_i | θ)
Explanation: For independent data, the likelihood factors as a product over observations; taking logs turns the product into a sum. Maximizing the log-likelihood is equivalent to maximizing the likelihood.
Score Equation
∇_θ ℓ(θ̂) = 0
Explanation: At an interior maximum of a differentiable log-likelihood, the gradient (score) is zero. Solving this equation yields candidate MLEs.
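As a worked instance of the score equation (assuming Bernoulli data with k successes in n trials), setting the derivative to zero recovers the closed-form estimate:

```latex
\ell(p) = k \log p + (n-k)\log(1-p), \qquad
\ell'(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0
\;\Rightarrow\; k(1-p) = (n-k)\,p
\;\Rightarrow\; \hat{p} = \frac{k}{n}.
```

The second derivative, −k/p² − (n−k)/(1−p)², is negative on (0, 1), confirming this stationary point is a maximum.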
Observed and Expected Information
J(θ) = −∇²_θ ℓ(θ; D),  I(θ) = E[J(θ)]
Explanation: The observed information is the negative Hessian of the log-likelihood at the data; its expectation is the Fisher information. These quantify curvature and inform variance estimates.
Asymptotic Normality of MLE
√n (θ̂_n − θ₀) → N(0, I(θ₀)⁻¹) in distribution
Explanation: As sample size grows, the distribution of the MLE approaches a normal centered at the true parameter with covariance given by the inverse Fisher information scaled by 1/n.
Bernoulli MLE
p̂ = (1/n) Σ x_i = k/n, where k is the number of successes
Explanation: For Bernoulli trials, the MLE of the success probability is the sample mean of the outcomes. This matches intuition as the fraction of successes.
Gaussian (Normal) MLE
μ̂ = (1/n) Σ x_i,  σ̂² = (1/n) Σ (x_i − μ̂)²
Explanation: For Normal data with unknown mean and variance, the MLEs are the sample mean and the average squared deviation using 1/n (not 1/(n−1)).
Logistic Regression Log-likelihood
ℓ(w) = Σ [ y_i log σ(wᵀx_i) + (1 − y_i) log(1 − σ(wᵀx_i)) ],  where σ(z) = 1/(1 + e^(−z))
Explanation: The logistic log-likelihood sums the log-probabilities assigned to the binary labels under the sigmoid link. It is concave in w, enabling reliable optimization.
Logistic Regression Gradient and Hessian
∇ℓ(w) = Σ x_i (y_i − σ(wᵀx_i)),  ∇²ℓ(w) = −Σ σ(wᵀx_i)(1 − σ(wᵀx_i)) x_i x_iᵀ
Explanation: The gradient is the sum of feature vectors weighted by residuals; the Hessian is negative semidefinite, confirming concavity. These enable gradient or Newton methods.
Newton–Raphson Update
w ← w − [∇²ℓ(w)]⁻¹ ∇ℓ(w)
Explanation: A second-order update using the inverse Hessian times the gradient can converge rapidly near the optimum. In logistic regression, this corresponds to IRLS (iteratively reweighted least squares).
Complexity Analysis
Closed-form MLEs (Bernoulli, Gaussian) require a single pass over the data: O(n) time and O(1) extra space. For logistic regression, each log-likelihood and gradient evaluation costs O(n·d) for n samples and d features, so gradient ascent costs O(n·d) per iteration (times the number of backtracking trials). Newton's method adds Hessian assembly and a linear solve for O(n·d² + d³) per iteration, but typically needs far fewer iterations.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute the log-likelihood of Bernoulli data given p.
// Data x contains 0/1 values.
double bernoulli_log_likelihood(const vector<int>& x, double p) {
    // Guard against invalid p
    if (p <= 0.0 || p >= 1.0) {
        return -numeric_limits<double>::infinity();
    }
    long long n = (long long)x.size();
    long long k = 0;
    for (int xi : x) k += xi;
    // ℓ(p) = k log p + (n-k) log(1-p)
    return k * log(p) + (n - k) * log(1.0 - p);
}

// Closed-form MLE: p_hat = mean(x)
double bernoulli_mle(const vector<int>& x) {
    long long n = (long long)x.size();
    long long k = 0;
    for (int xi : x) k += xi;
    return (n == 0) ? 0.5 : (double)k / (double)n; // default to 0.5 if empty
}

int main() {
    // Generate synthetic Bernoulli data with true p = 0.7
    const int n = 1000;
    const double p_true = 0.7;

    std::mt19937 rng(42);
    std::bernoulli_distribution bern(p_true);

    vector<int> x(n);
    for (int i = 0; i < n; ++i) x[i] = bern(rng) ? 1 : 0;

    double p_hat = bernoulli_mle(x);

    // Evaluate the log-likelihood at p_true and p_hat.
    double ll_true = bernoulli_log_likelihood(x, p_true);
    // Handle the boundary: if p_hat is 0 or 1, nudge slightly to avoid log(0)
    double p_eval = min(1.0 - 1e-12, max(1e-12, p_hat));
    double ll_hat = bernoulli_log_likelihood(x, p_eval);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "n = " << n << "\n";
    cout << "True p = " << p_true << ", MLE p_hat = " << p_hat << "\n";
    cout << "Log-likelihood at true p: " << ll_true << "\n";
    cout << "Log-likelihood at MLE p:  " << ll_hat << "\n";
    return 0;
}
```
This program estimates the Bernoulli success probability using the MLE p̂ = mean of 0/1 outcomes. It also computes the log-likelihood at the true p and the estimated p̂ to illustrate that the MLE maximizes the log-likelihood. Logarithms avoid underflow and make the objective smooth for optimization.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct GaussianMLE {
    double mu;
    double sigma2; // variance
};

GaussianMLE gaussian_mle(const vector<double>& x) {
    int n = (int)x.size();
    if (n == 0) return {0.0, 1.0};
    // Compute the mean
    double sum = 0.0;
    for (double v : x) sum += v;
    double mu = sum / n;
    // Compute the MLE variance with 1/n (biased), not 1/(n-1)
    double sse = 0.0;
    for (double v : x) {
        double d = v - mu;
        sse += d * d;
    }
    double sigma2 = max(1e-30, sse / n); // enforce positivity
    return {mu, sigma2};
}

// Log-likelihood for Normal with unknown mean and variance:
// ℓ(μ, σ^2) = -n/2 * log(2πσ^2) - (1/(2σ^2)) * Σ (x_i - μ)^2
double gaussian_log_likelihood(const vector<double>& x, double mu, double sigma2) {
    if (sigma2 <= 0.0) return -numeric_limits<double>::infinity();
    int n = (int)x.size();
    double sse = 0.0;
    for (double v : x) {
        double d = v - mu;
        sse += d * d;
    }
    return -0.5 * n * log(2.0 * M_PI * sigma2) - 0.5 * sse / sigma2;
}

int main() {
    // Generate synthetic Normal data with μ = 2.0, σ^2 = 1.5
    const int n = 2000;
    const double mu_true = 2.0;
    const double sigma2_true = 1.5;

    std::mt19937 rng(123);
    std::normal_distribution<double> norm(mu_true, sqrt(sigma2_true));

    vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = norm(rng);

    GaussianMLE est = gaussian_mle(x);

    double ll_true = gaussian_log_likelihood(x, mu_true, sigma2_true);
    double ll_hat = gaussian_log_likelihood(x, est.mu, est.sigma2);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "n = " << n << "\n";
    cout << "True mu = " << mu_true << ", True sigma2 = " << sigma2_true << "\n";
    cout << "MLE mu = " << est.mu << ", MLE sigma2 = " << est.sigma2 << "\n";
    cout << "Log-likelihood at true params: " << ll_true << "\n";
    cout << "Log-likelihood at MLE params:  " << ll_hat << "\n";
    return 0;
}
```
This program computes the MLEs for a univariate Normal distribution: the sample mean and the average squared deviation using 1/n. It also evaluates the log-likelihood at the true parameters and at the MLE, illustrating that the MLE gives the highest log-likelihood. The code enforces σ^2 > 0 for numerical stability.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Sigmoid with numerical stability
double sigmoid(double z) {
    if (z >= 0) {
        double ez = exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = exp(z);
        return ez / (1.0 + ez);
    }
}

// Compute the log-likelihood and gradient for logistic regression.
// X: n x d design matrix (row-major as vector<vector<double>>)
// y: n binary labels {0,1}
// w: d parameters (including intercept if present)
double logistic_loglik_and_grad(const vector<vector<double>>& X,
                                const vector<int>& y,
                                const vector<double>& w,
                                vector<double>& grad) {
    int n = (int)X.size();
    int d = (int)w.size();
    fill(grad.begin(), grad.end(), 0.0);
    double ll = 0.0;
    for (int i = 0; i < n; ++i) {
        // z = w^T x_i
        double z = 0.0;
        for (int j = 0; j < d; ++j) z += w[j] * X[i][j];
        double p = sigmoid(z);
        // Log-likelihood contribution: y log p + (1-y) log(1-p),
        // with a guard against log(0)
        double eps = 1e-12;
        p = min(1.0 - eps, max(eps, p));
        ll += y[i] * log(p) + (1 - y[i]) * log(1.0 - p);
        // Gradient contribution: x_i * (y - p)
        double r = (double)y[i] - p;
        for (int j = 0; j < d; ++j) grad[j] += X[i][j] * r;
    }
    return ll;
}

int main() {
    // Generate synthetic linearly separable-ish data, but with noise
    int n = 2000; // samples
    int d = 3;    // features including intercept

    std::mt19937 rng(7);
    std::normal_distribution<double> N01(0.0, 1.0);

    // True weights (including intercept)
    vector<double> w_true = { -0.5, 2.0, -1.0 };

    // Build dataset X (with intercept column 1.0) and labels y
    vector<vector<double>> X(n, vector<double>(d, 1.0));
    vector<int> y(n);
    for (int i = 0; i < n; ++i) {
        double x1 = N01(rng);
        double x2 = N01(rng) + 0.5; // small shift
        X[i][0] = 1.0; // intercept
        X[i][1] = x1;
        X[i][2] = x2;
        double z = w_true[0] * X[i][0] + w_true[1] * X[i][1] + w_true[2] * X[i][2];
        double p = sigmoid(z);
        std::bernoulli_distribution Bern(p);
        y[i] = Bern(rng) ? 1 : 0;
    }

    // Initialize parameters to zeros
    vector<double> w(d, 0.0);

    // Gradient ascent with backtracking line search
    double ll = -numeric_limits<double>::infinity();
    vector<double> grad(d, 0.0);

    int max_iters = 500;
    for (int t = 0; t < max_iters; ++t) {
        ll = logistic_loglik_and_grad(X, y, w, grad);
        if (t % 20 == 0) {
            cout << "Iter " << t << ": log-likelihood = " << ll << "\n";
        }
        // Backtracking to ensure improvement
        double step = 1.0; // initial step size
        double c = 1e-4;   // sufficient increase parameter
        double ll_new;
        vector<double> w_new(d);
        // Try decreasing step sizes until sufficient increase
        for (int bt = 0; bt < 20; ++bt) {
            for (int j = 0; j < d; ++j) w_new[j] = w[j] + step * grad[j];
            vector<double> tmpg(d, 0.0);
            ll_new = logistic_loglik_and_grad(X, y, w_new, tmpg);
            if (ll_new >= ll + c * step * inner_product(grad.begin(), grad.end(), grad.begin(), 0.0)) {
                break; // sufficient increase achieved
            }
            step *= 0.5;
        }
        // Update
        w = w_new;
        // Stop on small relative improvement
        if (t > 5 && fabs(ll_new - ll) < 1e-6 * (1.0 + fabs(ll))) {
            cout << "Converged at iteration " << t << "\n";
            break;
        }
    }

    // Evaluate training accuracy
    int correct = 0;
    for (int i = 0; i < n; ++i) {
        double z = 0.0;
        for (int j = 0; j < d; ++j) z += w[j] * X[i][j];
        int yhat = sigmoid(z) >= 0.5 ? 1 : 0;
        correct += (yhat == y[i]);
    }

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Estimated weights: ";
    for (double wi : w) cout << wi << ' ';
    cout << "\nTraining accuracy: " << (100.0 * correct / n) << "%\n";
    return 0;
}
```
This program fits a logistic regression model by maximizing the log-likelihood using gradient ascent with a simple backtracking line search. Each iteration computes the gradient in O(n d), proposes a step, and accepts it if the log-likelihood improves sufficiently. The sigmoid is implemented in a numerically stable way, and an intercept is included as the first feature.