
Maximum Likelihood Estimation (MLE)

Key Points

  • Maximum Likelihood Estimation (MLE) chooses parameters that make the observed data most probable under a chosen model.
  • Because multiplying many small probabilities can underflow, we usually maximize the log-likelihood, which turns products into sums.
  • For common models, MLE has closed-form solutions (e.g., Bernoulli p equals the sample mean; Gaussian mean and variance are the sample mean and average squared deviation).
  • When no closed form exists (e.g., logistic regression), we use numerical optimization such as gradient ascent or Newton's method.
  • Under mild conditions, MLEs are consistent and asymptotically normal with variance given by the inverse Fisher information.
  • Independence assumptions matter: the likelihood factors as a product only when data are i.i.d.
  • MLE is not the same as Bayesian MAP; MLE ignores priors and uses only the likelihood.
  • In practice, use the log-likelihood, check gradients/Hessians, and watch for constraints (e.g., probabilities in [0, 1], variances > 0).

Prerequisites

  • Basic probability (PMF/PDF, independence) — MLE is defined via likelihoods p(D | θ); understanding independence is key to factorization.
  • Logarithms and exponentials — The log-likelihood uses log rules to turn products into sums and to avoid underflow.
  • Differential calculus — Setting gradients to zero and using Hessians are central to solving MLE problems.
  • Linear algebra — Vector/matrix notation for multivariate parameters and second-order methods relies on linear algebra.
  • Numerical optimization — Many MLE problems require iterative techniques such as gradient ascent or Newton's method.
  • Statistics fundamentals — Concepts like bias, variance, and consistency help interpret MLE properties.
  • Programming with C++ — Implementing MLE (closed-form or iterative) needs arrays, loops, and numerical care.

Detailed Explanation


01Overview

Maximum Likelihood Estimation (MLE) is a general strategy for fitting the parameters of a statistical model to data. The idea is simple: choose the parameter values that make the observed data most probable according to your model. Formally, given data D and a parametric model p(D | θ), the MLE θ̂ maximizes the likelihood function L(θ; D) = p(D | θ). Because likelihoods for independent observations multiply, it is numerically convenient to work with the log-likelihood, which turns products into sums and avoids floating-point underflow.

MLE is widely used across fields: in coin-flipping experiments (Bernoulli), in modeling measurement noise (Gaussian), in classification problems (logistic regression), and in complex latent-variable models with algorithms like EM. For many common distributions, MLE has neat closed-form solutions; for more complex models, we rely on iterative numerical optimization methods. MLE enjoys powerful theoretical guarantees: under regularity conditions, it is consistent (converges to the true parameter as sample size grows) and asymptotically efficient (achieves the Cramér–Rao lower bound).

02Intuition & Analogies

Imagine you are a detective evaluating different stories about how your data were generated. Each story is a set of parameter values θ for your chosen model. You ask: if that story were true, how likely is it that I would have seen exactly this data? The best story, by MLE, is the one that makes your data the least surprising.

A real-world analogy: suppose you’re tuning a virtual coin in a game so that it behaves like a coin you observed in real life. You try different bias settings p, simulate flips, and ask which p makes your observed sequence of heads and tails most plausible. The p that best explains your observations—without invoking any prior beliefs about p—is the MLE.

Why take logs? If you keep multiplying many small probabilities, the number quickly underflows to zero on a computer. Taking logs converts multiplication into addition, which is stable and easier to differentiate for optimization. This also reveals structure: the log-likelihood for independent data is a sum over data points, so each observation contributes additively. For simple models, setting the derivative (the score) to zero gives closed-form solutions. For more complex models like logistic regression, the log-likelihood is concave but not solvable in closed form; we climb the hill using gradient-based methods until the increase in log-likelihood stalls.

03Formal Definition

Given data D = {x₁, x₂, …, x_n} and a parametric family of densities or mass functions p(x | θ), the likelihood function is L(θ; D) = p(D | θ). For i.i.d. data, L(θ; D) = ∏_{i=1}^{n} p(x_i | θ). The Maximum Likelihood Estimator (MLE) is defined as θ̂_MLE = argmax_{θ ∈ Θ} L(θ; D) = argmax_{θ ∈ Θ} ℓ(θ; D), where ℓ(θ; D) = log L(θ; D) is the log-likelihood. Under regularity conditions, a stationary point satisfies the score equation ∇_θ ℓ(θ; D) = 0, and the observed information is J(θ) = −∇²_θ ℓ(θ; D), whose expectation I(θ) = E[J(θ)] is the Fisher information. Asymptotically, √n (θ̂ − θ*) →ᵈ N(0, I(θ*)⁻¹). Examples: Bernoulli(p): p̂ = (1/n) ∑ x_i. Univariate Normal(μ, σ²): μ̂ = x̄, σ̂² = (1/n) ∑ (x_i − x̄)². Logistic regression: maximize ∑_i [y_i log σ(β⊤x_i) + (1 − y_i) log(1 − σ(β⊤x_i))] over β, where σ(t) = (1 + e^{−t})⁻¹.
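As a worked instance of the score equation, here is the Bernoulli derivation sketched end to end (k denotes the number of successes in n trials):

```latex
\ell(p) = k \log p + (n-k)\log(1-p), \qquad k = \sum_{i=1}^{n} x_i
\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0
\;\Longrightarrow\; k(1-p) = (n-k)\,p
\;\Longrightarrow\; \hat{p} = \frac{k}{n}
```

The second derivative, −k/p² − (n−k)/(1−p)², is negative for 0 < p < 1, so this stationary point is indeed a maximum.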

04When to Use

Use MLE whenever you have a parametric probabilistic model for data and you want parameter estimates grounded purely in how well the model explains the observed data, without priors. It is ideal when:

  • You have i.i.d. data and a well-specified likelihood (e.g., Bernoulli for binary outcomes, Gaussian for measurement errors, Poisson for counts).
  • You need point estimates that are statistically efficient with large samples, leveraging asymptotic normality and Fisher information for uncertainty quantification.
  • You are fitting generalized linear models (GLMs) like logistic or Poisson regression, where the log-likelihood is concave and can be optimized reliably.
  • You can exploit closed-form solutions (fast and exact) or use robust numerical optimization (gradient ascent/Newton) for models without closed forms.

Avoid or be cautious when the model is severely misspecified, data are dependent in complex ways (likelihood does not factor), the likelihood is multimodal (risk of local maxima), or when sample sizes are small and finite-sample bias matters. In such cases, consider penalized likelihoods, Bayesian methods with priors, or resampling techniques to assess uncertainty.

⚠️Common Mistakes

  • Confusing MLE with MAP: MLE maximizes p(D | θ) only; MAP maximizes p(θ | D) ∝ p(D | θ)p(θ). If you include a prior or penalty, it is no longer pure MLE.
  • Ignoring the log: working with raw likelihoods causes numerical underflow and unstable optimization. Always maximize the log-likelihood.
  • Misapplying independence: writing L(θ) = ∏ p(x_i | θ) assumes independence given θ. If data are dependent (time series, spatial), the correct likelihood must encode that dependence.
  • Wrong variance formula: the MLE for Gaussian variance uses 1/n, not 1/(n−1). The latter is the unbiased sample variance, not the MLE.
  • Violating constraints: parameters like probabilities (0 ≤ p ≤ 1) and variances (σ² > 0) must respect boundaries. Optimization should enforce them or reparameterize (e.g., optimize log σ²).
  • Stopping too early or using poor step sizes: in numerical MLE, an aggressive learning rate can diverge; one that is too small wastes time. Use line search or adaptive schedules and monitor the log-likelihood increase.
  • Overfitting and separability: in logistic regression with perfectly separable data, the MLE does not exist (coefficients diverge). Use regularization or modify the model.
  • Misinterpreting asymptotics: confidence intervals from Fisher information are large-sample approximations; with small n or model misspecification, they can be misleading.

Key Formulas

MLE Definition

θ̂_MLE = argmax_{θ ∈ Θ} p(D | θ)

Explanation: The maximum likelihood estimator is the parameter that maximizes the probability (or density) of the observed data under the model. It depends on the chosen parametric family.

Likelihood and Log-likelihood for i.i.d.

L(θ; D) = ∏_{i=1}^{n} p(x_i | θ),   ℓ(θ; D) = log L(θ; D) = ∑_{i=1}^{n} log p(x_i | θ)

Explanation: For independent data, the likelihood factors as a product over observations; taking logs turns the product into a sum. Maximizing the log-likelihood is equivalent to maximizing the likelihood.

Score Equation

∇_θ ℓ(θ; D) = 0

Explanation: At an interior maximum of a differentiable log-likelihood, the gradient (score) is zero. Solving this equation yields candidate MLEs.

Observed and Expected Information

J(θ) = −∇²_θ ℓ(θ; D),   I(θ) = E[J(θ)]

Explanation: The observed information J(θ) is the negative Hessian of the log-likelihood at the data; its expectation I(θ) is the Fisher information. These quantify curvature and inform variance estimates.

Asymptotic Normality of MLE

√n (θ̂ − θ*) →ᵈ N(0, I(θ*)⁻¹)

Explanation: As sample size grows, the MLE distribution approaches a normal centered at the true parameter with covariance given by the inverse Fisher information scaled by 1/n.

Bernoulli MLE

p̂ = (1/n) ∑_{i=1}^{n} x_i

Explanation: For Bernoulli trials, the MLE of the success probability is the sample mean of the 0/1 outcomes. This matches intuition as the fraction of successes.

Gaussian (Normal) MLE

μ̂ = x̄ = (1/n) ∑_{i=1}^{n} x_i,   σ̂² = (1/n) ∑_{i=1}^{n} (x_i − x̄)²

Explanation: For Normal data with unknown mean and variance, the MLEs are the sample mean and the average squared deviation using 1/n (not 1/(n−1)).

Logistic Regression Log-likelihood

ℓ(β) = ∑_{i=1}^{n} [y_i log σ(β⊤x_i) + (1 − y_i) log(1 − σ(β⊤x_i))],   σ(t) = 1/(1 + e^{−t})

Explanation: The logistic log-likelihood sums the log-probabilities assigned to binary labels under the sigmoid link. It is concave in β, enabling reliable optimization.

Logistic Regression Gradient and Hessian

∇_β ℓ(β) = ∑_{i=1}^{n} x_i (y_i − σ(β⊤x_i)),   H(β) = −∑_{i=1}^{n} σ(z_i)(1 − σ(z_i)) x_i x_i⊤, where z_i = β⊤x_i

Explanation: The gradient is the sum of feature vectors weighted by residuals; the Hessian is negative semidefinite, confirming concavity. These enable gradient or Newton methods.

Newton–Raphson Update

β_{t+1} = β_t − H(β_t)⁻¹ ∇_β ℓ(β_t)

Explanation: A second-order update using the inverse Hessian times the gradient can converge rapidly near the optimum. In logistic regression, this corresponds to IRLS.

Complexity Analysis

The computational cost of MLE depends on whether a closed-form solution exists or numerical optimization is required. For closed-form models like Bernoulli or univariate Gaussian, computing sufficient statistics (sum, count, sum of squares) is linear in the number of samples n and dimensions d (often d = 1). Specifically, one pass over the data costs O(n d) time and O(1) extra space beyond the data to store running totals.

For numerical MLE with first-order methods (e.g., gradient ascent) on a model with a parameter vector of size d, each iteration typically requires computing the log-likelihood and gradient, which costs O(n d) time and O(d) space for accumulators. The number of iterations T to reach tolerance depends on step size, conditioning, and curvature; overall complexity is O(T n d). For logistic regression, the log-likelihood is concave, so gradient ascent with line search converges reliably; T often ranges from tens to a few hundred.

Second-order methods (Newton/IRLS) require forming and solving linear systems involving the d×d Hessian. Forming the Hessian costs O(n d²), and solving via Cholesky is O(d³). This is advantageous when d is small to moderate and yields fast (often quadratic) convergence with far fewer iterations than first-order methods. Memory requirements grow to O(d²) to store the Hessian. When constraints are handled via reparameterization (e.g., log σ²), there is negligible additional cost. For very large n, stochastic or mini-batch variants reduce per-iteration cost to O(b d) with batch size b, trading more iterations for cheaper updates. In all cases, using the log-likelihood improves numerical stability without changing big-O complexity.

Code Examples

Closed-form MLE for Bernoulli Parameter (p) with Log-likelihood
#include <bits/stdc++.h>
using namespace std;

// Compute the log-likelihood of Bernoulli data given p
// Data x contains 0/1 values.
double bernoulli_log_likelihood(const vector<int>& x, double p) {
    // Guard against invalid p
    if (p <= 0.0 || p >= 1.0) {
        return -numeric_limits<double>::infinity();
    }
    long long n = (long long)x.size();
    long long k = 0;
    for (int xi : x) k += xi;
    // ℓ(p) = k log p + (n-k) log(1-p)
    return k * log(p) + (n - k) * log(1.0 - p);
}

// Closed-form MLE: p_hat = mean(x)
double bernoulli_mle(const vector<int>& x) {
    long long n = (long long)x.size();
    long long k = 0;
    for (int xi : x) k += xi;
    return (n == 0) ? 0.5 : (double)k / (double)n; // default to 0.5 if empty
}

int main() {
    // Generate synthetic Bernoulli data with true p = 0.7
    const int n = 1000;
    const double p_true = 0.7;

    std::mt19937 rng(42);
    std::bernoulli_distribution bern(p_true);

    vector<int> x(n);
    for (int i = 0; i < n; ++i) x[i] = bern(rng) ? 1 : 0;

    double p_hat = bernoulli_mle(x);

    // Evaluate log-likelihood at p_true and p_hat
    double ll_true = bernoulli_log_likelihood(x, p_true);
    // Handle boundary: if p_hat is 0 or 1, nudge slightly to avoid log(0)
    double p_eval = min(1.0 - 1e-12, max(1e-12, p_hat));
    double ll_hat = bernoulli_log_likelihood(x, p_eval);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "n = " << n << "\n";
    cout << "True p = " << p_true << ", MLE p_hat = " << p_hat << "\n";
    cout << "Log-likelihood at true p: " << ll_true << "\n";
    cout << "Log-likelihood at MLE p: " << ll_hat << "\n";
    return 0;
}

This program estimates the Bernoulli success probability using the MLE p̂ = mean of 0/1 outcomes. It also computes the log-likelihood at the true p and the estimated p̂ to illustrate that the MLE maximizes the log-likelihood. Logarithms avoid underflow and make the objective smooth for optimization.

Time: O(n), Space: O(1)
Closed-form MLE for Univariate Gaussian (μ, σ^2) with Log-likelihood
#include <bits/stdc++.h>
using namespace std;

struct GaussianMLE {
    double mu;
    double sigma2; // variance
};

GaussianMLE gaussian_mle(const vector<double>& x) {
    int n = (int)x.size();
    if (n == 0) return {0.0, 1.0};
    // Compute mean
    double sum = 0.0;
    for (double v : x) sum += v;
    double mu = sum / n;
    // Compute MLE variance with 1/n (biased), not 1/(n-1)
    double sse = 0.0;
    for (double v : x) {
        double d = v - mu;
        sse += d * d;
    }
    double sigma2 = max(1e-30, sse / n); // enforce positivity
    return {mu, sigma2};
}

// Log-likelihood for Normal with unknown mean and variance
// ℓ(μ, σ^2) = -n/2 * log(2πσ^2) - (1/(2σ^2)) * Σ (x_i - μ)^2
double gaussian_log_likelihood(const vector<double>& x, double mu, double sigma2) {
    if (sigma2 <= 0.0) return -numeric_limits<double>::infinity();
    int n = (int)x.size();
    double sse = 0.0;
    for (double v : x) {
        double d = v - mu;
        sse += d * d;
    }
    return -0.5 * n * log(2.0 * M_PI * sigma2) - 0.5 * sse / sigma2;
}

int main() {
    // Generate synthetic Normal data with μ=2.0, σ^2=1.5
    const int n = 2000;
    const double mu_true = 2.0;
    const double sigma2_true = 1.5;

    std::mt19937 rng(123);
    std::normal_distribution<double> norm(mu_true, sqrt(sigma2_true));

    vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = norm(rng);

    GaussianMLE est = gaussian_mle(x);

    double ll_true = gaussian_log_likelihood(x, mu_true, sigma2_true);
    double ll_hat = gaussian_log_likelihood(x, est.mu, est.sigma2);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "n = " << n << "\n";
    cout << "True mu = " << mu_true << ", True sigma2 = " << sigma2_true << "\n";
    cout << "MLE mu = " << est.mu << ", MLE sigma2 = " << est.sigma2 << "\n";
    cout << "Log-likelihood at true params: " << ll_true << "\n";
    cout << "Log-likelihood at MLE params: " << ll_hat << "\n";
    return 0;
}

This program computes the MLEs for a univariate Normal distribution: the sample mean and the average squared deviation using 1/n. It also evaluates the log-likelihood at the true parameters and at the MLE, illustrating that the MLE gives the highest log-likelihood. The code enforces σ^2 > 0 for numerical stability.

Time: O(n), Space: O(1)
Numerical MLE: Binary Logistic Regression via Gradient Ascent
#include <bits/stdc++.h>
using namespace std;

// Sigmoid with numerical stability
double sigmoid(double z) {
    if (z >= 0) {
        double ez = exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = exp(z);
        return ez / (1.0 + ez);
    }
}

// Compute log-likelihood and gradient for logistic regression
// X: n x d design matrix (row-major as vector<vector<double>>)
// y: n binary labels {0,1}
// w: d parameters (including intercept if present)
double logistic_loglik_and_grad(const vector<vector<double>>& X,
                                const vector<int>& y,
                                const vector<double>& w,
                                vector<double>& grad) {
    int n = (int)X.size();
    int d = (int)w.size();
    fill(grad.begin(), grad.end(), 0.0);
    double ll = 0.0;
    for (int i = 0; i < n; ++i) {
        // compute z = w^T x_i
        double z = 0.0;
        for (int j = 0; j < d; ++j) z += w[j] * X[i][j];
        double p = sigmoid(z);
        // log-likelihood contribution: y log p + (1-y) log(1-p)
        // guard against log(0)
        double eps = 1e-12;
        p = min(1.0 - eps, max(eps, p));
        ll += y[i] * log(p) + (1 - y[i]) * log(1.0 - p);
        // gradient contribution: x_i * (y - p)
        double r = (double)y[i] - p;
        for (int j = 0; j < d; ++j) grad[j] += X[i][j] * r;
    }
    return ll;
}

int main() {
    // Generate synthetic linearly separable-ish data but with noise
    int n = 2000; // samples
    int d = 3;    // features including intercept

    std::mt19937 rng(7);
    std::normal_distribution<double> N01(0.0, 1.0);

    // True weights (including intercept)
    vector<double> w_true = { -0.5, 2.0, -1.0 };

    // Build dataset X (with intercept column 1.0) and labels y
    vector<vector<double>> X(n, vector<double>(d, 1.0));
    vector<int> y(n);
    for (int i = 0; i < n; ++i) {
        double x1 = N01(rng);
        double x2 = N01(rng) + 0.5; // small shift
        X[i][0] = 1.0; // intercept
        X[i][1] = x1;
        X[i][2] = x2;
        double z = w_true[0] * X[i][0] + w_true[1] * X[i][1] + w_true[2] * X[i][2];
        double p = sigmoid(z);
        std::bernoulli_distribution Bern(p);
        y[i] = Bern(rng) ? 1 : 0;
    }

    // Initialize parameters to zeros
    vector<double> w(d, 0.0);

    // Gradient ascent with backtracking line search
    double ll = -numeric_limits<double>::infinity();
    vector<double> grad(d, 0.0);

    int max_iters = 500;
    for (int t = 0; t < max_iters; ++t) {
        ll = logistic_loglik_and_grad(X, y, w, grad);
        if (t % 20 == 0) {
            cout << "Iter " << t << ": log-likelihood = " << ll << "\n";
        }
        // Backtracking to ensure improvement
        double step = 1.0; // initial step size
        double c = 1e-4;   // sufficient increase parameter
        double ll_new;
        vector<double> w_new(d);
        // Try decreasing step sizes until improvement
        for (int bt = 0; bt < 20; ++bt) {
            for (int j = 0; j < d; ++j) w_new[j] = w[j] + step * grad[j];
            vector<double> tmpg(d, 0.0);
            ll_new = logistic_loglik_and_grad(X, y, w_new, tmpg);
            if (ll_new >= ll + c * step * inner_product(grad.begin(), grad.end(), grad.begin(), 0.0)) {
                break; // sufficient increase achieved
            }
            step *= 0.5;
        }
        // Update
        w = w_new;
        // Stopping based on small relative improvement
        if (t > 5 && fabs(ll_new - ll) < 1e-6 * (1.0 + fabs(ll))) {
            cout << "Converged at iteration " << t << "\n";
            break;
        }
    }

    // Evaluate training accuracy
    int correct = 0;
    for (int i = 0; i < n; ++i) {
        double z = 0.0;
        for (int j = 0; j < d; ++j) z += w[j] * X[i][j];
        int yhat = sigmoid(z) >= 0.5 ? 1 : 0;
        correct += (yhat == y[i]);
    }

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Estimated weights: ";
    for (double wi : w) cout << wi << ' ';
    cout << "\nTraining accuracy: " << (100.0 * correct / n) << "%\n";
    return 0;
}

This program fits a logistic regression model by maximizing the log-likelihood using gradient ascent with a simple backtracking line search. Each iteration computes the gradient in O(n d), proposes a step, and accepts it if the log-likelihood improves sufficiently. The sigmoid is implemented in a numerically stable way, and an intercept is included as the first feature.

Time: O(T n d), where T is the number of iterations; Space: O(d)
#maximum likelihood#log-likelihood#bernoulli mle#gaussian mle#logistic regression#gradient ascent#newton raphson#fisher information#asymptotic normality#numerical optimization#argmax#score function#hessian#c++ mle#statistical inference