Maximum Likelihood & Generative Models
Key Points
- Maximum Likelihood Estimation (MLE) picks parameters that make the observed data most probable under a chosen probabilistic model.
- Generative models learn a full probability distribution p(x) or a joint distribution p(x, y), enabling sampling, likelihood scoring, and handling missing data.
- Working with log-likelihoods avoids numerical underflow and turns products into sums, which are easier to optimize and analyze.
- For Gaussian models, MLE has closed-form solutions: the sample mean and (population) variance minimize the negative log-likelihood.
- Naive Bayes is a generative classifier that assumes conditional independence of features given the class, making training very fast and surprisingly effective in high-dimensional domains like text.
- Mixture models, such as Gaussian Mixture Models (GMMs), model data as a combination of simpler distributions and are learned via the EM algorithm.
- MLE can be viewed as minimizing the Kullback–Leibler divergence from the empirical data distribution to the model family.
- Be careful with degenerate solutions (e.g., variances collapsing) and always use numerical safeguards like variance floors and log-sum-exp.
Prerequisites
- Basic Probability Theory — Understanding random variables, independence, and probability distributions is essential for likelihood and generative models.
- Calculus and Optimization Basics — Differentiation is needed to derive MLEs and understand optimization of log-likelihoods.
- Linear Algebra — Vectors, matrices, and covariance are fundamental for multivariate Gaussian models.
- Statistics: Mean, Variance, Covariance — MLE formulas for Gaussian models rely on these summary statistics.
- Bayes' Rule — Generative classification computes posteriors from priors and likelihoods.
- Numerical Stability (log-space arithmetic) — Practical implementations avoid underflow/overflow by working with log-probabilities.
Detailed Explanation
01 Overview
Maximum Likelihood Estimation (MLE) is a fundamental principle for fitting probabilistic models to data. Given a parametric family of probability distributions p_{\theta}, MLE chooses the parameter \theta that maximizes the probability (likelihood) of the observed sample. Generative modeling uses this idea to learn full probability distributions over data, p_{\theta}(x), or joint distributions p_{\theta}(x, y) in supervised contexts. Unlike discriminative methods that learn boundaries or conditional probabilities p(y \mid x), generative models attempt to capture how the data is produced, enabling sampling, anomaly detection, and handling of latent structure. In practice, we almost always maximize the log-likelihood, which simplifies products into sums and improves numerical stability. Many classic models have closed-form MLEs (e.g., Gaussian mean and variance), while others require iterative optimization like Expectation–Maximization (EM) for mixture models. Beyond fitting, generative models provide likelihood scores that quantify how well new observations match the learned distribution, a key property for tasks such as outlier detection. Importantly, MLE connects to information theory: it asymptotically finds the model that is closest to the true data-generating process within the model class, measured by Kullback–Leibler (KL) divergence.
02 Intuition & Analogies
Imagine trying to guess the settings of a camera by only looking at the photos it took. If many photos are bright, you infer a higher exposure; if they are noisy, perhaps higher ISO. MLE formalizes this intuition: pick the settings (parameters) that would most likely produce the photos you observed. Another analogy: suppose you toss a possibly biased coin 100 times and see 70 heads. The MLE for the bias is 0.7—because among all possible biases, a 70% heads coin makes your exact observation the most plausible. Generative models expand this idea from coins to complex data like heights, sounds, or images: they try to learn a full density "landscape" over the space of possible observations. If you think of data points as pebbles dropped on a landscape, MLE adjusts the terrain so that pebbles tend to land on high ground (high probability regions). For mixtures, like a population with multiple subgroups, the landscape is made of several hills (components); the EM algorithm alternates between softly assigning pebbles to hills (E-step) and reshaping each hill to fit its assigned pebbles (M-step). Naive Bayes adds labels: for each class (each hill family), it models how features are produced, then uses Bayes' rule to pick the class whose hill explains the features best. Working in log-space is like switching from multiplying many small probabilities (which can underflow) to adding their logs, just like adding up evidence across data points.
03 Formal Definition
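A standard formal statement, written to match the notation used above: given i.i.d. samples x_1, \dots, x_n from a parametric family p_{\theta}, the likelihood, log-likelihood, and MLE are

```latex
\mathcal{L}(\theta) = \prod_{i=1}^{n} p_{\theta}(x_i), \qquad
\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log p_{\theta}(x_i), \qquad
\hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\, \ell(\theta).
```

Maximizing \ell(\theta) and \mathcal{L}(\theta) yields the same \hat{\theta} because \log is strictly increasing.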
04 When to Use
Use MLE and generative models when you need a full probability distribution over data: to sample new examples, to score how typical a point is (anomaly detection), or to reason with missing or latent variables. They shine in low- to medium-dimensional problems where domain assumptions are credible: Gaussian models for sensor noise, Poisson for counts, or mixtures when data plausibly comes from subpopulations. In supervised settings with limited labeled data but lots of unlabeled data, modeling p(x) can help via semi-supervised or weakly supervised approaches. Naive Bayes is particularly effective in high-dimensional sparse domains like text (bag-of-words), where the conditional independence assumption is a good approximation. GMMs are suitable when clusters have roughly ellipsoidal shapes; using diagonal covariances scales better in high dimensions. Choose generative approaches when you want uncertainty-aware decisions (posterior probabilities), the ability to impute missing features, or interpretability of component distributions. Prefer discriminative methods (e.g., logistic regression, SVMs) when you only need p(y \mid x) or decision boundaries, especially in very high-dimensional settings with limited modeling assumptions, since they can achieve lower classification error even if they don’t model p(x).
⚠️ Common Mistakes
• Multiplying tiny probabilities directly instead of summing log-probabilities, causing numerical underflow; always use logs and the log-sum-exp trick.
• Forgetting variance/likelihood smoothing: with small data, MLE can drive variances toward zero (especially in mixtures), leading to degenerate solutions; use variance floors, regularization, or MAP with priors.
• Misapplying independence assumptions in Naive Bayes: highly correlated features can distort likelihoods; consider feature selection, decorrelation, or switching to models with covariances.
• Ignoring identifiability and label switching in mixtures: component labels are arbitrary; compare models via likelihood, not label indices.
• Not checking convergence or initialization in EM: poor starts can lead to bad local optima; run multiple restarts and monitor log-likelihood monotonicity.
• Using sample variance with 1/(n-1) when fitting Gaussian MLE; the MLE for variance uses 1/n—know which you need.
• Evaluating densities with singular or near-singular covariances; use diagonal or regularized covariances (\Sigma + \lambda I).
• Mixing probabilities and logs incorrectly in classifiers (e.g., adding probabilities to log-probabilities).
• Data leakage: estimating parameters (e.g., priors, means, variances) on the full dataset including test data inflates performance.
Key Formulas
Maximum Likelihood
\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} p_{\theta}(x_i) = \arg\max_{\theta} \sum_{i=1}^{n} \log p_{\theta}(x_i)
Explanation: The MLE chooses parameters that maximize the probability (or log-probability) of the observed data. The log version is preferred for numerical stability and simplicity.
Gaussian Log-Likelihood (Univariate)
\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2
Explanation: This is the log-likelihood for i.i.d. Gaussian samples with mean \mu and variance \sigma^2. Maximizing it yields closed-form MLEs for \mu and \sigma^2.
Gaussian MLE Estimates
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2
Explanation: For i.i.d. Gaussian data, the MLE of the mean is the sample average and the MLE of the variance uses 1/n (population variance), not 1/(n-1).
Multivariate Gaussian Density
\mathcal{N}(x \mid \mu, \Sigma) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)
Explanation: This gives the probability density of a d-dimensional normal distribution. The determinant and inverse of the covariance control spread and orientation.
Mixture Model
p(x) = \sum_{k=1}^{K} \pi_k \, p_k(x), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
Explanation: A mixture represents data as a weighted sum of component distributions. The \pi_k are mixing weights that form a probability distribution.
EM E-step (Responsibilities)
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}
Explanation: In the E-step, compute the posterior probability that each component generated each data point. These act as soft assignments.
EM M-step (1D GMM)
N_k = \sum_{i=1}^{n}\gamma_{ik}, \qquad \pi_k = \frac{N_k}{n}, \qquad \mu_k = \frac{1}{N_k}\sum_{i=1}^{n}\gamma_{ik}\, x_i, \qquad \sigma_k^2 = \frac{1}{N_k}\sum_{i=1}^{n}\gamma_{ik}\,(x_i - \mu_k)^2
Explanation: Update component weights, means, and variances using responsibilities as fractional counts. This maximizes the expected complete-data log-likelihood.
Naive Bayes Decision Rule
\hat{y} = \arg\max_{k} \left[\log \pi_k + \sum_{j=1}^{d} \log p(x_j \mid y = k)\right]
Explanation: Under conditional independence, the class with the largest sum of log prior and log-conditional densities is predicted.
KL Divergence
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
Explanation: KL measures how different q is from p. MLE asymptotically minimizes KL between the true data distribution and the model.
NLL and Cross-Entropy
\frac{1}{n}\sum_{i=1}^{n} -\log p_{\theta}(x_i) = H(\hat{p}, p_{\theta}) = H(\hat{p}) + D_{\mathrm{KL}}(\hat{p} \,\|\, p_{\theta})
Explanation: The average negative log-likelihood equals the cross-entropy between the empirical distribution \hat{p} and the model; it exceeds the empirical entropy by exactly the KL divergence. Since H(\hat{p}) does not depend on \theta, minimizing NLL minimizes both the cross-entropy and the KL divergence.
MAP Estimation
\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right]
Explanation: Maximum a posteriori estimation adds a log-prior to the log-likelihood, acting as regularization that can prevent degenerate MLEs.
Complexity Analysis
- Gaussian MLE (1D): O(n) time and O(1) extra space — two passes over the data (mean, then squared deviations).
- Gaussian Naive Bayes: training is O(n d) for n samples and d features; prediction is O(K d) per point for K classes.
- GMM via EM (1D, K components): each iteration is O(n K); total cost depends on the number of iterations to convergence and the number of random restarts.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Gaussian1D {
    double mu;
    double sigma2; // variance
};

// Compute MLE for mean and variance (population variance, 1/n)
Gaussian1D mle_gaussian_1d(const vector<double>& x) {
    int n = (int)x.size();
    if (n == 0) throw runtime_error("Empty data");
    double sum = 0.0;
    for (double v : x) sum += v;
    double mu = sum / n;
    double sq = 0.0;
    for (double v : x) {
        double d = v - mu;
        sq += d * d;
    }
    double sigma2 = sq / n; // MLE uses 1/n (not 1/(n-1))
    return {mu, sigma2};
}

// Log PDF of N(mu, sigma2)
double log_pdf_gaussian_1d(double x, double mu, double sigma2) {
    const double two_pi = 2.0 * M_PI;
    return -0.5 * log(two_pi * sigma2) - ((x - mu) * (x - mu)) / (2.0 * sigma2);
}

// Total log-likelihood of dataset under Gaussian(mu, sigma2)
double log_likelihood_gaussian_1d(const vector<double>& x, double mu, double sigma2) {
    double ll = 0.0;
    for (double v : x) ll += log_pdf_gaussian_1d(v, mu, sigma2);
    return ll;
}

int main() {
    // Generate synthetic data from N(3.0, 2.0^2)
    std::mt19937 rng(42);
    std::normal_distribution<double> dist(3.0, 2.0);
    vector<double> data(1000);
    for (double& v : data) v = dist(rng);

    // Fit MLE
    Gaussian1D g = mle_gaussian_1d(data);

    // Compute log-likelihood under fitted params
    double ll = log_likelihood_gaussian_1d(data, g.mu, g.sigma2);

    cout << fixed << setprecision(4);
    cout << "Estimated mu = " << g.mu << ", sigma^2 = " << g.sigma2 << "\n";
    cout << "Average NLL = " << (-ll / data.size()) << " (lower is better)\n";

    // Score a new point
    double x_new = 0.0;
    cout << "log p(x_new) = " << log_pdf_gaussian_1d(x_new, g.mu, g.sigma2) << "\n";
    return 0;
}
```
This program computes the MLE for a 1D Gaussian: the mean is the sample average and the variance uses 1/n. It then evaluates the log-likelihood of the data under the fitted model and demonstrates scoring a new point via its log density. Working in log-space avoids numerical underflow.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct GaussianNB {
    int K = 0;                  // number of classes
    int d = 0;                  // number of features
    vector<double> log_prior;   // log priors log pi_k
    vector<vector<double>> mu;  // K x d means
    vector<vector<double>> var; // K x d variances (diagonal covariance)
    double var_floor = 1e-6;    // regularization to avoid zero variance

    void fit(const vector<vector<double>>& X, const vector<int>& y) {
        int n = (int)X.size();
        if (n == 0) throw runtime_error("Empty dataset");
        d = (int)X[0].size();
        K = 1 + *max_element(y.begin(), y.end());
        vector<int> count(K, 0);
        mu.assign(K, vector<double>(d, 0.0));
        var.assign(K, vector<double>(d, 0.0));

        // Accumulate sums per class
        for (int i = 0; i < n; ++i) {
            int c = y[i];
            count[c]++;
            for (int j = 0; j < d; ++j) mu[c][j] += X[i][j];
        }
        for (int k = 0; k < K; ++k) {
            if (count[k] == 0) throw runtime_error("Empty class encountered");
            for (int j = 0; j < d; ++j) mu[k][j] /= count[k];
        }
        // Accumulate squared deviations for variances (MLE uses 1/n_c)
        for (int i = 0; i < n; ++i) {
            int c = y[i];
            for (int j = 0; j < d; ++j) {
                double diff = X[i][j] - mu[c][j];
                var[c][j] += diff * diff;
            }
        }
        for (int k = 0; k < K; ++k) {
            for (int j = 0; j < d; ++j) {
                var[k][j] = max(var[k][j] / count[k], var_floor);
            }
        }
        // MLE priors, clamped away from zero before taking logs
        log_prior.assign(K, 0.0);
        for (int k = 0; k < K; ++k) {
            double pi = (double)count[k] / n; // MLE prior
            log_prior[k] = log(max(pi, 1e-12));
        }
    }

    // Log of univariate Gaussian density
    static double log_gauss_1d(double x, double mu, double sigma2) {
        return -0.5 * log(2.0 * M_PI * sigma2) - ((x - mu) * (x - mu)) / (2.0 * sigma2);
    }

    // Log posterior up to a constant: log pi_k + sum_j log N(x_j; mu_{k,j}, var_{k,j})
    vector<double> log_posterior_unnorm(const vector<double>& x) const {
        vector<double> s(K, 0.0);
        for (int k = 0; k < K; ++k) {
            double acc = log_prior[k];
            for (int j = 0; j < d; ++j) acc += log_gauss_1d(x[j], mu[k][j], var[k][j]);
            s[k] = acc;
        }
        return s;
    }

    int predict(const vector<double>& x) const {
        vector<double> s = log_posterior_unnorm(x);
        return (int)(max_element(s.begin(), s.end()) - s.begin());
    }
};

// Utility: generate synthetic 2D data from two Gaussian classes
void make_toy_data(vector<vector<double>>& X, vector<int>& y) {
    std::mt19937 rng(123);
    int n_per_class = 200;
    normal_distribution<double> n00(0.0, 1.0); // Class 0 ~ N([0,0], I)
    normal_distribution<double> n33(3.0, 1.0); // Class 1 ~ N([3,3], I)

    for (int i = 0; i < n_per_class; ++i) {
        X.push_back({n00(rng), n00(rng)});
        y.push_back(0);
    }
    for (int i = 0; i < n_per_class; ++i) {
        X.push_back({n33(rng), n33(rng)});
        y.push_back(1);
    }
}

int main() {
    vector<vector<double>> X;
    vector<int> y;
    make_toy_data(X, y);

    GaussianNB nb;
    nb.fit(X, y);

    // Test a few points
    vector<vector<double>> tests = {{0, 0}, {2, 2}, {4, 4}, {-1, 3}};
    for (const auto& x : tests) {
        int pred = nb.predict(x);
        cout << fixed << setprecision(3);
        cout << "x = [" << x[0] << ", " << x[1] << "] -> class " << pred << "\n";
    }
    return 0;
}
```
This implementation trains a Gaussian Naive Bayes classifier with diagonal covariances. Training computes class-wise means, variances (with a variance floor for stability), and priors. Prediction sums log prior and per-feature log Gaussian densities and picks the class with the highest score.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct GMM1D {
    int K;
    vector<double> pi;     // mixing weights
    vector<double> mu;     // means
    vector<double> sigma2; // variances
    double var_floor = 1e-6;

    explicit GMM1D(int K_) : K(K_), pi(K_, 1.0 / K_), mu(K_, 0.0), sigma2(K_, 1.0) {}

    static double log_gauss_1d(double x, double m, double s2) {
        return -0.5 * log(2.0 * M_PI * s2) - ((x - m) * (x - m)) / (2.0 * s2);
    }

    static double logsumexp(const vector<double>& a) {
        double m = *max_element(a.begin(), a.end());
        double acc = 0.0;
        for (double v : a) acc += exp(v - m);
        return m + log(acc);
    }

    double log_likelihood(const vector<double>& x) const {
        double ll = 0.0;
        for (double xi : x) {
            vector<double> lp(K);
            for (int k = 0; k < K; ++k)
                lp[k] = log(max(pi[k], 1e-16)) + log_gauss_1d(xi, mu[k], sigma2[k]);
            ll += logsumexp(lp);
        }
        return ll;
    }

    void initialize_params(const vector<double>& x) {
        // Simple initialization: spread means across data range, uniform pi, shared variance
        double mn = *min_element(x.begin(), x.end());
        double mx = *max_element(x.begin(), x.end());
        for (int k = 0; k < K; ++k) {
            mu[k] = mn + (mx - mn) * (k + 0.5) / K;
            pi[k] = 1.0 / K;
            sigma2[k] = max(1e-2, 0.25 * (mx - mn) * (mx - mn));
        }
    }

    void fit(const vector<double>& x, int max_iter = 200, double tol = 1e-6) {
        if (x.empty()) throw runtime_error("Empty data");
        initialize_params(x);
        double prev_ll = -numeric_limits<double>::infinity();

        for (int it = 0; it < max_iter; ++it) {
            // E-step (streaming responsibilities) and M-step accumulators
            vector<double> Nk(K, 0.0), mu_num(K, 0.0), var_num(K, 0.0);

            for (double xi : x) {
                vector<double> lp(K);
                for (int k = 0; k < K; ++k)
                    lp[k] = log(max(pi[k], 1e-16)) + log_gauss_1d(xi, mu[k], sigma2[k]);
                double lse = logsumexp(lp);
                for (int k = 0; k < K; ++k) {
                    double rik = exp(lp[k] - lse); // responsibility
                    Nk[k] += rik;
                    mu_num[k] += rik * xi;
                }
            }
            // Update means using Nk and mu_num
            for (int k = 0; k < K; ++k) {
                mu[k] = mu_num[k] / max(Nk[k], 1e-16);
            }
            // Recompute variances with the new means
            fill(var_num.begin(), var_num.end(), 0.0);
            for (double xi : x) {
                vector<double> lp(K);
                for (int k = 0; k < K; ++k)
                    lp[k] = log(max(pi[k], 1e-16)) + log_gauss_1d(xi, mu[k], sigma2[k]);
                double lse = logsumexp(lp);
                for (int k = 0; k < K; ++k) {
                    double rik = exp(lp[k] - lse);
                    double diff = xi - mu[k];
                    var_num[k] += rik * diff * diff;
                }
            }
            for (int k = 0; k < K; ++k) {
                sigma2[k] = max(var_num[k] / max(Nk[k], 1e-16), var_floor);
                pi[k] = Nk[k] / (double)x.size();
            }

            double ll = log_likelihood(x);
            // EM should not decrease the log-likelihood (up to numerical noise)
            if (it > 0 && ll < prev_ll - 1e-8)
                cerr << "Warning: LL decreased slightly (" << prev_ll << " -> " << ll << ")\n";
            if (fabs(ll - prev_ll) < tol * (1.0 + fabs(prev_ll))) break; // converged
            prev_ll = ll;
        }
    }
};

int main() {
    // Generate 1D data from a 2-component mixture
    std::mt19937 rng(7);
    normal_distribution<double> n1(-2.0, 0.7); // mean -2, std 0.7
    normal_distribution<double> n2(3.0, 1.2);  // mean 3, std 1.2
    bernoulli_distribution choose_second(0.4); // mixing weight ~0.4 for component 2

    vector<double> data;
    for (int i = 0; i < 1000; ++i) {
        if (choose_second(rng)) data.push_back(n2(rng));
        else data.push_back(n1(rng));
    }

    GMM1D gmm(2);
    gmm.fit(data, 200, 1e-6);

    cout << fixed << setprecision(4);
    cout << "Estimated parameters:\n";
    for (int k = 0; k < gmm.K; ++k) {
        cout << " k=" << k << ": pi=" << gmm.pi[k]
             << ", mu=" << gmm.mu[k]
             << ", sigma2=" << gmm.sigma2[k] << "\n";
    }
    cout << "Log-likelihood: " << gmm.log_likelihood(data) << "\n";
    return 0;
}
```
This code fits a 1D Gaussian Mixture Model using the EM algorithm. Responsibilities are computed in log-space via log-sum-exp for numerical stability, and parameters are updated using streaming accumulators to avoid storing an n×K matrix. Variance floors prevent degeneracy.