Maximum Likelihood & Generative Models
Key Points
- Maximum Likelihood Estimation (MLE) picks parameters that make the observed data most probable under a chosen probabilistic model.
- Generative models learn a full probability distribution p(x) or a joint distribution p(x, y), enabling sampling, likelihood scoring, and handling missing data.
- Working with log-likelihoods avoids numerical underflow and turns products into sums, which are easier to optimize and analyze.
- For Gaussian models, MLE has closed-form solutions: the sample mean and (population) variance minimize the negative log-likelihood.
- Naive Bayes is a generative classifier that assumes conditional independence of features given the class, making training very fast and surprisingly effective in high-dimensional domains like text.
- Mixture models, such as Gaussian Mixture Models (GMMs), model data as a combination of simpler distributions and are learned via the EM algorithm.
- MLE can be viewed as minimizing the Kullback–Leibler divergence from the empirical data distribution to the model family.
- Be careful with degenerate solutions (e.g., variances collapsing) and always use numerical safeguards like variance floors and log-sum-exp.
Prerequisites
- Basic Probability Theory — Understanding random variables, independence, and probability distributions is essential for likelihood and generative models.
- Calculus and Optimization Basics — Differentiation is needed to derive MLEs and understand optimization of log-likelihoods.
- Linear Algebra — Vectors, matrices, and covariance are fundamental for multivariate Gaussian models.
- Statistics: Mean, Variance, Covariance — MLE formulas for Gaussian models rely on these summary statistics.
- Bayes' Rule — Generative classification computes posteriors from priors and likelihoods.
- Numerical Stability (log-space arithmetic) — Practical implementations avoid underflow/overflow by working with log-probabilities.
Detailed Explanation
01 Overview
Maximum Likelihood Estimation (MLE) is a fundamental principle for fitting probabilistic models to data. Given a parametric family of probability distributions p_{\theta}, MLE chooses the parameter \theta that maximizes the probability (likelihood) of the observed sample. Generative modeling uses this idea to learn full probability distributions over data, p_{\theta}(x), or joint distributions p_{\theta}(x, y) in supervised contexts. Unlike discriminative methods that learn boundaries or conditional probabilities p(y \mid x), generative models attempt to capture how the data is produced, enabling sampling, anomaly detection, and handling of latent structure. In practice, we almost always maximize the log-likelihood, which simplifies products into sums and improves numerical stability. Many classic models have closed-form MLEs (e.g., Gaussian mean and variance), while others require iterative optimization like Expectation–Maximization (EM) for mixture models. Beyond fitting, generative models provide likelihood scores that quantify how well new observations match the learned distribution, a key property for tasks such as outlier detection. Importantly, MLE connects to information theory: it asymptotically finds the model that is closest to the true data-generating process within the model class, measured by Kullback–Leibler (KL) divergence.
02 Intuition & Analogies
Imagine trying to guess the settings of a camera by only looking at the photos it took. If many photos are bright, you infer a higher exposure; if they are noisy, perhaps higher ISO. MLE formalizes this intuition: pick the settings (parameters) that would most likely produce the photos you observed. Another analogy: suppose you toss a possibly biased coin 100 times and see 70 heads. The MLE for the bias is 0.7—because among all possible biases, a 70% heads coin makes your exact observation the most plausible. Generative models expand this idea from coins to complex data like heights, sounds, or images: they try to learn a full density "landscape" over the space of possible observations. If you think of data points as pebbles dropped on a landscape, MLE adjusts the terrain so that pebbles tend to land on high ground (high probability regions). For mixtures, like a population with multiple subgroups, the landscape is made of several hills (components); the EM algorithm alternates between softly assigning pebbles to hills (E-step) and reshaping each hill to fit its assigned pebbles (M-step). Naive Bayes adds labels: for each class (each hill family), it models how features are produced, then uses Bayes' rule to pick the class whose hill explains the features best. Working in log-space is like switching from multiplying many small probabilities (which can underflow) to adding their logs, just like adding up evidence across data points.
03 Formal Definition
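A standard formal statement, written to match the notation used above: given i.i.d. samples x_1, \dots, x_n from a parametric family p_{\theta}, the likelihood, log-likelihood, and MLE are

```latex
\mathcal{L}(\theta) = \prod_{i=1}^{n} p_{\theta}(x_i), \qquad
\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log p_{\theta}(x_i), \qquad
\hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\, \ell(\theta).
```

Maximizing \ell(\theta) and \mathcal{L}(\theta) yields the same \hat{\theta} because \log is strictly increasing.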
04 When to Use
Use MLE and generative models when you need a full probability distribution over data: to sample new examples, to score how typical a point is (anomaly detection), or to reason with missing or latent variables. They shine in low- to medium-dimensional problems where domain assumptions are credible: Gaussian models for sensor noise, Poisson for counts, or mixtures when data plausibly comes from subpopulations. In supervised settings with limited labeled data but lots of unlabeled data, modeling p(x) can help via semi-supervised or weakly supervised approaches. Naive Bayes is particularly effective in high-dimensional sparse domains like text (bag-of-words), where the conditional independence assumption is a good approximation. GMMs are suitable when clusters have roughly ellipsoidal shapes; using diagonal covariances scales better in high dimensions. Choose generative approaches when you want uncertainty-aware decisions (posterior probabilities), the ability to impute missing features, or interpretability of component distributions. Prefer discriminative methods (e.g., logistic regression, SVMs) when you only need p(y \mid x) or decision boundaries, especially in very high-dimensional settings with limited modeling assumptions, since they can achieve lower classification error even if they don’t model p(x).
⚠️ Common Mistakes
• Multiplying tiny probabilities directly instead of summing log-probabilities, causing numerical underflow; always use logs and the log-sum-exp trick.
• Forgetting variance/likelihood smoothing: with small data, MLE can drive variances toward zero (especially in mixtures), leading to degenerate solutions; use variance floors, regularization, or MAP with priors.
• Misapplying independence assumptions in Naive Bayes: highly correlated features can distort likelihoods; consider feature selection, decorrelation, or switching to models with covariances.
• Ignoring identifiability and label switching in mixtures: component labels are arbitrary; compare models via likelihood, not label indices.
• Not checking convergence or initialization in EM: poor starts can lead to bad local optima; run multiple restarts and monitor log-likelihood monotonicity.
• Using sample variance with 1/(n-1) when fitting Gaussian MLE; the MLE for variance uses 1/n—know which you need.
• Evaluating densities with singular or near-singular covariances; use diagonal or regularized covariances (\Sigma + \lambda I).
• Mixing probabilities and logs incorrectly in classifiers (e.g., adding probabilities to log-probabilities).
• Data leakage: estimating parameters (e.g., priors, means, variances) on the full dataset including test data inflates performance.
Key Formulas
Maximum Likelihood
\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} p_{\theta}(x_i) = \arg\max_{\theta} \sum_{i=1}^{n} \log p_{\theta}(x_i)
Explanation: The MLE chooses parameters that maximize the probability (or log-probability) of the observed data. The log version is preferred for numerical stability and simplicity.
Gaussian Log-Likelihood (Univariate)
\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2
Explanation: This is the log-likelihood for i.i.d. Gaussian samples with mean \mu and variance \sigma^2. Maximizing it yields closed-form MLEs for \mu and \sigma^2.
Gaussian MLE Estimates
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2
Explanation: For i.i.d. Gaussian data, the MLE of the mean is the sample average and the MLE of the variance uses 1/n (population variance), not 1/(n-1).
Multivariate Gaussian Density
\mathcal{N}(x \mid \mu, \Sigma) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)
Explanation: This gives the probability density of a d-dimensional normal distribution. The determinant and inverse of the covariance control spread and orientation.
Mixture Model
p(x) = \sum_{k=1}^{K} \pi_k \, p_k(x), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
Explanation: A mixture represents data as a weighted sum of component distributions. The \pi_k are mixing weights that form a probability distribution.
EM E-step (Responsibilities)
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}
Explanation: In the E-step, compute the posterior probability that each component generated each data point. These act as soft assignments.
EM M-step (1D GMM)
N_k = \sum_{i=1}^{n}\gamma_{ik}, \qquad \pi_k = \frac{N_k}{n}, \qquad \mu_k = \frac{1}{N_k}\sum_{i=1}^{n}\gamma_{ik}\, x_i, \qquad \sigma_k^2 = \frac{1}{N_k}\sum_{i=1}^{n}\gamma_{ik}\,(x_i - \mu_k)^2
Explanation: Update component weights, means, and variances using responsibilities as fractional counts. This maximizes the expected complete-data log-likelihood.
Naive Bayes Decision Rule
\hat{y} = \arg\max_{k} \left[\log \pi_k + \sum_{j=1}^{d} \log p(x_j \mid y = k)\right]
Explanation: Under conditional independence, the class with the largest sum of log prior and log-conditional densities is predicted.
KL Divergence
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
Explanation: KL measures how different q is from p. MLE asymptotically minimizes KL between the true data distribution and the model.
NLL and Cross-Entropy
\frac{1}{n}\sum_{i=1}^{n} -\log p_{\theta}(x_i) = H(\hat{p}, p_{\theta}) = H(\hat{p}) + D_{\mathrm{KL}}(\hat{p} \,\|\, p_{\theta})
Explanation: The average negative log-likelihood equals the cross-entropy between the empirical distribution \hat{p} and the model; it exceeds the empirical entropy by exactly the KL divergence. Since H(\hat{p}) does not depend on \theta, minimizing NLL minimizes both the cross-entropy and the KL divergence.
MAP Estimation
\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right]
Explanation: Maximum a posteriori estimation adds a log-prior to the log-likelihood, acting as regularization that can prevent degenerate MLEs.
Complexity Analysis
- Gaussian MLE (1D): O(n) time and O(1) extra space — two passes over the data (mean, then squared deviations).
- Gaussian Naive Bayes: training is O(n d) for n samples and d features; prediction is O(K d) per point for K classes.
- GMM via EM (1D, K components): each iteration is O(n K); total cost depends on the number of iterations to convergence and the number of random restarts.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Gaussian1D {
    double mu;
    double sigma2; // variance
};

// Compute MLE for mean and variance (population variance, 1/n)
Gaussian1D mle_gaussian_1d(const vector<double>& x) {
    int n = (int)x.size();
    if (n == 0) throw runtime_error("Empty data");
    double sum = 0.0;
    for (double v : x) sum += v;
    double mu = sum / n;
    double sq = 0.0;
    for (double v : x) {
        double d = v - mu;
        sq += d * d;
    }
    double sigma2 = sq / n; // MLE uses 1/n (not 1/(n-1))
    return {mu, sigma2};
}

// Log PDF of N(mu, sigma2)
double log_pdf_gaussian_1d(double x, double mu, double sigma2) {
    const double two_pi = 2.0 * M_PI;
    return -0.5 * log(two_pi * sigma2) - ((x - mu) * (x - mu)) / (2.0 * sigma2);
}

// Total log-likelihood of dataset under Gaussian(mu, sigma2)
double log_likelihood_gaussian_1d(const vector<double>& x, double mu, double sigma2) {
    double ll = 0.0;
    for (double v : x) ll += log_pdf_gaussian_1d(v, mu, sigma2);
    return ll;
}

int main() {
    // Generate synthetic data from N(3.0, 2.0^2)
    std::mt19937 rng(42);
    std::normal_distribution<double> dist(3.0, 2.0);
    vector<double> data(1000);
    for (double& v : data) v = dist(rng);

    // Fit MLE
    Gaussian1D g = mle_gaussian_1d(data);

    // Compute log-likelihood under fitted params
    double ll = log_likelihood_gaussian_1d(data, g.mu, g.sigma2);

    cout << fixed << setprecision(4);
    cout << "Estimated mu = " << g.mu << ", sigma^2 = " << g.sigma2 << "\n";
    cout << "Average NLL = " << (-ll / data.size()) << " (lower is better)\n";

    // Score a new point
    double x_new = 0.0;
    cout << "log p(x_new) = " << log_pdf_gaussian_1d(x_new, g.mu, g.sigma2) << "\n";
    return 0;
}
```
This program computes the MLE for a 1D Gaussian: the mean is the sample average and the variance uses 1/n. It then evaluates the log-likelihood of the data under the fitted model and demonstrates scoring a new point via its log density. Working in log-space avoids numerical underflow.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct GaussianNB {
    int K = 0;                  // number of classes
    int d = 0;                  // number of features
    vector<double> log_prior;   // log priors log pi_k
    vector<vector<double>> mu;  // K x d means
    vector<vector<double>> var; // K x d variances (diagonal covariance)
    double var_floor = 1e-6;    // regularization to avoid zero variance

    void fit(const vector<vector<double>>& X, const vector<int>& y) {
        int n = (int)X.size();
        if (n == 0) throw runtime_error("Empty dataset");
        d = (int)X[0].size();
        K = 1 + *max_element(y.begin(), y.end());
        vector<int> count(K, 0);
        mu.assign(K, vector<double>(d, 0.0));
        var.assign(K, vector<double>(d, 0.0));

        // Accumulate sums per class
        for (int i = 0; i < n; ++i) {
            int c = y[i];
            count[c]++;
            for (int j = 0; j < d; ++j) mu[c][j] += X[i][j];
        }
        for (int k = 0; k < K; ++k) {
            if (count[k] == 0) throw runtime_error("Empty class encountered");
            for (int j = 0; j < d; ++j) mu[k][j] /= count[k];
        }
        // Accumulate squared deviations for variances (MLE uses 1/n_c)
        for (int i = 0; i < n; ++i) {
            int c = y[i];
            for (int j = 0; j < d; ++j) {
                double diff = X[i][j] - mu[c][j];
                var[c][j] += diff * diff;
            }
        }
        for (int k = 0; k < K; ++k) {
            for (int j = 0; j < d; ++j) {
                var[k][j] = max(var[k][j] / count[k], var_floor);
            }
        }
        // MLE priors, clamped away from zero before taking logs
        log_prior.assign(K, 0.0);
        for (int k = 0; k < K; ++k) {
            double pi = (double)count[k] / n; // MLE prior
            log_prior[k] = log(max(pi, 1e-12));
        }
    }

    // Log of univariate Gaussian density
    static double log_gauss_1d(double x, double mu, double sigma2) {
        return -0.5 * log(2.0 * M_PI * sigma2) - ((x - mu) * (x - mu)) / (2.0 * sigma2);
    }

    // Log posterior up to a constant: log pi_k + sum_j log N(x_j; mu_{k,j}, var_{k,j})
    vector<double> log_posterior_unnorm(const vector<double>& x) const {
        vector<double> s(K, 0.0);
        for (int k = 0; k < K; ++k) {
            double acc = log_prior[k];
            for (int j = 0; j < d; ++j) acc += log_gauss_1d(x[j], mu[k][j], var[k][j]);
            s[k] = acc;
        }
        return s;
    }

    int predict(const vector<double>& x) const {
        vector<double> s = log_posterior_unnorm(x);
        return (int)(max_element(s.begin(), s.end()) - s.begin());
    }
};

// Utility: generate synthetic 2D data from two Gaussian classes
void make_toy_data(vector<vector<double>>& X, vector<int>& y) {
    std::mt19937 rng(123);
    int n_per_class = 200;
    normal_distribution<double> n00(0.0, 1.0); // Class 0 ~ N([0,0], I)
    normal_distribution<double> n33(3.0, 1.0); // Class 1 ~ N([3,3], I)

    for (int i = 0; i < n_per_class; ++i) {
        X.push_back({n00(rng), n00(rng)});
        y.push_back(0);
    }
    for (int i = 0; i < n_per_class; ++i) {
        X.push_back({n33(rng), n33(rng)});
        y.push_back(1);
    }
}

int main() {
    vector<vector<double>> X;
    vector<int> y;
    make_toy_data(X, y);

    GaussianNB nb;
    nb.fit(X, y);

    // Test a few points
    vector<vector<double>> tests = {{0, 0}, {2, 2}, {4, 4}, {-1, 3}};
    for (const auto& x : tests) {
        int pred = nb.predict(x);
        cout << fixed << setprecision(3);
        cout << "x = [" << x[0] << ", " << x[1] << "] -> class " << pred << "\n";
    }
    return 0;
}
```
This implementation trains a Gaussian Naive Bayes classifier with diagonal covariances. Training computes class-wise means, variances (with a variance floor for stability), and priors. Prediction sums log prior and per-feature log Gaussian densities and picks the class with the highest score.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct GMM1D {
    int K;
    vector<double> pi;     // mixing weights
    vector<double> mu;     // means
    vector<double> sigma2; // variances
    double var_floor = 1e-6;

    explicit GMM1D(int K_) : K(K_), pi(K_, 1.0 / K_), mu(K_, 0.0), sigma2(K_, 1.0) {}

    static double log_gauss_1d(double x, double m, double s2) {
        return -0.5 * log(2.0 * M_PI * s2) - ((x - m) * (x - m)) / (2.0 * s2);
    }

    static double logsumexp(const vector<double>& a) {
        double m = *max_element(a.begin(), a.end());
        double acc = 0.0;
        for (double v : a) acc += exp(v - m);
        return m + log(acc);
    }

    double log_likelihood(const vector<double>& x) const {
        double ll = 0.0;
        for (double xi : x) {
            vector<double> lp(K);
            for (int k = 0; k < K; ++k)
                lp[k] = log(max(pi[k], 1e-16)) + log_gauss_1d(xi, mu[k], sigma2[k]);
            ll += logsumexp(lp);
        }
        return ll;
    }

    void initialize_params(const vector<double>& x) {
        // Simple initialization: spread means across data range, uniform pi, shared variance
        double mn = *min_element(x.begin(), x.end());
        double mx = *max_element(x.begin(), x.end());
        for (int k = 0; k < K; ++k) {
            mu[k] = mn + (mx - mn) * (k + 0.5) / K;
            pi[k] = 1.0 / K;
            sigma2[k] = max(1e-2, 0.25 * (mx - mn) * (mx - mn));
        }
    }

    void fit(const vector<double>& x, int max_iter = 200, double tol = 1e-6) {
        if (x.empty()) throw runtime_error("Empty data");
        initialize_params(x);
        double prev_ll = -numeric_limits<double>::infinity();

        for (int it = 0; it < max_iter; ++it) {
            // E-step (streaming responsibilities) and M-step accumulators
            vector<double> Nk(K, 0.0), mu_num(K, 0.0), var_num(K, 0.0);

            for (double xi : x) {
                vector<double> lp(K);
                for (int k = 0; k < K; ++k)
                    lp[k] = log(max(pi[k], 1e-16)) + log_gauss_1d(xi, mu[k], sigma2[k]);
                double lse = logsumexp(lp);
                for (int k = 0; k < K; ++k) {
                    double rik = exp(lp[k] - lse); // responsibility
                    Nk[k] += rik;
                    mu_num[k] += rik * xi;
                }
            }
            // Update means using Nk and mu_num
            for (int k = 0; k < K; ++k) {
                mu[k] = mu_num[k] / max(Nk[k], 1e-16);
            }
            // Recompute variances with the new means
            fill(var_num.begin(), var_num.end(), 0.0);
            for (double xi : x) {
                vector<double> lp(K);
                for (int k = 0; k < K; ++k)
                    lp[k] = log(max(pi[k], 1e-16)) + log_gauss_1d(xi, mu[k], sigma2[k]);
                double lse = logsumexp(lp);
                for (int k = 0; k < K; ++k) {
                    double rik = exp(lp[k] - lse);
                    double diff = xi - mu[k];
                    var_num[k] += rik * diff * diff;
                }
            }
            for (int k = 0; k < K; ++k) {
                sigma2[k] = max(var_num[k] / max(Nk[k], 1e-16), var_floor);
                pi[k] = Nk[k] / (double)x.size();
            }

            double ll = log_likelihood(x);
            // EM should not decrease the log-likelihood (up to numerical noise)
            if (it > 0 && ll < prev_ll - 1e-8)
                cerr << "Warning: LL decreased slightly (" << prev_ll << " -> " << ll << ")\n";
            if (fabs(ll - prev_ll) < tol * (1.0 + fabs(prev_ll))) break; // converged
            prev_ll = ll;
        }
    }
};

int main() {
    // Generate 1D data from a 2-component mixture
    std::mt19937 rng(7);
    normal_distribution<double> n1(-2.0, 0.7); // mean -2, std 0.7
    normal_distribution<double> n2(3.0, 1.2);  // mean 3, std 1.2
    bernoulli_distribution choose_second(0.4); // mixing weight ~0.4 for component 2

    vector<double> data;
    for (int i = 0; i < 1000; ++i) {
        if (choose_second(rng)) data.push_back(n2(rng));
        else data.push_back(n1(rng));
    }

    GMM1D gmm(2);
    gmm.fit(data, 200, 1e-6);

    cout << fixed << setprecision(4);
    cout << "Estimated parameters:\n";
    for (int k = 0; k < gmm.K; ++k) {
        cout << " k=" << k << ": pi=" << gmm.pi[k]
             << ", mu=" << gmm.mu[k]
             << ", sigma2=" << gmm.sigma2[k] << "\n";
    }
    cout << "Log-likelihood: " << gmm.log_likelihood(data) << "\n";
    return 0;
}
```
This code fits a 1D Gaussian Mixture Model using the EM algorithm. Responsibilities are computed in log-space via log-sum-exp for numerical stability, and parameters are updated using streaming accumulators to avoid storing an n×K matrix. Variance floors prevent degeneracy.