📚 Theory · Intermediate

Probability Distributions

Key Points

  • Probability distributions describe how random outcomes are spread across possible values and let us compute probabilities, expectations, and uncertainties.
  • Discrete distributions (e.g., Bernoulli, Binomial, Poisson) count events; continuous ones (e.g., Gaussian, Exponential, Gamma, Beta, Uniform) model real-valued measurements.
  • Key tools include the PMF/PDF (probability mass/density function), the CDF (cumulative distribution function), moments, and the MGF (moment-generating function).
  • Many common distributions belong to the exponential family, which yields convenient sufficient statistics and simplifies estimation.
  • Transformations (e.g., Box–Muller) map simple random variables into complex ones using Jacobians to adjust densities.
  • Mixture models combine several component distributions to fit multimodal data and are learned with algorithms like EM.
  • In AI/ML, Gaussian assumptions power regression, categorical distributions power classification, and Poisson models count data.
  • C++’s <random> library supports efficient sampling; you can also implement custom samplers and estimators for learning.

Prerequisites

  • Probability axioms and conditional probability: Distributions build on the axioms and require understanding of events, independence, and conditional probabilities.
  • Single-variable calculus (integration and differentiation): PDFs integrate to 1, CDFs are integrals of PDFs, and MGFs use expectations of exponentials.
  • Linear algebra (vectors, matrices, determinants): Multivariate Gaussians involve covariance matrices, determinants, and quadratic forms.
  • Numerical stability and floating-point arithmetic: Stable evaluation of log-likelihoods and combinatorial terms avoids overflow/underflow.
  • C++ random number generation (<random>): Practical sampling and simulation rely on engines and distribution objects in C++.

Detailed Explanation


01 Overview

A probability distribution is a mathematical description of how likely different outcomes are. For discrete outcomes (like coin flips), we use a probability mass function (PMF) that assigns probabilities to each possible value. For continuous outcomes (like heights), we use a probability density function (PDF) whose area under the curve over an interval gives the probability of landing in that interval. The cumulative distribution function (CDF) tells us the probability that a random variable is less than or equal to a value. Key numeric summaries, called moments (like mean and variance), describe the center and spread. The moment generating function (MGF) encodes all moments and can simplify proofs and derivations. Distributions come in families suited to different data: Bernoulli and Binomial for yes/no counts, Poisson for rare events per time window, Gaussian for aggregated noise, Exponential and Gamma for waiting times, Beta and Dirichlet for probabilities themselves, and Uniform when everything in a range is equally likely. Many of these fit into the exponential family, which yields elegant properties like sufficient statistics. In machine learning, we often assume data follow certain distributions to build models, estimate parameters from data (e.g., via maximum likelihood), and make predictions. Mixture models combine several distributions to capture complex, multimodal patterns.
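
As a minimal self-contained sketch of these ideas (an illustrative addition, not one of the worked examples below), the snippet builds the PMF of a fair six-sided die, checks that it sums to 1, accumulates the CDF, and derives the variance from the first two moments:

#include <bits/stdc++.h>
using namespace std;

int main() {
    // PMF of a fair six-sided die: p(x) = 1/6 for x = 1..6
    vector<double> pmf(6, 1.0 / 6.0);

    // Normalization check: probabilities must sum to 1
    double total = accumulate(pmf.begin(), pmf.end(), 0.0);

    // CDF, first moment E[X], and second moment E[X^2]
    double cdf = 0.0, ex = 0.0, ex2 = 0.0;
    for (int x = 1; x <= 6; ++x) {
        double p = pmf[x - 1];
        cdf += p;                  // F(x) = P(X <= x)
        ex  += x * p;              // E[X]
        ex2 += double(x) * x * p;  // E[X^2]
        cout << "x=" << x << "  p(x)=" << p << "  F(x)=" << cdf << "\n";
    }
    double var = ex2 - ex * ex;    // Var(X) = E[X^2] - (E[X])^2

    cout << "sum p(x) = " << total           // 1.0
         << ", E[X] = " << ex                // 3.5
         << ", Var(X) = " << var << "\n";    // 35/12 ≈ 2.9167
    return 0;
}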

02 Intuition & Analogies

Imagine tossing a handful of dice. Before rolling, you don’t know the totals you’ll see, but you have a sense of which are common (like 7) and which are rare (like 2 or 12). A probability distribution is that “sense” written as math: it tells you, for every possible outcome, how plausible it is. For discrete outcomes, it’s like a weighted checklist: each item (outcome) has a weight (probability) and the weights add to 1. For continuous outcomes, think of a smooth hill: the hill’s height (density) is not a probability by itself, but the area under the hill over an interval is. Different everyday processes create characteristic shapes. Flipping a coin is Bernoulli; repeating it n times gives a Binomial that looks bell-shaped when n is large. Counting rare events (like daily website outages) fits a Poisson curve. Waiting for the next bus is Exponential if arrivals are memoryless. Measurements with many tiny, independent errors tend to look Gaussian because all the little pushes and pulls add up to a bell curve. When you’re unsure about a probability itself (e.g., “what’s this click-through rate?”), Beta and Dirichlet act like flexible knobs to encode your prior belief. Sometimes one simple hill isn’t enough—think of heights of adults in a crowd containing both children and grownups: two bumps. Mixture models let you add hills (components) together. And if you only know how to sample uniform randomness, transformations (like Box–Muller) are recipes to reshape it into other distributions by stretching and bending, carefully adjusting areas so total probability stays 1.

03 Formal Definition

A probability distribution specifies a random variable X by giving its PMF p(x) for discrete X or PDF f(x) for continuous X, with normalization \(\sum_x p(x) = 1\) or \(\int f(x)\,dx = 1\) over the support. The CDF is \(F(x) = P(X \le x)\). The k-th moment is \(\mathbb{E}[X^k]\); the mean is \(\mu = \mathbb{E}[X]\); the variance is \(\sigma^2 = \mathbb{E}[(X-\mu)^2]\). The moment-generating function (MGF) is \(M_X(t) = \mathbb{E}[e^{tX}]\) when it exists, and it uniquely characterizes many distributions. An exponential-family distribution has density/mass of the form \(p(x \mid \theta) = h(x)\exp\!\left(\eta(\theta)^{\top} T(x) - A(\theta)\right)\), where \(T(x)\) are sufficient statistics, \(\eta(\theta)\) is the natural parameter, and \(A(\theta)\) ensures normalization. Sufficient statistics \(T(x)\) summarize data so that the likelihood of the parameters depends on the data only through \(\sum_i T(x_i)\). A mixture model combines components: \(p(x) = \sum_{k=1}^{K} \pi_k\, p_k(x)\), with mixing weights \(\pi_k \ge 0\), \(\sum_k \pi_k = 1\). Under a variable transformation \(y = g(x)\) (invertible and differentiable), the density changes via \(f_Y(y) = f_X\!\left(g^{-1}(y)\right)\,\left|\det J_{g^{-1}}(y)\right|\), where \(J_{g^{-1}}\) is the Jacobian of the inverse map.
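
To make the exponential-family template concrete, the Bernoulli PMF \(p^x (1-p)^{1-x}\) can be rewritten as \(h(x)\exp(\eta T(x) - A(\eta))\) with \(h(x) = 1\), \(T(x) = x\), natural parameter \(\eta = \log\frac{p}{1-p}\), and log-partition \(A(\eta) = \log(1 + e^{\eta})\). The short sketch below (an illustrative addition) checks numerically that the two forms agree:

#include <bits/stdc++.h>
using namespace std;

int main() {
    double p = 0.3;
    double eta = log(p / (1.0 - p));   // natural parameter
    double A   = log(1.0 + exp(eta));  // log-partition, equals -log(1-p)

    for (int x : {0, 1}) {
        double pmf    = pow(p, x) * pow(1.0 - p, 1 - x); // standard form
        double expfam = exp(eta * x - A);                // h(x)=1, T(x)=x
        cout << "x=" << x
             << "  standard=" << pmf
             << "  exp-family=" << expfam << "\n";       // the values match
    }
    return 0;
}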

04 When to Use

  • Bernoulli/Binomial: Binary outcomes and counts of successes (A/B tests, clicks). Use when each trial is independent with constant success probability.
  • Poisson: Count events in fixed intervals when events occur independently and rarely (call arrivals, defects). Often paired with Exponential for inter-arrival times (see the sketch after this list).
  • Gaussian: When many independent small effects add (sensor noise, regression residuals). The Central Limit Theorem justifies Gaussian approximations.
  • Exponential/Gamma: Model waiting/processing times; Exponential if memoryless, Gamma if you sum multiple exponential stages.
  • Beta/Dirichlet: Uncertainty over probabilities (priors/posteriors in Bayesian models for Bernoulli/Categorical or Multinomial data).
  • Uniform: When everything in [a, b] is equally plausible; also as a building block for other samplers.
  • Multinomial/Categorical: Multi-class outcomes and counts (word categories, class labels).
  • Multivariate Gaussian: Continuous vectors with linear correlations (Kalman filters, Gaussian process approximations).
  • Mixture models (e.g., Gaussian Mixtures): Data with multiple clusters/modes; unsupervised learning and density estimation.
  • Transformations: When you only have uniform randomness or need to reparameterize variables; used in simulation, normalizing flows, and variational inference.
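
As referenced in the Poisson bullet above, the sketch below (an illustrative addition using only standard <random> facilities) draws Exponential(λ) inter-arrival times, counts events per unit-length window, and compares the empirical mean and variance of the counts with the Poisson prediction that both equal λ:

#include <bits/stdc++.h>
using namespace std;

int main() {
    mt19937 rng(7);
    double lambda = 4.0;                          // event rate per unit time
    exponential_distribution<double> gap(lambda); // inter-arrival times

    const int windows = 20000;
    vector<int> counts(windows, 0);

    // Simulate arrivals over [0, windows) and bin them into unit intervals
    double t = gap(rng);
    while (t < windows) {
        counts[(int)t]++;
        t += gap(rng);
    }

    // Empirical mean and variance of counts per window
    double mean = accumulate(counts.begin(), counts.end(), 0.0) / windows;
    double var = 0.0;
    for (int c : counts) var += (c - mean) * (c - mean);
    var /= windows;

    cout << fixed << setprecision(3)
         << "counts per unit window: mean=" << mean
         << " var=" << var
         << " (Poisson predicts both ≈ " << lambda << ")\n";
    return 0;
}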

⚠️Common Mistakes

  • Confusing PDF/PMF with probability: For continuous variables, f(x) can exceed 1; only integrals over intervals are probabilities.
  • Using the wrong parameterization: E.g., mixing up the Exponential rate λ with the scale β = 1/λ, or Beta(α, β) vs. mean/variance forms. Always check definitions.
  • Ignoring support: Evaluating a PDF outside its support (e.g., negative values for an Exponential) or forgetting that Categorical probabilities must sum to 1 leads to invalid models.
  • Violating independence/identical distribution assumptions: Binomial and Poisson models break if trials or events are dependent or rates vary widely over time.
  • Numeric instability: Directly computing factorials or binomial coefficients can overflow. Use logs or stable functions (like lgamma) for large n (see the sketch after this list).
  • Overfitting with mixtures: Letting too many components fit noise; mitigate with regularization, priors, or model selection (BIC/AIC).
  • Misinterpreting CIs and MGFs: Confidence intervals are not probabilities of parameters, and MGFs may not exist for all distributions (e.g., heavy tails).
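
As flagged in the numeric-stability bullet, log-space evaluation sidesteps overflow. The sketch below (illustrative; binomial_log_pmf is a helper introduced here, not a standard function) computes a Binomial log-PMF with std::lgamma instead of factorials, so it stays finite even for n in the thousands:

#include <bits/stdc++.h>
using namespace std;

// log P(X = k) for X ~ Binomial(n, p), computed entirely in log space
double binomial_log_pmf(long long n, long long k, double p) {
    // log C(n, k) = lgamma(n+1) - lgamma(k+1) - lgamma(n-k+1)
    double log_coeff = lgamma((double)n + 1.0)
                     - lgamma((double)k + 1.0)
                     - lgamma((double)(n - k) + 1.0);
    return log_coeff + k * log(p) + (n - k) * log1p(-p);
}

int main() {
    long long n = 10000, k = 4980;
    double p = 0.5;
    double logp = binomial_log_pmf(n, k, p);
    cout << "log P(X=" << k << ") = " << logp
         << "  (P itself = " << exp(logp) << ")\n";
    // Naively computing 10000! would overflow long before this point.
    return 0;
}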

Key Formulas

PMF normalization

\(\sum_{x} p(x) = 1,\quad p(x) \ge 0\)

Explanation: Discrete probabilities must be nonnegative and sum to 1 across the support. This ensures a valid probability assignment.

PDF normalization

\(\int_{-\infty}^{\infty} f(x)\,dx = 1,\quad f(x) \ge 0\)

Explanation: The total area under a continuous density curve equals 1. Only integrals over intervals represent probabilities.

CDF

\(F(x) = P(X \le x) = \sum_{t \le x} p(t)\) (discrete) or \(\int_{-\infty}^{x} f(t)\,dt\) (continuous)

Explanation: The cumulative probability up to x is the sum (discrete) or integral (continuous) of the distribution. It is useful for thresholding and quantiles.

Expectation

\(\mathbb{E}[X] = \sum_{x} x\,p(x)\) or \(\int x\,f(x)\,dx\)

Explanation: The mean is the probability-weighted average of possible values. It summarizes the central tendency.

Variance via moments

\(\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\)

Explanation: Variance equals the second moment minus the square of the mean. It measures spread around the average.

MGF definition

\(M_X(t) = \mathbb{E}[e^{tX}],\qquad \mathbb{E}[X^k] = M_X^{(k)}(0)\)

Explanation: The moment-generating function, when it exists, compactly encodes all moments. Differentiating k times at t = 0 yields the k-th moment.

Binomial

\(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k},\quad \mathbb{E}[X] = np,\quad \operatorname{Var}(X) = np(1-p)\)

Explanation: Counts successes in n Bernoulli trials. The mean and variance grow linearly with n.

Poisson

\(P(X = k) = \dfrac{\lambda^k e^{-\lambda}}{k!},\quad \mathbb{E}[X] = \operatorname{Var}(X) = \lambda\)

Explanation: Models rare event counts with rate λ. Mean equals variance, a useful diagnostic property.

Bernoulli

\(P(X = 1) = p,\quad P(X = 0) = 1 - p,\quad \mathbb{E}[X] = p,\quad \operatorname{Var}(X) = p(1-p)\)

Explanation: A single binary trial with success probability p. Often used for labels and switches.

Gaussian

\(f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)\)

Explanation: Bell-shaped density specified by mean μ and variance σ². Central in statistics and ML.

Exponential

\(f(x) = \lambda e^{-\lambda x},\ x \ge 0,\quad \mathbb{E}[X] = 1/\lambda\)

Explanation: Models memoryless waiting times with rate λ. The mean waiting time is 1/λ.

Gamma

\(f(x) = \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x},\ x > 0\)

Explanation: Flexible family for positive-valued data; sums of independent exponential variables are gamma distributed.

Beta

\(f(x) = \dfrac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},\ x \in [0,1]\)

Explanation: Distribution over probabilities on [0, 1]. Useful as a prior for Bernoulli parameters.

Uniform

\(f(x) = \dfrac{1}{b-a},\ x \in [a,b]\)

Explanation: Every value in [a, b] is equally likely. A common base for simulation.

Exponential-family form

\(p(x \mid \theta) = h(x)\exp\!\left(\eta(\theta)^{\top} T(x) - A(\theta)\right)\)

Explanation: Unified template covering many common distributions. A(θ) ensures normalization; T(x) are the sufficient statistics.

Change of variables

\(f_Y(y) = f_X\!\left(g^{-1}(y)\right)\,\left|\det J_{g^{-1}}(y)\right|\) for invertible, differentiable \(y = g(x)\)

Explanation: When transforming random variables with an invertible differentiable map g, densities scale by the absolute Jacobian determinant to conserve probability.

Mixture model

\(p(x) = \sum_{k=1}^{K} \pi_k\, p_k(x),\quad \pi_k \ge 0,\quad \sum_{k=1}^{K} \pi_k = 1\)

Explanation: A weighted sum of component densities creates multimodal shapes. Used for clustering and flexible density estimation.

Multivariate Gaussian

\(f(\mathbf{x}) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\)

Explanation: Generalizes the normal to vectors, with covariance matrix Σ capturing correlations among dimensions.

Dirichlet

\(f(\mathbf{p}) = \dfrac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^{K} p_k^{\alpha_k - 1},\quad \sum_{k} p_k = 1\)

Explanation: A distribution over probability vectors on the simplex. Conjugate prior for multinomial parameters.

Multinomial

\(P(X_1 = n_1, \dots, X_K = n_K) = \dfrac{n!}{n_1!\cdots n_K!} \prod_{k=1}^{K} p_k^{n_k}\)

Explanation: Counts per category over n independent draws. Generalizes the binomial to K classes.
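
Neither the Dirichlet nor the Multinomial has a dedicated class in C++ <random>, but both entries above are easy to sample with standard building blocks (an illustrative sketch): a Dirichlet draw normalizes independent Gamma(α_k, 1) variates, and a Multinomial draw tallies repeated categorical samples from std::discrete_distribution.

#include <bits/stdc++.h>
using namespace std;

int main() {
    mt19937 rng(99);

    // Dirichlet(alpha): draw g_k ~ Gamma(alpha_k, 1) and normalize
    vector<double> alpha = {2.0, 3.0, 5.0};
    vector<double> g(alpha.size());
    double sum = 0.0;
    for (size_t k = 0; k < alpha.size(); ++k) {
        gamma_distribution<double> G(alpha[k], 1.0);
        g[k] = G(rng);
        sum += g[k];
    }
    cout << "Dirichlet sample: ";
    for (double v : g) cout << v / sum << " ";
    cout << "\n";

    // Multinomial(n, p): n categorical draws, tallied per category
    vector<double> p = {0.2, 0.3, 0.5};
    discrete_distribution<int> cat(p.begin(), p.end());
    vector<int> counts(p.size(), 0);
    int n = 1000;
    for (int i = 0; i < n; ++i) counts[cat(rng)]++;
    cout << "Multinomial counts: ";
    for (int c : counts) cout << c << " ";
    cout << "\n";
    return 0;
}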

Complexity Analysis

Probability operations decompose into three main tasks: evaluating likelihoods/densities, sampling, and parameter estimation.

Evaluating a univariate PMF/PDF is typically O(1) time per point (e.g., Gaussian, Exponential). Care is needed for combinatorial terms (binomial coefficients), which should be computed with stable functions (like lgamma) to avoid O(n) factorial computations and overflow. CDFs may require special functions or numerical integration; standard libraries provide O(1) approximations for common distributions.

Sampling via the C++ <random> library uses efficient algorithms: inversion, rejection, or transformation methods. Each sample is generated in expected O(1) time for typical distributions (Bernoulli, Poisson, normal). Custom methods like Box–Muller produce two normals in constant time per pair. Transformations that require Jacobians do not add asymptotic cost beyond function evaluation.

Parameter estimation depends on the model. For single-parameter exponential-family distributions with i.i.d. data, maximum likelihood reduces to computing sufficient statistics (e.g., sums), which is O(n) time and O(1) space in a single pass. For mixture models trained with Expectation–Maximization (EM), each iteration requires computing responsibilities for each data point and component, yielding O(nK) time and O(nK) space if responsibilities are stored; the number of iterations I multiplies the cost to O(nKI). In multivariate Gaussians of dimension d, evaluating the log-density involves matrix inverses or Cholesky decompositions: precomputing Σ^{-1} and log det Σ costs O(d^3), while per-point evaluation is O(d^2). Memory usage is dominated by storing data (O(nd)), parameters (O(Kd^2) for full-covariance mixtures), and any cached sufficient statistics (typically O(Kd) to O(Kd^2)).
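
To illustrate the O(d^3) / O(d^2) split mentioned above, here is a compact sketch with a hand-rolled Cholesky factorization (an illustrative addition; a real project would typically use a linear-algebra library): the factorization of Σ is computed once, and each log-density evaluation then needs only a forward substitution.

#include <bits/stdc++.h>
using namespace std;

// Cholesky factorization Sigma = L * L^T (lower-triangular L); O(d^3), done once
vector<vector<double>> cholesky(const vector<vector<double>>& S) {
    int d = (int)S.size();
    vector<vector<double>> L(d, vector<double>(d, 0.0));
    for (int i = 0; i < d; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = S[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            L[i][j] = (i == j) ? sqrt(sum) : sum / L[j][j];
        }
    }
    return L;
}

// log N(x | mu, Sigma) using the precomputed factor; O(d^2) per point
double mvn_logpdf(const vector<double>& x, const vector<double>& mu,
                  const vector<vector<double>>& L) {
    int d = (int)x.size();
    // Solve L y = (x - mu) by forward substitution; then the quadratic form is ||y||^2
    vector<double> y(d);
    for (int i = 0; i < d; ++i) {
        double s = x[i] - mu[i];
        for (int k = 0; k < i; ++k) s -= L[i][k] * y[k];
        y[i] = s / L[i][i];
    }
    double quad = 0.0, logdet = 0.0;
    for (int i = 0; i < d; ++i) {
        quad += y[i] * y[i];
        logdet += 2.0 * log(L[i][i]);   // log det Sigma = 2 * sum_i log L_ii
    }
    return -0.5 * (d * log(2.0 * M_PI) + logdet + quad);
}

int main() {
    vector<double> mu = {1.0, -2.0};
    vector<vector<double>> Sigma = {{2.0, 0.6}, {0.6, 1.0}};
    auto L = cholesky(Sigma);           // O(d^3), amortized over many evaluations

    vector<double> x = {0.5, -1.5};
    cout << "log-density at x: " << mvn_logpdf(x, mu, L) << "\n";
    return 0;
}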

Code Examples

Sampling and empirical moments for common distributions
#include <bits/stdc++.h>
using namespace std;

struct Moments {
    double mean{0.0};
    double var{0.0};
};

// Compute empirical mean and variance from samples
Moments empirical_moments(const vector<double>& x) {
    Moments m;
    if (x.empty()) return m;
    double s = accumulate(x.begin(), x.end(), 0.0);
    m.mean = s / x.size();
    double ss = 0.0;
    for (double v : x) ss += (v - m.mean) * (v - m.mean);
    m.var = ss / x.size(); // population variance
    return m;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    std::mt19937 rng(12345); // fixed seed for reproducibility
    const int N = 100000;    // number of samples per distribution

    // 1) Bernoulli(p)
    double p = 0.3;
    bernoulli_distribution bern(p);
    vector<double> bern_samp; bern_samp.reserve(N);
    for (int i = 0; i < N; ++i) bern_samp.push_back(bern(rng) ? 1.0 : 0.0);
    auto bern_m = empirical_moments(bern_samp);
    cout << fixed << setprecision(4);
    cout << "Bernoulli(p=0.3): mean(emp)=" << bern_m.mean
         << " var(emp)=" << bern_m.var
         << " | mean(th)=" << p << " var(th)=" << p*(1-p) << "\n";

    // 2) Binomial(n,p)
    int n = 10; double pb = 0.6;
    binomial_distribution<int> binom(n, pb);
    vector<double> bin_samp; bin_samp.reserve(N);
    for (int i = 0; i < N; ++i) bin_samp.push_back(static_cast<double>(binom(rng)));
    auto bin_m = empirical_moments(bin_samp);
    cout << "Binomial(n=10,p=0.6): mean(emp)=" << bin_m.mean
         << " var(emp)=" << bin_m.var
         << " | mean(th)=" << n*pb << " var(th)=" << n*pb*(1-pb) << "\n";

    // 3) Poisson(λ)
    double lambda = 3.5;
    poisson_distribution<int> pois(lambda);
    vector<double> poi_samp; poi_samp.reserve(N);
    for (int i = 0; i < N; ++i) poi_samp.push_back(static_cast<double>(pois(rng)));
    auto poi_m = empirical_moments(poi_samp);
    cout << "Poisson(λ=3.5): mean(emp)=" << poi_m.mean
         << " var(emp)=" << poi_m.var
         << " | mean(th)=" << lambda << " var(th)=" << lambda << "\n";

    // 4) Normal(μ,σ)
    double mu = 2.0, sigma = 1.5;
    normal_distribution<double> norm(mu, sigma);
    vector<double> nor_samp; nor_samp.reserve(N);
    for (int i = 0; i < N; ++i) nor_samp.push_back(norm(rng));
    auto nor_m = empirical_moments(nor_samp);
    cout << "Normal(μ=2,σ=1.5): mean(emp)=" << nor_m.mean
         << " var(emp)=" << nor_m.var
         << " | mean(th)=" << mu << " var(th)=" << sigma*sigma << "\n";

    // 5) Exponential(λ)
    double lam = 2.0; // rate
    exponential_distribution<double> expo(lam);
    vector<double> exp_samp; exp_samp.reserve(N);
    for (int i = 0; i < N; ++i) exp_samp.push_back(expo(rng));
    auto exp_m = empirical_moments(exp_samp);
    cout << "Exponential(λ=2): mean(emp)=" << exp_m.mean
         << " var(emp)=" << exp_m.var
         << " | mean(th)=" << 1.0/lam << " var(th)=" << 1.0/(lam*lam) << "\n";

    // 6) Uniform(a,b)
    double a = -1.0, b = 3.0;
    uniform_real_distribution<double> uni(a, b);
    vector<double> uni_samp; uni_samp.reserve(N);
    for (int i = 0; i < N; ++i) uni_samp.push_back(uni(rng));
    auto uni_m = empirical_moments(uni_samp);
    cout << "Uniform(a=-1,b=3): mean(emp)=" << uni_m.mean
         << " var(emp)=" << uni_m.var
         << " | mean(th)=" << (a+b)/2.0 << " var(th)=" << (b-a)*(b-a)/12.0 << "\n";

    return 0;
}

Uses C++ <random> to sample from several common distributions and compares empirical mean/variance to theoretical values. This demonstrates PMF/PDF-driven sampling and moments as summary statistics.

Time: O(N) per distribution sampled, where N is the number of samples.
Space: O(N) to store samples; can be reduced to O(1) by streaming moments.
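
As noted in the space estimate above, the O(N) sample buffer can be avoided by streaming the moments. Below is a minimal sketch of Welford's single-pass update (an illustrative addition; RunningMoments is a helper defined here):

#include <bits/stdc++.h>
using namespace std;

// Welford's online algorithm: one pass, O(1) memory, numerically stable
struct RunningMoments {
    long long n = 0;
    double mean = 0.0;
    double m2 = 0.0;   // sum of squared deviations from the running mean

    void push(double x) {
        ++n;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }
    double variance() const { return n > 0 ? m2 / n : 0.0; } // population variance
};

int main() {
    mt19937 rng(12345);
    normal_distribution<double> norm(2.0, 1.5);

    RunningMoments rm;
    for (int i = 0; i < 100000; ++i) rm.push(norm(rng)); // no sample vector needed

    cout << fixed << setprecision(4)
         << "streaming mean=" << rm.mean
         << " var=" << rm.variance()
         << " | theory mean=2.0 var=2.25\n";
    return 0;
}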
Box–Muller transform and change-of-variables check
#include <bits/stdc++.h>
using namespace std;

// Generate N(0,1) using Box–Muller from two independent U(0,1)
pair<double,double> box_muller(double u1, double u2) {
    double R = sqrt(-2.0 * log(max(1e-12, u1))); // avoid log(0)
    double theta = 2.0 * M_PI * u2;
    return {R * cos(theta), R * sin(theta)};     // two independent normals
}

// Standard normal PDF
inline double phi(double x) {
    static const double inv_sqrt_2pi = 1.0 / sqrt(2.0 * M_PI);
    return inv_sqrt_2pi * exp(-0.5 * x * x);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    mt19937 rng(42);
    uniform_real_distribution<double> U(0.0, 1.0);

    // Sample many normals via Box–Muller
    const int N = 100000;
    vector<double> Z; Z.reserve(2*N);
    for (int i = 0; i < N; ++i) {
        auto u1 = U(rng); auto u2 = U(rng);
        auto [z1, z2] = box_muller(u1, u2);
        Z.push_back(z1); Z.push_back(z2);
    }

    // Affine transform: Y = a*Z + b, with Jacobian 1/|a|
    double a = 1.7, b = -0.5;
    vector<double> Y; Y.reserve(Z.size());
    for (double z : Z) Y.push_back(a*z + b);

    // Compare empirical density of Y vs. analytic change-of-variables
    int bins = 60; double ymin = -6.0, ymax = 6.0;
    vector<int> hist(bins, 0);
    for (double y : Y) {
        if (y < ymin || y > ymax) continue;
        int idx = int((y - ymin) / (ymax - ymin) * bins);
        if (idx == bins) idx = bins - 1;
        hist[idx]++;
    }

    cout << fixed << setprecision(4);
    cout << "bin_center\tempirical_pdf\tanalytic_pdf\n";
    for (int i = 0; i < bins; ++i) {
        double center = ymin + (i + 0.5) * (ymax - ymin) / bins;
        double width = (ymax - ymin) / bins;
        double emp_pdf = hist[i] / (double)Y.size() / width; // count / (N * bin width)
        // Analytic: if Z~N(0,1), then Y=aZ+b ~ N(b, a^2)
        double analytic = (1.0 / fabs(a)) * phi((center - b) / a);
        cout << center << "\t" << emp_pdf << "\t" << analytic << "\n";
    }

    return 0;
}

Generates standard normals via the Box–Muller transform (a change-of-variables construction). Then applies an affine transformation and compares a histogram-based empirical PDF to the analytic density obtained by the Jacobian rule, verifying f_Y(y)=f_Z((y−b)/a)/|a|.

Time: O(N) to generate 2N normals and O(N) to bin them.
Space: O(N) to store samples; the histogram alone is O(bins).
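
Inversion is another change-of-variables construction in the same spirit as Box–Muller: mapping U(0,1) through the inverse CDF of the target distribution yields samples from it. The sketch below (an illustrative addition) uses the Exponential inverse CDF, \(F^{-1}(u) = -\ln(1-u)/\lambda\), and checks the empirical mean against \(1/\lambda\):

#include <bits/stdc++.h>
using namespace std;

int main() {
    mt19937 rng(123);
    uniform_real_distribution<double> U(0.0, 1.0);

    double lambda = 2.0;   // rate
    const int N = 100000;
    double sum = 0.0;

    for (int i = 0; i < N; ++i) {
        double u = U(rng);
        // Inverse CDF of Exponential(lambda): F^{-1}(u) = -log(1 - u) / lambda
        double x = -log1p(-u) / lambda; // log1p avoids precision loss for small u
        sum += x;
    }

    cout << fixed << setprecision(4)
         << "empirical mean=" << sum / N
         << " | theoretical mean=" << 1.0 / lambda << "\n";
    return 0;
}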
EM algorithm for a 1D Gaussian Mixture Model (K=2)
#include <bits/stdc++.h>
using namespace std;

struct GMM1D {
    vector<double> mu;  // means
    vector<double> var; // variances
    vector<double> pi;  // mixing weights
};

inline double normal_pdf(double x, double mu, double var) {
    double coeff = 1.0 / sqrt(2.0 * M_PI * var);
    double z = (x - mu);
    return coeff * exp(-0.5 * z * z / var);
}

// Log-sum-exp for numerical stability: log(sum_i exp(a_i))
double log_sum_exp(const vector<double>& a) {
    double m = *max_element(a.begin(), a.end());
    double s = 0.0;
    for (double v : a) s += exp(v - m);
    return m + log(s);
}

// One EM iteration: updates parameters in-place, returns log-likelihood
double em_step(const vector<double>& x, GMM1D& model) {
    int n = (int)x.size();
    int K = (int)model.mu.size();

    // E-step: responsibilities gamma[i][k] ∝ pi_k * N(x_i | mu_k, var_k)
    vector<vector<double>> gamma(n, vector<double>(K));
    double loglik = 0.0;
    for (int i = 0; i < n; ++i) {
        vector<double> logs(K);
        for (int k = 0; k < K; ++k) {
            double coeff = -0.5 * log(2.0 * M_PI * model.var[k]);
            double z = x[i] - model.mu[k];
            double quad = -0.5 * (z * z) / model.var[k];
            logs[k] = log(model.pi[k]) + (coeff + quad);
        }
        double lse = log_sum_exp(logs);
        loglik += lse;
        for (int k = 0; k < K; ++k) gamma[i][k] = exp(logs[k] - lse);
    }

    // M-step: update pi, mu, var using responsibilities
    vector<double> Nk(K, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < K; ++k)
            Nk[k] += gamma[i][k];

    // Avoid degenerate variances by adding a small epsilon
    const double eps = 1e-6;

    for (int k = 0; k < K; ++k) {
        model.pi[k] = max(eps, Nk[k] / n);
        // mean
        double num = 0.0;
        for (int i = 0; i < n; ++i) num += gamma[i][k] * x[i];
        model.mu[k] = num / max(eps, Nk[k]);
        // variance
        double vnum = 0.0;
        for (int i = 0; i < n; ++i) {
            double d = x[i] - model.mu[k];
            vnum += gamma[i][k] * d * d;
        }
        model.var[k] = max(eps, vnum / max(eps, Nk[k]));
    }

    // Normalize mixing weights to sum to 1
    double pis = accumulate(model.pi.begin(), model.pi.end(), 0.0);
    for (double &w : model.pi) w /= pis;

    return loglik;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate synthetic data from a 2-component GMM
    mt19937 rng(2024);
    normal_distribution<double> n1(-2.0, 0.7); // mean=-2,   sd=0.7
    normal_distribution<double> n2( 2.5, 1.0); // mean= 2.5, sd=1.0
    bernoulli_distribution choose2(0.4);       // 40% from component 2

    int N = 2000;
    vector<double> x; x.reserve(N);
    for (int i = 0; i < N; ++i) {
        if (choose2(rng)) x.push_back(n2(rng));
        else x.push_back(n1(rng));
    }

    // Initialize model (poorly) to test EM convergence
    GMM1D model;
    model.mu  = {-4.0, 3.5};
    model.var = {1.0, 1.0};
    model.pi  = {0.5, 0.5};

    // Run EM
    double last = -numeric_limits<double>::infinity();
    for (int it = 0; it < 50; ++it) {
        double ll = em_step(x, model);
        cerr << "iter=" << it << " loglik=" << ll
             << " | mu=[" << model.mu[0] << ", " << model.mu[1] << "]"
             << " var=[" << model.var[0] << ", " << model.var[1] << "]"
             << " pi=[" << model.pi[0] << ", " << model.pi[1] << "]\n";
        if (ll - last < 1e-6) break; // convergence check
        last = ll;
    }

    // Hard-assign each point to the component with the larger posterior weight
    // (no ground-truth labels are available for the synthetic mixture)
    int assigned_to_second = 0;
    for (double xi : x) {
        double p0 = model.pi[0] * normal_pdf(xi, model.mu[0], model.var[0]);
        double p1 = model.pi[1] * normal_pdf(xi, model.mu[1], model.var[1]);
        if (p1 > p0) ++assigned_to_second;
    }

    cout << fixed << setprecision(4);
    cout << "Learned GMM params:\n"
         << "mu: "  << model.mu[0]  << ", " << model.mu[1]  << "\n"
         << "var: " << model.var[0] << ", " << model.var[1] << "\n"
         << "pi: "  << model.pi[0]  << ", " << model.pi[1]  << "\n"
         << "points assigned to second component: " << assigned_to_second
         << " / " << N << "\n";

    return 0;
}

Implements the Expectation–Maximization algorithm for a 1D two-component Gaussian mixture. It generates synthetic data, initializes parameters, alternates E- and M-steps with log-sum-exp stabilization, and prints convergence diagnostics and learned parameters.

Time: O(NK) per EM iteration; with K=2 this is O(N). Over I iterations, O(NI).
Space: O(NK) for responsibilities; parameters require O(K).
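
Once a mixture is fitted, new data can be generated exactly as the model definition suggests: pick a component k with probability π_k, then sample from that component's Gaussian. A small sketch of this ancestral sampling (an illustrative addition; the parameters are hard-coded stand-ins resembling the fitted values above):

#include <bits/stdc++.h>
using namespace std;

int main() {
    mt19937 rng(7);

    // Example GMM parameters (hard-coded for illustration; in practice take
    // them from the fitted model above)
    vector<double> pi = {0.6, 0.4};    // mixing weights
    vector<double> mu = {-2.0, 2.5};   // component means
    vector<double> sd = { 0.7, 1.0};   // component standard deviations

    discrete_distribution<int> pick(pi.begin(), pi.end()); // sample component index
    vector<normal_distribution<double>> comp = {
        normal_distribution<double>(mu[0], sd[0]),
        normal_distribution<double>(mu[1], sd[1])
    };

    // Ancestral sampling: p(x) = sum_k pi_k * N(x | mu_k, sd_k^2)
    const int N = 10;
    for (int i = 0; i < N; ++i) {
        int k = pick(rng);        // which component generated this point
        double x = comp[k](rng);  // draw from that component
        cout << "k=" << k << "  x=" << x << "\n";
    }
    return 0;
}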