Probability Distributions
Key Points
- Probability distributions describe how random outcomes are spread across possible values and let us compute probabilities, expectations, and uncertainties.
- Discrete distributions (e.g., Bernoulli, Binomial, Poisson) count events; continuous ones (e.g., Gaussian, Exponential, Gamma, Beta, Uniform) model real-valued measurements.
- Key tools include the PMF/PDF (probability mass/density), CDF (cumulative), moments, and MGF (moment-generating function).
- Many common distributions belong to the exponential family, which yields convenient sufficient statistics and simplifies estimation.
- Transformations (e.g., Box–Muller) map simple random variables into complex ones, using Jacobians to adjust densities.
- Mixture models combine several component distributions to fit multimodal data and are learned with algorithms like EM.
- In AI/ML, Gaussian assumptions power regression, categorical distributions power classification, and Poisson models handle count data.
- C++’s <random> library supports efficient sampling; you can also implement custom samplers and estimators for learning.
Prerequisites
- Probability axioms and conditional probability — Distributions build on the axioms and require understanding of events, independence, and conditional probabilities.
- Single-variable calculus (integration and differentiation) — PDFs integrate to 1, CDFs are integrals of PDFs, and MGFs use expectations of exponentials.
- Linear algebra (vectors, matrices, determinants) — Multivariate Gaussians involve covariance matrices, determinants, and quadratic forms.
- Numerical stability and floating-point arithmetic — Stable evaluation of log-likelihoods and combinatorial terms avoids overflow/underflow.
- C++ random number generation (<random>) — Practical sampling and simulation rely on engines and distribution objects in C++.
Detailed Explanation
01 Overview
A probability distribution is a mathematical description of how likely different outcomes are. For discrete outcomes (like coin flips), we use a probability mass function (PMF) that assigns probabilities to each possible value. For continuous outcomes (like heights), we use a probability density function (PDF) whose area under the curve over an interval gives the probability of landing in that interval. The cumulative distribution function (CDF) tells us the probability that a random variable is less than or equal to a value. Key numeric summaries, called moments (like mean and variance), describe the center and spread. The moment generating function (MGF) encodes all moments and can simplify proofs and derivations. Distributions come in families suited to different data: Bernoulli and Binomial for yes/no counts, Poisson for rare events per time window, Gaussian for aggregated noise, Exponential and Gamma for waiting times, Beta and Dirichlet for probabilities themselves, and Uniform when everything in a range is equally likely. Many of these fit into the exponential family, which yields elegant properties like sufficient statistics. In machine learning, we often assume data follow certain distributions to build models, estimate parameters from data (e.g., via maximum likelihood), and make predictions. Mixture models combine several distributions to capture complex, multimodal patterns.
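To make these definitions concrete before the full examples further below, here is a minimal, self-contained C++ sketch that evaluates a Gaussian PDF and CDF and turns the CDF into an interval probability. The helper names norm_pdf/norm_cdf and the parameter values are illustrative, not part of any library.

#include <cmath>
#include <cstdio>

// Gaussian density and cumulative distribution (CDF via std::erfc).
double norm_pdf(double x, double mu, double sigma) {
    static const double two_pi = 2.0 * std::acos(-1.0);
    double z = (x - mu) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(two_pi));
}
double norm_cdf(double x, double mu, double sigma) {
    return 0.5 * std::erfc(-(x - mu) / (sigma * std::sqrt(2.0)));
}

int main() {
    double mu = 2.0, sigma = 1.5;   // arbitrary example parameters
    double a = 1.0, b = 3.0;
    // The density value itself is not a probability; the area under it over [a, b] is.
    double p_ab = norm_cdf(b, mu, sigma) - norm_cdf(a, mu, sigma);
    std::printf("pdf at the mean      = %.4f\n", norm_pdf(mu, mu, sigma));
    std::printf("P(%.1f <= X <= %.1f) = %.4f\n", a, b, p_ab);
    return 0;
}

A sampling-based version of the same quantities (empirical moments instead of closed-form integrals) appears in the first full code example at the end of this section.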
02 Intuition & Analogies
Imagine tossing a handful of dice. Before rolling, you don’t know the totals you’ll see, but you have a sense of which are common (like 7) and which are rare (like 2 or 12). A probability distribution is that “sense” written as math: it tells you, for every possible outcome, how plausible it is. For discrete outcomes, it’s like a weighted checklist: each item (outcome) has a weight (probability) and the weights add to 1. For continuous outcomes, think of a smooth hill: the hill’s height (density) is not a probability by itself, but the area under the hill over an interval is. Different everyday processes create characteristic shapes. Flipping a coin is Bernoulli; repeating it n times gives a Binomial that looks bell-shaped when n is large. Counting rare events (like daily website outages) fits a Poisson curve. Waiting for the next bus is Exponential if arrivals are memoryless. Measurements with many tiny, independent errors tend to look Gaussian because all the little pushes and pulls add up to a bell curve. When you’re unsure about a probability itself (e.g., “what’s this click-through rate?”), Beta and Dirichlet act like flexible knobs to encode your prior belief. Sometimes one simple hill isn’t enough—think of heights of adults in a crowd containing both children and grownups: two bumps. Mixture models let you add hills (components) together. And if you only know how to sample uniform randomness, transformations (like Box–Muller) are recipes to reshape it into other distributions by stretching and bending, carefully adjusting areas so total probability stays 1.
03 Formal Definition
A discrete random variable X takes values in a countable set and is described by its PMF \( p(x) = P(X = x) \), with \( p(x) \ge 0 \) and \( \sum_x p(x) = 1 \). A continuous random variable is described by a PDF \( f \) with \( f(x) \ge 0 \), \( \int_{-\infty}^{\infty} f(x)\,dx = 1 \), and \( P(a \le X \le b) = \int_a^b f(x)\,dx \). In both cases the CDF is \( F(x) = P(X \le x) \), the k-th moment is \( \mathbb{E}[X^k] \), and the MGF, when it exists, is \( M_X(t) = \mathbb{E}[e^{tX}] \).
04 When to Use
- Bernoulli/Binomial: Binary outcomes and counts of successes (A/B tests, clicks). Use when each trial is independent with constant success probability.
- Poisson: Count events in fixed intervals when events occur independently and rarely (call arrivals, defects). Often paired with Exponential for inter-arrival times.
- Gaussian: When many independent small effects add (sensor noise, regression residuals). The Central Limit Theorem justifies Gaussian approximations.
- Exponential/Gamma: Model waiting/processing times; Exponential if memoryless, Gamma if you sum multiple exponential stages.
- Beta/Dirichlet: Uncertainty over probabilities (priors/posteriors in Bayesian models for Bernoulli/Categorical or Multinomial data); see the conjugate-update sketch after this list.
- Uniform: When everything in [a, b] is equally plausible; also as a building block for other samplers.
- Multinomial/Categorical: Multi-class outcomes and counts (word categories, class labels).
- Multivariate Gaussian: Continuous vectors with linear correlations (Kalman filters, Gaussian process approximations).
- Mixture models (e.g., Gaussian Mixtures): Data with multiple clusters/modes; unsupervised learning and density estimation.
- Transformations: When you only have uniform randomness or need to reparameterize variables; used in simulation, normalizing flows, and variational inference.
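Illustrating the Beta/Dirichlet item above, here is a minimal sketch of the standard Beta–Bernoulli conjugate update: a Beta(alpha, beta) prior on p combined with s successes in m trials gives a Beta(alpha + s, beta + m − s) posterior. The prior pseudo-counts and the click counts are made-up numbers used only for illustration.

#include <cstdio>

int main() {
    double alpha = 2.0, beta = 2.0;   // hypothetical prior pseudo-counts
    int m = 40, s = 9;                // hypothetical A/B-test data: 9 clicks in 40 trials
    double a_post = alpha + s;
    double b_post = beta + (m - s);
    double prior_mean = alpha / (alpha + beta);
    double post_mean  = a_post / (a_post + b_post);
    std::printf("prior mean of p    = %.4f\n", prior_mean);
    std::printf("posterior Beta(%.1f, %.1f), mean = %.4f\n", a_post, b_post, post_mean);
    // With more data the posterior mean approaches the empirical rate s/m.
    std::printf("empirical rate s/m = %.4f\n", double(s) / m);
    return 0;
}

Because the Beta prior is conjugate to the Bernoulli likelihood, the update is just an addition of counts, which is why this pairing is so common in A/B testing.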
⚠️ Common Mistakes
- Confusing PDF/PMF with probability: For continuous variables, f(x) can exceed 1; only integrals over intervals are probabilities.
- Using the wrong parameterization: E.g., mixing up Exponential rate (\lambda) with scale (\beta = 1/\lambda), or Beta(\alpha,\beta) vs. mean/variance forms. Always check definitions.
- Ignoring support: Evaluating a PDF outside its support (e.g., negative values for an Exponential) or forgetting that Categorical probabilities must sum to 1 leads to invalid models.
- Violating independence/identical distribution assumptions: Binomial and Poisson models break if trials or events are dependent or rates vary widely over time.
- Numeric instability: Directly computing factorials or binomial coefficients can overflow. Use logs or stable functions (like lgamma) for large n; see the sketch after this list.
- Overfitting with mixtures: Letting too many components fit noise; mitigate with regularization, priors, or model selection (BIC/AIC).
- Misinterpreting CIs and MGFs: A confidence interval is not a probability statement about a fixed parameter, and MGFs may not exist for all distributions (e.g., heavy-tailed ones).
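Illustrating the numeric-stability point above, a small sketch that evaluates a Binomial log-PMF through std::lgamma rather than explicit factorials; the values of n, k, and p are arbitrary.

#include <cmath>
#include <cstdio>

// Log of the Binomial PMF using lgamma for the binomial coefficient,
// so large n does not overflow intermediate factorials.
double log_binom_pmf(long long n, long long k, double p) {
    double log_coeff = std::lgamma(double(n) + 1.0)
                     - std::lgamma(double(k) + 1.0)
                     - std::lgamma(double(n - k) + 1.0);
    return log_coeff + k * std::log(p) + (n - k) * std::log1p(-p);
}

int main() {
    long long n = 1000000, k = 400000;  // n! here would overflow any fixed-width type
    double p = 0.3;
    double lp = log_binom_pmf(n, k, p);
    std::printf("log P(X = %lld) = %.2f; exp(log P) underflows to %g in double precision\n",
                k, lp, std::exp(lp));
    return 0;
}

Working in log space like this is also the standard way to evaluate log-likelihoods for maximum-likelihood estimation.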
Key Formulas
PMF normalization
\( p(x) \ge 0, \quad \sum_{x} p(x) = 1 \)
Explanation: Discrete probabilities must be nonnegative and sum to 1 across the support. This ensures a valid probability assignment.
PDF normalization
\( f(x) \ge 0, \quad \int_{-\infty}^{\infty} f(x)\,dx = 1 \)
Explanation: The total area under a continuous density curve equals 1. Only integrals over intervals represent probabilities.
CDF
\( F(x) = P(X \le x) = \sum_{t \le x} p(t) \) (discrete) or \( F(x) = \int_{-\infty}^{x} f(t)\,dt \) (continuous)
Explanation: The cumulative probability up to x is the sum (discrete) or integral (continuous) of the distribution. It is useful for thresholding and quantiles.
Expectation
\( \mathbb{E}[X] = \sum_{x} x\,p(x) \) (discrete) or \( \mathbb{E}[X] = \int x\,f(x)\,dx \) (continuous)
Explanation: The mean is the probability-weighted average of possible values. It summarizes the central tendency.
Variance via moments
\( \operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \)
Explanation: Variance equals the second moment minus the square of the mean. It measures spread around the average.
MGF definition
\( M_X(t) = \mathbb{E}[e^{tX}], \quad \mathbb{E}[X^k] = M_X^{(k)}(0) \)
Explanation: The moment-generating function, when it exists, compactly encodes all moments. Differentiating k times at t=0 yields the k-th moment.
Binomial
\( P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad \mathbb{E}[X] = np, \ \operatorname{Var}(X) = np(1-p) \)
Explanation: Counts successes in n Bernoulli trials. The mean and variance grow linearly with n.
Poisson
\( P(X = k) = \dfrac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots \)
Explanation: Models rare event counts with rate \( \lambda \). Mean equals variance (both \( \lambda \)), a diagnostic property.
Bernoulli
\( P(X = 1) = p, \quad P(X = 0) = 1 - p \)
Explanation: A single binary trial with success probability p. Often used for labels and switches.
Gaussian
\( f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right) \)
Explanation: Bell-shaped density specified by mean \( \mu \) and variance \( \sigma^2 \). Central in statistics and ML.
Exponential
\( f(x) = \lambda e^{-\lambda x}, \quad x \ge 0 \)
Explanation: Models memoryless waiting times with rate \( \lambda \). The mean waiting time is \( 1/\lambda \).
Gamma
\( f(x) = \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}, \quad x > 0 \) (shape \( \alpha \), rate \( \beta \))
Explanation: Flexible family for positive-valued data; sums of exponential variables are gamma distributed.
Beta
\( f(x) = \dfrac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad x \in [0, 1] \)
Explanation: Distribution over probabilities on [0,1]. Useful as a prior for Bernoulli parameters.
Uniform
\( f(x) = \dfrac{1}{b - a}, \quad x \in [a, b] \)
Explanation: Every value in [a,b] is equally likely. A common base for simulation.
Exponential-family form
\( p(x \mid \theta) = h(x) \exp\!\big( \eta(\theta)^{\top} T(x) - A(\theta) \big) \)
Explanation: Unified template covering many common distributions. \( A(\theta) \) ensures normalization; \( T(x) \) are sufficient statistics.
Change of variables
\( f_Y(y) = f_X\!\big(g^{-1}(y)\big)\, \left| \det \dfrac{\partial g^{-1}(y)}{\partial y} \right| \) for \( Y = g(X) \)
Explanation: When transforming random variables with an invertible differentiable map g, densities scale by the absolute Jacobian determinant to conserve probability.
Mixture model
\( p(x) = \sum_{k=1}^{K} \pi_k\, p_k(x), \quad \pi_k \ge 0, \ \sum_{k} \pi_k = 1 \)
Explanation: Weighted sum of component densities creates multimodal shapes. Used for clustering and flexible density estimation.
Multivariate Gaussian
\( f(\mathbf{x}) = \dfrac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right) \)
Explanation: Generalizes the normal to vectors with covariance matrix Σ capturing correlations among dimensions.
Dirichlet
\( f(\mathbf{p} \mid \boldsymbol{\alpha}) = \dfrac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^{K} p_k^{\alpha_k - 1}, \quad p_k \ge 0, \ \sum_{k} p_k = 1 \)
Explanation: A distribution over probability vectors on the simplex. Conjugate prior for multinomial parameters (see the sampling sketch after these formulas).
Multinomial
\( P(x_1, \dots, x_K) = \dfrac{n!}{x_1! \cdots x_K!} \prod_{k=1}^{K} p_k^{x_k}, \quad \sum_{k} x_k = n \)
Explanation: Counts per category over n independent draws. Generalizes the binomial to K classes.
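One practical link between the Gamma and Dirichlet entries above: normalizing independent Gamma(\( \alpha_k \), 1) draws yields a Dirichlet(\( \boldsymbol{\alpha} \)) sample, which is handy because C++ <random> provides gamma_distribution but no Dirichlet. A minimal sketch under that standard construction; the alpha values and seed are arbitrary.

#include <cstdio>
#include <random>
#include <vector>

// Sample from Dirichlet(alpha) by drawing independent Gamma(alpha_k, 1)
// variates and normalizing them to sum to 1.
std::vector<double> sample_dirichlet(const std::vector<double>& alpha, std::mt19937& rng) {
    std::vector<double> p(alpha.size());
    double sum = 0.0;
    for (int k = 0; k < (int)alpha.size(); ++k) {
        std::gamma_distribution<double> g(alpha[k], 1.0); // shape alpha_k, scale 1
        p[k] = g(rng);
        sum += p[k];
    }
    for (double& v : p) v /= sum; // now p lies on the probability simplex
    return p;
}

int main() {
    std::mt19937 rng(7);
    std::vector<double> alpha = {2.0, 5.0, 3.0};
    // Average many samples: E[p_k] = alpha_k / sum(alpha) = 0.2, 0.5, 0.3.
    std::vector<double> avg(alpha.size(), 0.0);
    const int N = 20000;
    for (int i = 0; i < N; ++i) {
        std::vector<double> p = sample_dirichlet(alpha, rng);
        for (int k = 0; k < (int)p.size(); ++k) avg[k] += p[k] / N;
    }
    std::printf("empirical means: %.3f %.3f %.3f (theory: 0.200 0.500 0.300)\n",
                avg[0], avg[1], avg[2]);
    return 0;
}

The same construction gives Beta samples as the two-component special case: Gamma(\( \alpha \),1) / (Gamma(\( \alpha \),1) + Gamma(\( \beta \),1)) is Beta(\( \alpha, \beta \)) distributed.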
Complexity Analysis
Drawing a single value from the <random> distributions used below takes roughly constant time, so generating N samples costs O(N) time and O(N) memory when the samples are stored. The Box–Muller demo is likewise O(N), plus an O(N + B) histogram pass over B bins. Each EM iteration for a K-component mixture over n points costs O(nK) time and O(nK) memory for the responsibility matrix, so T iterations cost O(nKT).
Code Examples
#include <bits/stdc++.h>
using namespace std;

struct Moments {
    double mean{0.0};
    double var{0.0};
};

// Compute empirical mean and variance from samples
Moments empirical_moments(const vector<double>& x) {
    Moments m;
    if (x.empty()) return m;
    double s = accumulate(x.begin(), x.end(), 0.0);
    m.mean = s / x.size();
    double ss = 0.0;
    for (double v : x) ss += (v - m.mean) * (v - m.mean);
    m.var = ss / x.size(); // population variance
    return m;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    std::mt19937 rng(12345); // fixed seed for reproducibility
    const int N = 100000;    // number of samples per distribution

    // 1) Bernoulli(p)
    double p = 0.3;
    bernoulli_distribution bern(p);
    vector<double> bern_samp; bern_samp.reserve(N);
    for (int i = 0; i < N; ++i) bern_samp.push_back(bern(rng) ? 1.0 : 0.0);
    auto bern_m = empirical_moments(bern_samp);
    cout << fixed << setprecision(4);
    cout << "Bernoulli(p=0.3): mean(emp)=" << bern_m.mean
         << " var(emp)=" << bern_m.var
         << " | mean(th)=" << p << " var(th)=" << p*(1-p) << "\n";

    // 2) Binomial(n,p)
    int n = 10; double pb = 0.6;
    binomial_distribution<int> binom(n, pb);
    vector<double> bin_samp; bin_samp.reserve(N);
    for (int i = 0; i < N; ++i) bin_samp.push_back(static_cast<double>(binom(rng)));
    auto bin_m = empirical_moments(bin_samp);
    cout << "Binomial(n=10,p=0.6): mean(emp)=" << bin_m.mean
         << " var(emp)=" << bin_m.var
         << " | mean(th)=" << n*pb << " var(th)=" << n*pb*(1-pb) << "\n";

    // 3) Poisson(λ)
    double lambda = 3.5;
    poisson_distribution<int> pois(lambda);
    vector<double> poi_samp; poi_samp.reserve(N);
    for (int i = 0; i < N; ++i) poi_samp.push_back(static_cast<double>(pois(rng)));
    auto poi_m = empirical_moments(poi_samp);
    cout << "Poisson(λ=3.5): mean(emp)=" << poi_m.mean
         << " var(emp)=" << poi_m.var
         << " | mean(th)=" << lambda << " var(th)=" << lambda << "\n";

    // 4) Normal(μ,σ)
    double mu = 2.0, sigma = 1.5;
    normal_distribution<double> norm(mu, sigma);
    vector<double> nor_samp; nor_samp.reserve(N);
    for (int i = 0; i < N; ++i) nor_samp.push_back(norm(rng));
    auto nor_m = empirical_moments(nor_samp);
    cout << "Normal(μ=2,σ=1.5): mean(emp)=" << nor_m.mean
         << " var(emp)=" << nor_m.var
         << " | mean(th)=" << mu << " var(th)=" << sigma*sigma << "\n";

    // 5) Exponential(λ)
    double lam = 2.0; // rate
    exponential_distribution<double> expo(lam);
    vector<double> exp_samp; exp_samp.reserve(N);
    for (int i = 0; i < N; ++i) exp_samp.push_back(expo(rng));
    auto exp_m = empirical_moments(exp_samp);
    cout << "Exponential(λ=2): mean(emp)=" << exp_m.mean
         << " var(emp)=" << exp_m.var
         << " | mean(th)=" << 1.0/lam << " var(th)=" << 1.0/(lam*lam) << "\n";

    // 6) Uniform(a,b)
    double a = -1.0, b = 3.0;
    uniform_real_distribution<double> uni(a, b);
    vector<double> uni_samp; uni_samp.reserve(N);
    for (int i = 0; i < N; ++i) uni_samp.push_back(uni(rng));
    auto uni_m = empirical_moments(uni_samp);
    cout << "Uniform(a=-1,b=3): mean(emp)=" << uni_m.mean
         << " var(emp)=" << uni_m.var
         << " | mean(th)=" << (a+b)/2.0 << " var(th)=" << (b-a)*(b-a)/12.0 << "\n";

    return 0;
}
Uses C++ <random> to sample from several common distributions and compares empirical mean/variance to theoretical values. This demonstrates PMF/PDF-driven sampling and moments as summary statistics.
#include <bits/stdc++.h>
using namespace std;

// Generate N(0,1) using Box–Muller from two independent U(0,1)
pair<double,double> box_muller(double u1, double u2) {
    double R = sqrt(-2.0 * log(max(1e-12, u1))); // avoid log(0)
    double theta = 2.0 * M_PI * u2;
    return {R * cos(theta), R * sin(theta)}; // two independent normals
}

// Standard normal PDF
inline double phi(double x) {
    static const double inv_sqrt_2pi = 1.0 / sqrt(2.0 * M_PI);
    return inv_sqrt_2pi * exp(-0.5 * x * x);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    mt19937 rng(42);
    uniform_real_distribution<double> U(0.0, 1.0);

    // Sample many normals via Box–Muller
    const int N = 100000;
    vector<double> Z; Z.reserve(2*N);
    for (int i = 0; i < N; ++i) {
        auto u1 = U(rng); auto u2 = U(rng);
        auto [z1, z2] = box_muller(u1, u2);
        Z.push_back(z1); Z.push_back(z2);
    }

    // Affine transform: Y = a*Z + b, with Jacobian 1/|a|
    double a = 1.7, b = -0.5;
    vector<double> Y; Y.reserve(Z.size());
    for (double z : Z) Y.push_back(a*z + b);

    // Compare empirical density of Y vs. analytic change-of-variables
    int bins = 60; double ymin = -6.0, ymax = 6.0;
    vector<int> hist(bins, 0);
    for (double y : Y) {
        if (y < ymin || y > ymax) continue;
        int idx = int((y - ymin) / (ymax - ymin) * bins);
        if (idx == bins) idx = bins - 1;
        hist[idx]++;
    }

    cout << fixed << setprecision(4);
    cout << "bin_center\tempirical_pdf\tanalytic_pdf\n";
    for (int i = 0; i < bins; ++i) {
        double center = ymin + (i + 0.5) * (ymax - ymin) / bins;
        double width = (ymax - ymin) / bins;
        double emp_pdf = hist[i] / (double)Y.size() / width; // count / (N * bin width)
        // Analytic: if Z~N(0,1), then Y=aZ+b ~ N(b, a^2)
        double analytic = (1.0 / fabs(a)) * phi((center - b) / a);
        cout << center << "\t" << emp_pdf << "\t" << analytic << "\n";
    }

    return 0;
}
Generates standard normals via the Box–Muller transform (a change-of-variables construction). Then applies an affine transformation and compares a histogram-based empirical PDF to the analytic density obtained by the Jacobian rule, verifying f_Y(y)=f_Z((y−b)/a)/|a|.
#include <bits/stdc++.h>
using namespace std;

struct GMM1D {
    vector<double> mu;  // means
    vector<double> var; // variances
    vector<double> pi;  // mixing weights
};

inline double normal_pdf(double x, double mu, double var) {
    double coeff = 1.0 / sqrt(2.0 * M_PI * var);
    double z = (x - mu);
    return coeff * exp(-0.5 * z * z / var);
}

// Log-sum-exp for numerical stability: log(sum_i exp(a_i))
double log_sum_exp(const vector<double>& a) {
    double m = *max_element(a.begin(), a.end());
    double s = 0.0;
    for (double v : a) s += exp(v - m);
    return m + log(s);
}

// One EM iteration: updates parameters in-place, returns log-likelihood
double em_step(const vector<double>& x, GMM1D& model) {
    int n = (int)x.size();
    int K = (int)model.mu.size();

    // E-step: responsibilities gamma[n][k] ∝ pi_k * N(x_n | mu_k, var_k)
    vector<vector<double>> gamma(n, vector<double>(K));
    double loglik = 0.0;
    for (int i = 0; i < n; ++i) {
        vector<double> logs(K);
        for (int k = 0; k < K; ++k) {
            double coeff = -0.5 * log(2.0 * M_PI * model.var[k]);
            double z = x[i] - model.mu[k];
            double quad = -0.5 * (z * z) / model.var[k];
            logs[k] = log(model.pi[k]) + (coeff + quad);
        }
        double lse = log_sum_exp(logs);
        loglik += lse;
        for (int k = 0; k < K; ++k) gamma[i][k] = exp(logs[k] - lse);
    }

    // M-step: update pi, mu, var using responsibilities
    vector<double> Nk(K, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < K; ++k)
            Nk[k] += gamma[i][k];

    // Avoid degenerate variances by adding small epsilon
    const double eps = 1e-6;

    for (int k = 0; k < K; ++k) {
        model.pi[k] = max(eps, Nk[k] / n);
        // mean
        double num = 0.0;
        for (int i = 0; i < n; ++i) num += gamma[i][k] * x[i];
        model.mu[k] = num / max(eps, Nk[k]);
        // variance
        double vnum = 0.0;
        for (int i = 0; i < n; ++i) {
            double d = x[i] - model.mu[k];
            vnum += gamma[i][k] * d * d;
        }
        model.var[k] = max(1e-6, vnum / max(eps, Nk[k]));
    }

    // Normalize mixing weights to sum to 1
    double pis = accumulate(model.pi.begin(), model.pi.end(), 0.0);
    for (double &w : model.pi) w /= pis;

    return loglik;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate synthetic data from a 2-component GMM
    mt19937 rng(2024);
    normal_distribution<double> n1(-2.0, 0.7); // mean=-2,   sd=0.7
    normal_distribution<double> n2( 2.5, 1.0); // mean= 2.5, sd=1.0
    bernoulli_distribution choose2(0.4);       // 40% from comp2

    int N = 2000;
    vector<double> x; x.reserve(N);
    for (int i = 0; i < N; ++i) {
        if (choose2(rng)) x.push_back(n2(rng));
        else x.push_back(n1(rng));
    }

    // Initialize model (poorly) to test EM convergence
    GMM1D model;
    model.mu  = {-4.0, 3.5};
    model.var = {1.0, 1.0};
    model.pi  = {0.5, 0.5};

    // Run EM
    double last = -numeric_limits<double>::infinity();
    for (int it = 0; it < 50; ++it) {
        double ll = em_step(x, model);
        cerr << "iter=" << it << " loglik=" << ll
             << " | mu=[" << model.mu[0] << ", " << model.mu[1] << "]"
             << " var=[" << model.var[0] << ", " << model.var[1] << "]"
             << " pi=[" << model.pi[0] << ", " << model.pi[1] << "]\n";
        if (ll - last < 1e-6) break; // convergence check
        last = ll;
    }

    // Assign each point to the component with the larger posterior responsibility
    // (no ground-truth labels are available for this synthetic data)
    int assigned_to_2 = 0;
    for (double xi : x) {
        double p0 = model.pi[0] * normal_pdf(xi, model.mu[0], model.var[0]);
        double p1 = model.pi[1] * normal_pdf(xi, model.mu[1], model.var[1]);
        if (p1 > p0) ++assigned_to_2;
    }

    cout << fixed << setprecision(4);
    cout << "Learned GMM params:\n"
         << "mu: "  << model.mu[0]  << ", " << model.mu[1]  << "\n"
         << "var: " << model.var[0] << ", " << model.var[1] << "\n"
         << "pi: "  << model.pi[0]  << ", " << model.pi[1]  << "\n"
         << "points assigned to component 2: " << assigned_to_2 << " / " << N << "\n";

    return 0;
}
Implements the Expectation–Maximization algorithm for a 1D two-component Gaussian mixture. It generates synthetic data, initializes parameters, alternates E- and M-steps with log-sum-exp stabilization, and prints convergence diagnostics and learned parameters.