📚 Theory · Intermediate

Probability Distributions

Key Points

  • Probability distributions describe how random outcomes are spread across possible values and let us compute probabilities, expectations, and uncertainties.
  • Discrete distributions (e.g., Bernoulli, Binomial, Poisson) count events; continuous ones (e.g., Gaussian, Exponential, Gamma, Beta, Uniform) model real-valued measurements.
  • Key tools include the PMF/PDF (probability mass/density function), the CDF (cumulative distribution function), moments, and the MGF (moment-generating function).
  • Many common distributions belong to the exponential family, which yields convenient sufficient statistics and simplifies estimation.
  • Transformations (e.g., Box–Muller) map simple random variables into complex ones using Jacobians to adjust densities.
  • Mixture models combine several component distributions to fit multimodal data and are learned with algorithms like EM.
  • In AI/ML, Gaussian assumptions power regression, categorical distributions power classification, and Poisson models count data.
  • C++’s <random> library supports efficient sampling; you can also implement custom samplers and estimators for learning.

Prerequisites

  • Probability axioms and conditional probability: Distributions build on the axioms and require understanding of events, independence, and conditional probabilities.
  • Single-variable calculus (integration and differentiation): PDFs integrate to 1, CDFs are integrals of PDFs, and MGFs use expectations of exponentials.
  • Linear algebra (vectors, matrices, determinants): Multivariate Gaussians involve covariance matrices, determinants, and quadratic forms.
  • Numerical stability and floating-point arithmetic: Stable evaluation of log-likelihoods and combinatorial terms avoids overflow/underflow.
  • C++ random number generation (<random>): Practical sampling and simulation rely on engines and distribution objects in C++.

Detailed Explanation


01 Overview

A probability distribution is a mathematical description of how likely different outcomes are. For discrete outcomes (like coin flips), we use a probability mass function (PMF) that assigns probabilities to each possible value. For continuous outcomes (like heights), we use a probability density function (PDF) whose area under the curve over an interval gives the probability of landing in that interval. The cumulative distribution function (CDF) tells us the probability that a random variable is less than or equal to a value. Key numeric summaries, called moments (like mean and variance), describe the center and spread. The moment generating function (MGF) encodes all moments and can simplify proofs and derivations. Distributions come in families suited to different data: Bernoulli and Binomial for yes/no counts, Poisson for rare events per time window, Gaussian for aggregated noise, Exponential and Gamma for waiting times, Beta and Dirichlet for probabilities themselves, and Uniform when everything in a range is equally likely. Many of these fit into the exponential family, which yields elegant properties like sufficient statistics. In machine learning, we often assume data follow certain distributions to build models, estimate parameters from data (e.g., via maximum likelihood), and make predictions. Mixture models combine several distributions to capture complex, multimodal patterns.
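
As a minimal self-contained sketch of these ideas (an illustrative addition, not one of the worked examples below), the snippet builds the PMF of a fair six-sided die, checks that it sums to 1, accumulates the CDF, and derives the variance from the first two moments:

#include <bits/stdc++.h>
using namespace std;

int main() {
    // PMF of a fair six-sided die: p(x) = 1/6 for x = 1..6
    vector<double> pmf(6, 1.0 / 6.0);

    // Normalization check: probabilities must sum to 1
    double total = accumulate(pmf.begin(), pmf.end(), 0.0);

    // CDF, first moment E[X], and second moment E[X^2]
    double cdf = 0.0, ex = 0.0, ex2 = 0.0;
    for (int x = 1; x <= 6; ++x) {
        double p = pmf[x - 1];
        cdf += p;                  // F(x) = P(X <= x)
        ex  += x * p;              // E[X]
        ex2 += double(x) * x * p;  // E[X^2]
        cout << "x=" << x << "  p(x)=" << p << "  F(x)=" << cdf << "\n";
    }
    double var = ex2 - ex * ex;    // Var(X) = E[X^2] - (E[X])^2

    cout << "sum p(x) = " << total           // 1.0
         << ", E[X] = " << ex                // 3.5
         << ", Var(X) = " << var << "\n";    // 35/12 ≈ 2.9167
    return 0;
}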

02 Intuition & Analogies

Imagine tossing a handful of dice. Before rolling, you don’t know the totals you’ll see, but you have a sense of which are common (like 7) and which are rare (like 2 or 12). A probability distribution is that “sense” written as math: it tells you, for every possible outcome, how plausible it is. For discrete outcomes, it’s like a weighted checklist: each item (outcome) has a weight (probability) and the weights add to 1. For continuous outcomes, think of a smooth hill: the hill’s height (density) is not a probability by itself, but the area under the hill over an interval is. Different everyday processes create characteristic shapes. Flipping a coin is Bernoulli; repeating it n times gives a Binomial that looks bell-shaped when n is large. Counting rare events (like daily website outages) fits a Poisson curve. Waiting for the next bus is Exponential if arrivals are memoryless. Measurements with many tiny, independent errors tend to look Gaussian because all the little pushes and pulls add up to a bell curve. When you’re unsure about a probability itself (e.g., “what’s this click-through rate?”), Beta and Dirichlet act like flexible knobs to encode your prior belief. Sometimes one simple hill isn’t enough—think of heights of adults in a crowd containing both children and grownups: two bumps. Mixture models let you add hills (components) together. And if you only know how to sample uniform randomness, transformations (like Box–Muller) are recipes to reshape it into other distributions by stretching and bending, carefully adjusting areas so total probability stays 1.

03 Formal Definition

A probability distribution specifies a random variable X by giving its PMF p(x) for discrete X or PDF f(x) for continuous X, with normalization \(\sum_x p(x) = 1\) or \(\int f(x)\,dx = 1\) over the support. The CDF is \(F(x) = P(X \le x)\). The k-th moment is \(\mathbb{E}[X^k]\); the mean is \(\mu = \mathbb{E}[X]\); the variance is \(\sigma^2 = \mathbb{E}[(X-\mu)^2]\). The moment-generating function (MGF) is \(M_X(t) = \mathbb{E}[e^{tX}]\) when it exists, and it uniquely characterizes many distributions. An exponential-family distribution has density/mass of the form \(p(x \mid \theta) = h(x)\exp\!\left(\eta(\theta)^{\top} T(x) - A(\theta)\right)\), where \(T(x)\) are sufficient statistics, \(\eta(\theta)\) is the natural parameter, and \(A(\theta)\) ensures normalization. Sufficient statistics \(T(x)\) summarize data so that the likelihood of the parameters depends on the data only through \(\sum_i T(x_i)\). A mixture model combines components: \(p(x) = \sum_{k=1}^{K} \pi_k\, p_k(x)\), with mixing weights \(\pi_k \ge 0\), \(\sum_k \pi_k = 1\). Under a variable transformation \(y = g(x)\) (invertible and differentiable), the density changes via \(f_Y(y) = f_X\!\left(g^{-1}(y)\right)\,\left|\det J_{g^{-1}}(y)\right|\), where \(J_{g^{-1}}\) is the Jacobian of the inverse map.
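
To make the exponential-family template concrete, the Bernoulli PMF \(p^x (1-p)^{1-x}\) can be rewritten as \(h(x)\exp(\eta T(x) - A(\eta))\) with \(h(x) = 1\), \(T(x) = x\), natural parameter \(\eta = \log\frac{p}{1-p}\), and log-partition \(A(\eta) = \log(1 + e^{\eta})\). The short sketch below (an illustrative addition) checks numerically that the two forms agree:

#include <bits/stdc++.h>
using namespace std;

int main() {
    double p = 0.3;
    double eta = log(p / (1.0 - p));   // natural parameter
    double A   = log(1.0 + exp(eta));  // log-partition, equals -log(1-p)

    for (int x : {0, 1}) {
        double pmf    = pow(p, x) * pow(1.0 - p, 1 - x); // standard form
        double expfam = exp(eta * x - A);                // h(x)=1, T(x)=x
        cout << "x=" << x
             << "  standard=" << pmf
             << "  exp-family=" << expfam << "\n";       // the values match
    }
    return 0;
}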

04 When to Use

  • Bernoulli/Binomial: Binary outcomes and counts of successes (A/B tests, clicks). Use when each trial is independent with constant success probability.
  • Poisson: Count events in fixed intervals when events occur independently and rarely (call arrivals, defects). Often paired with Exponential for inter-arrival times (see the sketch after this list).
  • Gaussian: When many independent small effects add (sensor noise, regression residuals). The Central Limit Theorem justifies Gaussian approximations.
  • Exponential/Gamma: Model waiting/processing times; Exponential if memoryless, Gamma if you sum multiple exponential stages.
  • Beta/Dirichlet: Uncertainty over probabilities (priors/posteriors in Bayesian models for Bernoulli/Categorical or Multinomial data).
  • Uniform: When everything in [a, b] is equally plausible; also as a building block for other samplers.
  • Multinomial/Categorical: Multi-class outcomes and counts (word categories, class labels).
  • Multivariate Gaussian: Continuous vectors with linear correlations (Kalman filters, Gaussian process approximations).
  • Mixture models (e.g., Gaussian Mixtures): Data with multiple clusters/modes; unsupervised learning and density estimation.
  • Transformations: When you only have uniform randomness or need to reparameterize variables; used in simulation, normalizing flows, and variational inference.
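
As referenced in the Poisson bullet above, the sketch below (an illustrative addition using only standard <random> facilities) draws Exponential(λ) inter-arrival times, counts events per unit-length window, and compares the empirical mean and variance of the counts with the Poisson prediction that both equal λ:

#include <bits/stdc++.h>
using namespace std;

int main() {
    mt19937 rng(7);
    double lambda = 4.0;                          // event rate per unit time
    exponential_distribution<double> gap(lambda); // inter-arrival times

    const int windows = 20000;
    vector<int> counts(windows, 0);

    // Simulate arrivals over [0, windows) and bin them into unit intervals
    double t = gap(rng);
    while (t < windows) {
        counts[(int)t]++;
        t += gap(rng);
    }

    // Empirical mean and variance of counts per window
    double mean = accumulate(counts.begin(), counts.end(), 0.0) / windows;
    double var = 0.0;
    for (int c : counts) var += (c - mean) * (c - mean);
    var /= windows;

    cout << fixed << setprecision(3)
         << "counts per unit window: mean=" << mean
         << " var=" << var
         << " (Poisson predicts both ≈ " << lambda << ")\n";
    return 0;
}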

⚠️Common Mistakes

  • Confusing PDF/PMF with probability: For continuous variables, f(x) can exceed 1; only integrals over intervals are probabilities.
  • Using the wrong parameterization: E.g., mixing up the Exponential rate λ with the scale β = 1/λ, or Beta(α, β) vs. mean/variance forms. Always check definitions.
  • Ignoring support: Evaluating a PDF outside its support (e.g., negative values for an Exponential) or forgetting that Categorical probabilities must sum to 1 leads to invalid models.
  • Violating independence/identical distribution assumptions: Binomial and Poisson models break if trials or events are dependent or rates vary widely over time.
  • Numeric instability: Directly computing factorials or binomial coefficients can overflow. Use logs or stable functions (like lgamma) for large n (see the sketch after this list).
  • Overfitting with mixtures: Letting too many components fit noise; mitigate with regularization, priors, or model selection (BIC/AIC).
  • Misinterpreting CIs and MGFs: Confidence intervals are not probabilities of parameters, and MGFs may not exist for all distributions (e.g., heavy tails).
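
As flagged in the numeric-stability bullet, log-space evaluation sidesteps overflow. The sketch below (illustrative; binomial_log_pmf is a helper introduced here, not a standard function) computes a Binomial log-PMF with std::lgamma instead of factorials, so it stays finite even for n in the thousands:

#include <bits/stdc++.h>
using namespace std;

// log P(X = k) for X ~ Binomial(n, p), computed entirely in log space
double binomial_log_pmf(long long n, long long k, double p) {
    // log C(n, k) = lgamma(n+1) - lgamma(k+1) - lgamma(n-k+1)
    double log_coeff = lgamma((double)n + 1.0)
                     - lgamma((double)k + 1.0)
                     - lgamma((double)(n - k) + 1.0);
    return log_coeff + k * log(p) + (n - k) * log1p(-p);
}

int main() {
    long long n = 10000, k = 4980;
    double p = 0.5;
    double logp = binomial_log_pmf(n, k, p);
    cout << "log P(X=" << k << ") = " << logp
         << "  (P itself = " << exp(logp) << ")\n";
    // Naively computing 10000! would overflow long before this point.
    return 0;
}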

Key Formulas

PMF normalization

\(\sum_{x} p(x) = 1,\quad p(x) \ge 0\)

Explanation: Discrete probabilities must be nonnegative and sum to 1 across the support. This ensures a valid probability assignment.

PDF normalization

\(\int_{-\infty}^{\infty} f(x)\,dx = 1,\quad f(x) \ge 0\)

Explanation: The total area under a continuous density curve equals 1. Only integrals over intervals represent probabilities.

CDF

\(F(x) = P(X \le x) = \sum_{t \le x} p(t)\) (discrete) or \(\int_{-\infty}^{x} f(t)\,dt\) (continuous)

Explanation: The cumulative probability up to x is the sum (discrete) or integral (continuous) of the distribution. It is useful for thresholding and quantiles.

Expectation

\(\mathbb{E}[X] = \sum_{x} x\,p(x)\) or \(\int x\,f(x)\,dx\)

Explanation: The mean is the probability-weighted average of possible values. It summarizes the central tendency.

Variance via moments

\(\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\)

Explanation: Variance equals the second moment minus the square of the mean. It measures spread around the average.

MGF definition

\(M_X(t) = \mathbb{E}[e^{tX}],\qquad \mathbb{E}[X^k] = M_X^{(k)}(0)\)

Explanation: The moment-generating function, when it exists, compactly encodes all moments. Differentiating k times at t = 0 yields the k-th moment.

Binomial

\(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k},\quad \mathbb{E}[X] = np,\quad \operatorname{Var}(X) = np(1-p)\)

Explanation: Counts successes in n Bernoulli trials. The mean and variance grow linearly with n.

Poisson

\(P(X = k) = \dfrac{\lambda^k e^{-\lambda}}{k!},\quad \mathbb{E}[X] = \operatorname{Var}(X) = \lambda\)

Explanation: Models rare event counts with rate λ. Mean equals variance, a useful diagnostic property.

Bernoulli

\(P(X = 1) = p,\quad P(X = 0) = 1 - p,\quad \mathbb{E}[X] = p,\quad \operatorname{Var}(X) = p(1-p)\)

Explanation: A single binary trial with success probability p. Often used for labels and switches.

Gaussian

\(f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)\)

Explanation: Bell-shaped density specified by mean μ and variance σ². Central in statistics and ML.

Exponential

\(f(x) = \lambda e^{-\lambda x},\ x \ge 0,\quad \mathbb{E}[X] = 1/\lambda\)

Explanation: Models memoryless waiting times with rate λ. The mean waiting time is 1/λ.

Gamma

\(f(x) = \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x},\ x > 0\)

Explanation: Flexible family for positive-valued data; sums of independent exponential variables are gamma distributed.

Beta

\(f(x) = \dfrac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},\ x \in [0,1]\)

Explanation: Distribution over probabilities on [0, 1]. Useful as a prior for Bernoulli parameters.

Uniform

\(f(x) = \dfrac{1}{b-a},\ x \in [a,b]\)

Explanation: Every value in [a, b] is equally likely. A common base for simulation.

Exponential-family form

\(p(x \mid \theta) = h(x)\exp\!\left(\eta(\theta)^{\top} T(x) - A(\theta)\right)\)

Explanation: Unified template covering many common distributions. A(θ) ensures normalization; T(x) are the sufficient statistics.

Change of variables

\(f_Y(y) = f_X\!\left(g^{-1}(y)\right)\,\left|\det J_{g^{-1}}(y)\right|\) for invertible, differentiable \(y = g(x)\)

Explanation: When transforming random variables with an invertible differentiable map g, densities scale by the absolute Jacobian determinant to conserve probability.

Mixture model

\(p(x) = \sum_{k=1}^{K} \pi_k\, p_k(x),\quad \pi_k \ge 0,\quad \sum_{k=1}^{K} \pi_k = 1\)

Explanation: A weighted sum of component densities creates multimodal shapes. Used for clustering and flexible density estimation.

Multivariate Gaussian

\(f(\mathbf{x}) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\)

Explanation: Generalizes the normal to vectors, with covariance matrix Σ capturing correlations among dimensions.

Dirichlet

\(f(\mathbf{p}) = \dfrac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^{K} p_k^{\alpha_k - 1},\quad \sum_{k} p_k = 1\)

Explanation: A distribution over probability vectors on the simplex. Conjugate prior for multinomial parameters.

Multinomial

\(P(X_1 = n_1, \dots, X_K = n_K) = \dfrac{n!}{n_1!\cdots n_K!} \prod_{k=1}^{K} p_k^{n_k}\)

Explanation: Counts per category over n independent draws. Generalizes the binomial to K classes.
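
Neither the Dirichlet nor the Multinomial has a dedicated class in C++ <random>, but both entries above are easy to sample with standard building blocks (an illustrative sketch): a Dirichlet draw normalizes independent Gamma(α_k, 1) variates, and a Multinomial draw tallies repeated categorical samples from std::discrete_distribution.

#include <bits/stdc++.h>
using namespace std;

int main() {
    mt19937 rng(99);

    // Dirichlet(alpha): draw g_k ~ Gamma(alpha_k, 1) and normalize
    vector<double> alpha = {2.0, 3.0, 5.0};
    vector<double> g(alpha.size());
    double sum = 0.0;
    for (size_t k = 0; k < alpha.size(); ++k) {
        gamma_distribution<double> G(alpha[k], 1.0);
        g[k] = G(rng);
        sum += g[k];
    }
    cout << "Dirichlet sample: ";
    for (double v : g) cout << v / sum << " ";
    cout << "\n";

    // Multinomial(n, p): n categorical draws, tallied per category
    vector<double> p = {0.2, 0.3, 0.5};
    discrete_distribution<int> cat(p.begin(), p.end());
    vector<int> counts(p.size(), 0);
    int n = 1000;
    for (int i = 0; i < n; ++i) counts[cat(rng)]++;
    cout << "Multinomial counts: ";
    for (int c : counts) cout << c << " ";
    cout << "\n";
    return 0;
}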

Complexity Analysis

Probability operations decompose into three main tasks: evaluating likelihoods/densities, sampling, and parameter estimation.

Evaluating a univariate PMF/PDF is typically O(1) time per point (e.g., Gaussian, Exponential). Care is needed for combinatorial terms (binomial coefficients), which should be computed with stable functions (like lgamma) to avoid O(n) factorial computations and overflow. CDFs may require special functions or numerical integration; standard libraries provide O(1) approximations for common distributions.

Sampling via the C++ <random> library uses efficient algorithms: inversion, rejection, or transformation methods. Each sample is generated in expected O(1) time for typical distributions (Bernoulli, Poisson, normal). Custom methods like Box–Muller produce two normals in constant time per pair. Transformations that require Jacobians do not add asymptotic cost beyond function evaluation.

Parameter estimation depends on the model. For single-parameter exponential-family distributions with i.i.d. data, maximum likelihood reduces to computing sufficient statistics (e.g., sums), which is O(n) time and O(1) space in a single pass. For mixture models trained with Expectation–Maximization (EM), each iteration requires computing responsibilities for each data point and component, yielding O(nK) time and O(nK) space if responsibilities are stored; the number of iterations I multiplies the cost to O(nKI). In multivariate Gaussians of dimension d, evaluating the log-density involves matrix inverses or Cholesky decompositions: precomputing Σ^{-1} and log det Σ costs O(d^3), while per-point evaluation is O(d^2). Memory usage is dominated by storing data (O(nd)), parameters (O(Kd^2) for full-covariance mixtures), and any cached sufficient statistics (typically O(Kd) to O(Kd^2)).
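
To illustrate the O(d^3) / O(d^2) split mentioned above, here is a compact sketch with a hand-rolled Cholesky factorization (an illustrative addition; a real project would typically use a linear-algebra library): the factorization of Σ is computed once, and each log-density evaluation then needs only a forward substitution.

#include <bits/stdc++.h>
using namespace std;

// Cholesky factorization Sigma = L * L^T (lower-triangular L); O(d^3), done once
vector<vector<double>> cholesky(const vector<vector<double>>& S) {
    int d = (int)S.size();
    vector<vector<double>> L(d, vector<double>(d, 0.0));
    for (int i = 0; i < d; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = S[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            L[i][j] = (i == j) ? sqrt(sum) : sum / L[j][j];
        }
    }
    return L;
}

// log N(x | mu, Sigma) using the precomputed factor; O(d^2) per point
double mvn_logpdf(const vector<double>& x, const vector<double>& mu,
                  const vector<vector<double>>& L) {
    int d = (int)x.size();
    // Solve L y = (x - mu) by forward substitution; then the quadratic form is ||y||^2
    vector<double> y(d);
    for (int i = 0; i < d; ++i) {
        double s = x[i] - mu[i];
        for (int k = 0; k < i; ++k) s -= L[i][k] * y[k];
        y[i] = s / L[i][i];
    }
    double quad = 0.0, logdet = 0.0;
    for (int i = 0; i < d; ++i) {
        quad += y[i] * y[i];
        logdet += 2.0 * log(L[i][i]);   // log det Sigma = 2 * sum_i log L_ii
    }
    return -0.5 * (d * log(2.0 * M_PI) + logdet + quad);
}

int main() {
    vector<double> mu = {1.0, -2.0};
    vector<vector<double>> Sigma = {{2.0, 0.6}, {0.6, 1.0}};
    auto L = cholesky(Sigma);           // O(d^3), amortized over many evaluations

    vector<double> x = {0.5, -1.5};
    cout << "log-density at x: " << mvn_logpdf(x, mu, L) << "\n";
    return 0;
}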

Code Examples

Sampling and empirical moments for common distributions
#include <bits/stdc++.h>
using namespace std;

struct Moments {
    double mean{0.0};
    double var{0.0};
};

// Compute empirical mean and variance from samples
Moments empirical_moments(const vector<double>& x) {
    Moments m;
    if (x.empty()) return m;
    double s = accumulate(x.begin(), x.end(), 0.0);
    m.mean = s / x.size();
    double ss = 0.0;
    for (double v : x) ss += (v - m.mean) * (v - m.mean);
    m.var = ss / x.size(); // population variance
    return m;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    std::mt19937 rng(12345); // fixed seed for reproducibility
    const int N = 100000;    // number of samples per distribution

    // 1) Bernoulli(p)
    double p = 0.3;
    bernoulli_distribution bern(p);
    vector<double> bern_samp; bern_samp.reserve(N);
    for (int i = 0; i < N; ++i) bern_samp.push_back(bern(rng) ? 1.0 : 0.0);
    auto bern_m = empirical_moments(bern_samp);
    cout << fixed << setprecision(4);
    cout << "Bernoulli(p=0.3): mean(emp)=" << bern_m.mean
         << " var(emp)=" << bern_m.var
         << " | mean(th)=" << p << " var(th)=" << p*(1-p) << "\n";

    // 2) Binomial(n,p)
    int n = 10; double pb = 0.6;
    binomial_distribution<int> binom(n, pb);
    vector<double> bin_samp; bin_samp.reserve(N);
    for (int i = 0; i < N; ++i) bin_samp.push_back(static_cast<double>(binom(rng)));
    auto bin_m = empirical_moments(bin_samp);
    cout << "Binomial(n=10,p=0.6): mean(emp)=" << bin_m.mean
         << " var(emp)=" << bin_m.var
         << " | mean(th)=" << n*pb << " var(th)=" << n*pb*(1-pb) << "\n";

    // 3) Poisson(λ)
    double lambda = 3.5;
    poisson_distribution<int> pois(lambda);
    vector<double> poi_samp; poi_samp.reserve(N);
    for (int i = 0; i < N; ++i) poi_samp.push_back(static_cast<double>(pois(rng)));
    auto poi_m = empirical_moments(poi_samp);
    cout << "Poisson(λ=3.5): mean(emp)=" << poi_m.mean
         << " var(emp)=" << poi_m.var
         << " | mean(th)=" << lambda << " var(th)=" << lambda << "\n";

    // 4) Normal(μ,σ)
    double mu = 2.0, sigma = 1.5;
    normal_distribution<double> norm(mu, sigma);
    vector<double> nor_samp; nor_samp.reserve(N);
    for (int i = 0; i < N; ++i) nor_samp.push_back(norm(rng));
    auto nor_m = empirical_moments(nor_samp);
    cout << "Normal(μ=2,σ=1.5): mean(emp)=" << nor_m.mean
         << " var(emp)=" << nor_m.var
         << " | mean(th)=" << mu << " var(th)=" << sigma*sigma << "\n";

    // 5) Exponential(λ)
    double lam = 2.0; // rate
    exponential_distribution<double> expo(lam);
    vector<double> exp_samp; exp_samp.reserve(N);
    for (int i = 0; i < N; ++i) exp_samp.push_back(expo(rng));
    auto exp_m = empirical_moments(exp_samp);
    cout << "Exponential(λ=2): mean(emp)=" << exp_m.mean
         << " var(emp)=" << exp_m.var
         << " | mean(th)=" << 1.0/lam << " var(th)=" << 1.0/(lam*lam) << "\n";

    // 6) Uniform(a,b)
    double a = -1.0, b = 3.0;
    uniform_real_distribution<double> uni(a, b);
    vector<double> uni_samp; uni_samp.reserve(N);
    for (int i = 0; i < N; ++i) uni_samp.push_back(uni(rng));
    auto uni_m = empirical_moments(uni_samp);
    cout << "Uniform(a=-1,b=3): mean(emp)=" << uni_m.mean
         << " var(emp)=" << uni_m.var
         << " | mean(th)=" << (a+b)/2.0 << " var(th)=" << (b-a)*(b-a)/12.0 << "\n";

    return 0;
}

Uses C++ <random> to sample from several common distributions and compares empirical mean/variance to theoretical values. This demonstrates PMF/PDF-driven sampling and moments as summary statistics.

Time: O(N) per distribution sampled, where N is the number of samples.
Space: O(N) to store samples; can be reduced to O(1) by streaming moments.
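
As noted in the space estimate above, the O(N) sample buffer can be avoided by streaming the moments. Below is a minimal sketch of Welford's single-pass update (an illustrative addition; RunningMoments is a helper defined here):

#include <bits/stdc++.h>
using namespace std;

// Welford's online algorithm: one pass, O(1) memory, numerically stable
struct RunningMoments {
    long long n = 0;
    double mean = 0.0;
    double m2 = 0.0;   // sum of squared deviations from the running mean

    void push(double x) {
        ++n;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }
    double variance() const { return n > 0 ? m2 / n : 0.0; } // population variance
};

int main() {
    mt19937 rng(12345);
    normal_distribution<double> norm(2.0, 1.5);

    RunningMoments rm;
    for (int i = 0; i < 100000; ++i) rm.push(norm(rng)); // no sample vector needed

    cout << fixed << setprecision(4)
         << "streaming mean=" << rm.mean
         << " var=" << rm.variance()
         << " | theory mean=2.0 var=2.25\n";
    return 0;
}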
Box–Muller transform and change-of-variables check
#include <bits/stdc++.h>
using namespace std;

// Generate N(0,1) using Box–Muller from two independent U(0,1)
pair<double,double> box_muller(double u1, double u2) {
    double R = sqrt(-2.0 * log(max(1e-12, u1))); // avoid log(0)
    double theta = 2.0 * M_PI * u2;
    return {R * cos(theta), R * sin(theta)};     // two independent normals
}

// Standard normal PDF
inline double phi(double x) {
    static const double inv_sqrt_2pi = 1.0 / sqrt(2.0 * M_PI);
    return inv_sqrt_2pi * exp(-0.5 * x * x);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    mt19937 rng(42);
    uniform_real_distribution<double> U(0.0, 1.0);

    // Sample many normals via Box–Muller
    const int N = 100000;
    vector<double> Z; Z.reserve(2*N);
    for (int i = 0; i < N; ++i) {
        auto u1 = U(rng); auto u2 = U(rng);
        auto [z1, z2] = box_muller(u1, u2);
        Z.push_back(z1); Z.push_back(z2);
    }

    // Affine transform: Y = a*Z + b, with Jacobian 1/|a|
    double a = 1.7, b = -0.5;
    vector<double> Y; Y.reserve(Z.size());
    for (double z : Z) Y.push_back(a*z + b);

    // Compare empirical density of Y vs. analytic change-of-variables
    int bins = 60; double ymin = -6.0, ymax = 6.0;
    vector<int> hist(bins, 0);
    for (double y : Y) {
        if (y < ymin || y > ymax) continue;
        int idx = int((y - ymin) / (ymax - ymin) * bins);
        if (idx == bins) idx = bins - 1;
        hist[idx]++;
    }

    cout << fixed << setprecision(4);
    cout << "bin_center\tempirical_pdf\tanalytic_pdf\n";
    for (int i = 0; i < bins; ++i) {
        double center = ymin + (i + 0.5) * (ymax - ymin) / bins;
        double width = (ymax - ymin) / bins;
        double emp_pdf = hist[i] / (double)Y.size() / width; // count / (N * bin width)
        // Analytic: if Z~N(0,1), then Y=aZ+b ~ N(b, a^2)
        double analytic = (1.0 / fabs(a)) * phi((center - b) / a);
        cout << center << "\t" << emp_pdf << "\t" << analytic << "\n";
    }

    return 0;
}

Generates standard normals via the Box–Muller transform (a change-of-variables construction). Then applies an affine transformation and compares a histogram-based empirical PDF to the analytic density obtained by the Jacobian rule, verifying f_Y(y)=f_Z((y−b)/a)/|a|.

Time: O(N) to generate 2N normals and O(N) to bin them.
Space: O(N) to store samples; the histogram alone is O(bins).
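
Inversion is another change-of-variables construction in the same spirit as Box–Muller: mapping U(0,1) through the inverse CDF of the target distribution yields samples from it. The sketch below (an illustrative addition) uses the Exponential inverse CDF, \(F^{-1}(u) = -\ln(1-u)/\lambda\), and checks the empirical mean against \(1/\lambda\):

#include <bits/stdc++.h>
using namespace std;

int main() {
    mt19937 rng(123);
    uniform_real_distribution<double> U(0.0, 1.0);

    double lambda = 2.0;   // rate
    const int N = 100000;
    double sum = 0.0;

    for (int i = 0; i < N; ++i) {
        double u = U(rng);
        // Inverse CDF of Exponential(lambda): F^{-1}(u) = -log(1 - u) / lambda
        double x = -log1p(-u) / lambda; // log1p avoids precision loss for small u
        sum += x;
    }

    cout << fixed << setprecision(4)
         << "empirical mean=" << sum / N
         << " | theoretical mean=" << 1.0 / lambda << "\n";
    return 0;
}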
EM algorithm for a 1D Gaussian Mixture Model (K=2)
#include <bits/stdc++.h>
using namespace std;

struct GMM1D {
    vector<double> mu;  // means
    vector<double> var; // variances
    vector<double> pi;  // mixing weights
};

inline double normal_pdf(double x, double mu, double var) {
    double coeff = 1.0 / sqrt(2.0 * M_PI * var);
    double z = (x - mu);
    return coeff * exp(-0.5 * z * z / var);
}

// Log-sum-exp for numerical stability: log(sum_i exp(a_i))
double log_sum_exp(const vector<double>& a) {
    double m = *max_element(a.begin(), a.end());
    double s = 0.0;
    for (double v : a) s += exp(v - m);
    return m + log(s);
}

// One EM iteration: updates parameters in-place, returns log-likelihood
double em_step(const vector<double>& x, GMM1D& model) {
    int n = (int)x.size();
    int K = (int)model.mu.size();

    // E-step: responsibilities gamma[i][k] ∝ pi_k * N(x_i | mu_k, var_k)
    vector<vector<double>> gamma(n, vector<double>(K));
    double loglik = 0.0;
    for (int i = 0; i < n; ++i) {
        vector<double> logs(K);
        for (int k = 0; k < K; ++k) {
            double coeff = -0.5 * log(2.0 * M_PI * model.var[k]);
            double z = x[i] - model.mu[k];
            double quad = -0.5 * (z * z) / model.var[k];
            logs[k] = log(model.pi[k]) + (coeff + quad);
        }
        double lse = log_sum_exp(logs);
        loglik += lse;
        for (int k = 0; k < K; ++k) gamma[i][k] = exp(logs[k] - lse);
    }

    // M-step: update pi, mu, var using responsibilities
    vector<double> Nk(K, 0.0);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < K; ++k)
            Nk[k] += gamma[i][k];

    // Avoid degenerate variances by adding a small epsilon
    const double eps = 1e-6;

    for (int k = 0; k < K; ++k) {
        model.pi[k] = max(eps, Nk[k] / n);
        // mean
        double num = 0.0;
        for (int i = 0; i < n; ++i) num += gamma[i][k] * x[i];
        model.mu[k] = num / max(eps, Nk[k]);
        // variance
        double vnum = 0.0;
        for (int i = 0; i < n; ++i) {
            double d = x[i] - model.mu[k];
            vnum += gamma[i][k] * d * d;
        }
        model.var[k] = max(eps, vnum / max(eps, Nk[k]));
    }

    // Normalize mixing weights to sum to 1
    double pis = accumulate(model.pi.begin(), model.pi.end(), 0.0);
    for (double &w : model.pi) w /= pis;

    return loglik;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate synthetic data from a 2-component GMM
    mt19937 rng(2024);
    normal_distribution<double> n1(-2.0, 0.7); // mean=-2,   sd=0.7
    normal_distribution<double> n2( 2.5, 1.0); // mean= 2.5, sd=1.0
    bernoulli_distribution choose2(0.4);       // 40% from component 2

    int N = 2000;
    vector<double> x; x.reserve(N);
    for (int i = 0; i < N; ++i) {
        if (choose2(rng)) x.push_back(n2(rng));
        else x.push_back(n1(rng));
    }

    // Initialize model (poorly) to test EM convergence
    GMM1D model;
    model.mu  = {-4.0, 3.5};
    model.var = {1.0, 1.0};
    model.pi  = {0.5, 0.5};

    // Run EM
    double last = -numeric_limits<double>::infinity();
    for (int it = 0; it < 50; ++it) {
        double ll = em_step(x, model);
        cerr << "iter=" << it << " loglik=" << ll
             << " | mu=[" << model.mu[0] << ", " << model.mu[1] << "]"
             << " var=[" << model.var[0] << ", " << model.var[1] << "]"
             << " pi=[" << model.pi[0] << ", " << model.pi[1] << "]\n";
        if (ll - last < 1e-6) break; // convergence check
        last = ll;
    }

    // Hard-assign each point to the component with the larger posterior weight
    // (no ground-truth labels are available for the synthetic mixture)
    int assigned_to_second = 0;
    for (double xi : x) {
        double p0 = model.pi[0] * normal_pdf(xi, model.mu[0], model.var[0]);
        double p1 = model.pi[1] * normal_pdf(xi, model.mu[1], model.var[1]);
        if (p1 > p0) ++assigned_to_second;
    }

    cout << fixed << setprecision(4);
    cout << "Learned GMM params:\n"
         << "mu: "  << model.mu[0]  << ", " << model.mu[1]  << "\n"
         << "var: " << model.var[0] << ", " << model.var[1] << "\n"
         << "pi: "  << model.pi[0]  << ", " << model.pi[1]  << "\n"
         << "points assigned to second component: " << assigned_to_second
         << " / " << N << "\n";

    return 0;
}

Implements the Expectation–Maximization algorithm for a 1D two-component Gaussian mixture. It generates synthetic data, initializes parameters, alternates E- and M-steps with log-sum-exp stabilization, and prints convergence diagnostics and learned parameters.

Time: O(NK) per EM iteration; with K=2 this is O(N). Over I iterations, O(NI).
Space: O(NK) for responsibilities; parameters require O(K).
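
Once a mixture is fitted, new data can be generated exactly as the model definition suggests: pick a component k with probability π_k, then sample from that component's Gaussian. A small sketch of this ancestral sampling (an illustrative addition; the parameters are hard-coded stand-ins resembling the fitted values above):

#include <bits/stdc++.h>
using namespace std;

int main() {
    mt19937 rng(7);

    // Example GMM parameters (hard-coded for illustration; in practice take
    // them from the fitted model above)
    vector<double> pi = {0.6, 0.4};    // mixing weights
    vector<double> mu = {-2.0, 2.5};   // component means
    vector<double> sd = { 0.7, 1.0};   // component standard deviations

    discrete_distribution<int> pick(pi.begin(), pi.end()); // sample component index
    vector<normal_distribution<double>> comp = {
        normal_distribution<double>(mu[0], sd[0]),
        normal_distribution<double>(mu[1], sd[1])
    };

    // Ancestral sampling: p(x) = sum_k pi_k * N(x | mu_k, sd_k^2)
    const int N = 10;
    for (int i = 0; i < N; ++i) {
        int k = pick(rng);        // which component generated this point
        double x = comp[k](rng);  // draw from that component
        cout << "k=" << k << "  x=" << x << "\n";
    }
    return 0;
}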