Probability Theory
Key Points
- Probability theory formalizes uncertainty using a sample space, events, and a probability measure that obeys clear axioms.
- Random variables are measurable functions that map outcomes to numbers, enabling us to compute expectations and variances.
- Bayes' theorem lets us update beliefs with data, forming the core of many machine learning methods.
- Independence and conditional probability describe how events and variables relate, affecting how we factor and compute probabilities.
- The law of large numbers explains why averages stabilize, and the central limit theorem explains why many sums look Gaussian.
- Common distributions (Bernoulli, Binomial, Poisson, Gaussian, Exponential, Gamma, Beta) model diverse phenomena from counts to waiting times.
- Monte Carlo simulation in C++ using <random> is a practical way to estimate probabilities and expectations when formulas are hard.
- Careful attention to assumptions (independence, proper conditioning, finite variance) avoids common pitfalls in modeling and inference.
Prerequisites
- Set theory and functions – Events are sets; random variables are functions from outcomes to numbers.
- Calculus (integration and differentiation) – Expectations for continuous variables are integrals; densities and MGFs use calculus.
- Sequences and limits – LLN and CLT involve convergence concepts and limit behavior.
- Linear algebra (vectors, covariance matrices) – Multivariate normals, covariance/correlation, and Gaussian models use linear algebra.
- Basic statistics – Interpretation of mean, variance, and confidence/credible intervals relies on statistical concepts.
- C++ programming basics – Implementing simulations requires familiarity with syntax, loops, and standard libraries.
- Pseudorandom number generation – Monte Carlo requires proper RNG engines and distributions in C++.
- Combinatorics – Counting arguments derive PMFs for discrete distributions like the Binomial.
Detailed Explanation
01 Overview
Probability theory is the mathematics of uncertainty. It gives us a language to describe random phenomena, like coin flips, weather, or noisy sensors, and rules to compute meaningful quantities such as the likelihood of an event, the average outcome, and how much results vary. At its core is the Kolmogorov axiomatic framework: we define a sample space of possible outcomes, a collection of events we can talk about (a sigma-algebra), and a probability measure that assigns numbers to events in a consistent way. On top of this foundation, random variables turn outcomes into numbers so that we can compute expectations, variances, and other statistics. Key tools include Bayes' theorem for updating beliefs, independence for factorizing probabilities, and limit theorems (the law of large numbers and the central limit theorem) that explain why averages stabilize and why Gaussian curves appear so often. In machine learning, probability is the backbone: loss functions, generative models, Bayesian inference, and even optimization under uncertainty rely on probabilistic thinking. Practically, when formulas get messy, simulation (Monte Carlo) provides approximate answers, and modern C++ offers robust facilities for random number generation and sampling from standard distributions.
02 Intuition & Analogies
Imagine you're trying to predict if it will rain tomorrow. You don't know for sure, but you can quantify your uncertainty, say a 30% chance. Probability is like a bookkeeping system for uncertainty. The sample space is the list of all possible tomorrows (rain, no rain); events are statements you can test ("it rains"); and the probability measure is a rule that tells you how strongly you believe each event will happen. Random variables are like sensors attached to outcomes: they look at what happened and output a number (1 if it rains, 0 otherwise; or how many millimeters fall). Expectation is a long-run average: if you repeated tomorrow many times in parallel universes, the average of the random variable's outputs is its expectation. Variance measures how wiggly those outputs are around the average. Bayes' theorem is an update rule: you start with a prior belief (chance of rain), observe evidence (dark clouds), and produce a posterior belief (now you think rain is more likely). Independence is like two coin flips that don't influence each other; conditional probability is like asking, "given that clouds are dark, how likely is rain?" The law of large numbers is the reason averages settle down: flip a fair coin thousands of times and the fraction of heads approaches 0.5. The central limit theorem explains why sums of many small, independent effects often look bell-shaped, which is why Gaussian noise appears everywhere, from physics to sensor readings in robotics.
03 Formal Definition
04 When to Use
Use probability theory whenever outcomes are uncertain or subject to noise. In software engineering and simulation, model failure rates, latency spikes, or randomized algorithms using Bernoulli, Exponential, or Gaussian distributions. In data science and machine learning, apply Bayes' theorem for posterior inference, Naive Bayes for classification, and probabilistic models (e.g., Gaussian mixtures) for clustering. For A/B testing, use Binomial and Beta distributions to reason about click-through rates and compute credible intervals. In queuing and reliability, use Poisson processes for counts and Exponential/Gamma distributions for waiting times. When exact formulas are hard or unavailable, use Monte Carlo in C++: sample from <random> distributions to estimate expectations E[g(X)] or probabilities P(A) by averaging indicators. Rely on the law of large numbers to justify convergence of these estimates, and use the central limit theorem to build confidence intervals around them. Use independence assumptions to factor complex joint distributions into simpler pieces; when independence is false, use conditional probabilities and covariance to capture dependencies. Finally, reach for conjugate priors (e.g., Beta-Binomial, Gamma-Poisson) to obtain closed-form Bayesian updates.
⚠️ Common Mistakes
- Confusing P(A\mid B) with P(B\mid A). Bayes' theorem shows they are linked by the prior odds; swapping them without adjustment leads to base-rate fallacies. Avoid this by explicitly writing P(A\mid B) = P(B\mid A)P(A)/P(B).
- Assuming independence without justification. Many signals in real data are correlated; incorrectly factoring P(X,Y) as P(X)P(Y) can severely bias estimates. Check correlations or design experiments that enforce independence.
- Treating densities as probabilities. For continuous X, f(x) can exceed 1 and P(X = x) = 0; only integrals over intervals yield probabilities. Always integrate densities over sets to get probabilities.
- Ignoring variance and uncertainty. Reporting only point estimates (like an average) can be misleading; include variability (variance, standard error) and, when possible, confidence or credible intervals.
- Misusing the central limit theorem. CLT requires independence (or weak dependence) and finite variance; heavy-tailed distributions may violate these, delaying convergence or changing the limiting law. Verify conditions or use robust methods.
- Double counting in the law of total probability. Ensure \{A_i\} form a partition (disjoint and covering the space) before summing P(B\mid A_i)P(A_i).
- Numerical instability and sampling bias. Poor RNG seeding or small sample sizes can produce misleading Monte Carlo results; use high-quality engines (mt19937), sufficient samples, and variance reduction techniques when possible.
Key Formulas
Kolmogorov Axioms (Normalization)
P(\Omega) = 1 and P(\emptyset) = 0
Explanation: The whole space has probability 1 and the empty set has probability 0. These anchor the scale of probability.
Countable Additivity
P(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i) for pairwise disjoint A_i
Explanation: The probability of a countable union of disjoint events equals the sum of their probabilities. This extends finite additivity to infinite collections.
Complement Rule
P(A^c) = 1 - P(A)
Explanation: The probability that A does not happen is one minus the probability that it does. Useful for computing tail probabilities.
Inclusion–Exclusion (Two Sets)
P(A \cup B) = P(A) + P(B) - P(A \cap B)
Explanation: To avoid double counting overlap, subtract the intersection once. This generalizes to more sets.
Conditional Probability
P(A \mid B) = P(A \cap B) / P(B)
Explanation: Defines the probability of A given that B occurs. Requires P(B) > 0.
Bayes' Theorem
P(A \mid B) = P(B \mid A) P(A) / P(B)
Explanation: Updates prior P(A) with likelihood P(B \mid A) to produce posterior P(A \mid B). The denominator P(B) normalizes over all causes of B.
Law of Total Probability
P(B) = \sum_i P(B \mid A_i) P(A_i)
Explanation: If \{A_i\} forms a partition, the overall probability of B is a weighted sum over the conditional probabilities in each partition cell.
Independence (Events)
P(A \cap B) = P(A) P(B)
Explanation: A and B are independent if knowing one does not change the probability of the other. For random variables, joint densities factor similarly.
Expectation
E[X] = \sum_x x\,p(x) (discrete) or E[X] = \int x\,f(x)\,dx (continuous)
Explanation: Expectation is the long-run average of a random variable. Use a sum for discrete variables and an integral for continuous ones.
Variance via Moments
Var(X) = E[X^2] - (E[X])^2
Explanation: Variance equals the second moment minus the square of the first moment. It measures spread around the mean.
Covariance
Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]
Explanation: Covariance measures the linear dependence between X and Y. Zero covariance does not imply independence in general.
LOTUS
E[g(X)] = \sum_x g(x)\,p(x) or E[g(X)] = \int g(x)\,f(x)\,dx
Explanation: The Law of the Unconscious Statistician lets you compute the expectation of a function of X without explicitly finding the distribution of g(X).
Chebyshev's Inequality
P(|X - \mu| \ge k\sigma) \le 1/k^2
Explanation: Regardless of distribution (with finite variance), the probability of being k or more standard deviations from the mean is at most 1/k^2. Useful for conservative bounds.
Markov's Inequality
P(X \ge a) \le E[X]/a for nonnegative X and a > 0
Explanation: For nonnegative X, the probability of exceeding a threshold is at most the mean divided by that threshold. A simple but general tail bound.
Law of Large Numbers (Strong)
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \to \mu almost surely
Explanation: The sample mean of iid variables with mean \mu converges almost surely to \mu. This justifies Monte Carlo averaging.
Central Limit Theorem
\sqrt{n}\,(\bar{X}_n - \mu)/\sigma \to N(0, 1) in distribution
Explanation: Properly standardized sums of iid variables converge in distribution to a standard normal. Enables approximate confidence intervals.
Moment Generating Function
M_X(t) = E[e^{tX}]
Explanation: The MGF, when it exists near 0, uniquely determines the distribution and encodes all moments via derivatives at 0.
Convolution (Continuous)
f_{X+Y}(z) = \int_{-\infty}^{\infty} f_X(x)\,f_Y(z - x)\,dx
Explanation: The density of a sum of independent continuous variables is the convolution of their densities. The discrete analog uses sums.
Binomial PMF
P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, k = 0, 1, \ldots, n
Explanation: Probability of k successes in n independent Bernoulli trials with success probability p. Models counts in fixed trials.
Poisson PMF
P(X = k) = e^{-\lambda} \lambda^k / k!, k = 0, 1, 2, \ldots
Explanation: Counts events in a fixed interval when events occur independently at rate \lambda. Often used for rare events.
Normal PDF
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
Explanation: The bell-shaped density with mean \mu and variance \sigma^2. Central in statistics and signal processing.
Exponential PDF
f(x) = \lambda e^{-\lambda x} for x \ge 0
Explanation: Models waiting times with the memoryless property at rate \lambda. Mean and variance are 1/\lambda and 1/\lambda^2.
Complexity Analysis
Code Examples
#include <bits/stdc++.h>
using namespace std;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Seed a high-quality Mersenne Twister engine
    random_device rd;
    mt19937 gen(rd());

    // Standard normal distribution N(0,1)
    normal_distribution<double> normal01(0.0, 1.0);

    const size_t N = 200000; // number of samples

    // Accumulators for mean and second moment (for variance)
    long double sum = 0.0L;
    long double sumsq = 0.0L;

    // Estimate probability P(|X| <= 1)
    size_t count_in = 0;

    for (size_t i = 0; i < N; ++i) {
        double x = normal01(gen);
        sum += x;
        sumsq += x * x;
        if (fabs(x) <= 1.0) ++count_in;
    }

    long double mean = sum / N;
    long double var = sumsq / N - mean * mean; // Var(X) = E[X^2] - (E[X])^2
    long double p_interval = static_cast<long double>(count_in) / N;

    // Theoretical values for comparison
    // For N(0,1): E[X]=0, Var(X)=1, and P(|X| <= 1) = erf(1/sqrt(2))
    long double theo_mean = 0.0L;
    long double theo_var = 1.0L;
    long double theo_p_interval = erf(1.0L / sqrtl(2.0L));

    cout.setf(ios::fixed);
    cout << setprecision(6);
    cout << "Samples: " << N << "\n";
    cout << "Estimated mean      : " << mean << " (theory: " << theo_mean << ")\n";
    cout << "Estimated variance  : " << var << " (theory: " << theo_var << ")\n";
    cout << "Estimated P(|X|<=1) : " << p_interval << " (theory: " << theo_p_interval << ")\n";

    return 0;
}
This program samples N values from a standard normal distribution using C++ <random>, then estimates the mean, variance, and the probability that |X| <= 1. It compares Monte Carlo estimates to known theoretical values, illustrating E[X], Var(X), and probabilities computed as expectations of indicator functions. By the law of large numbers, the estimates approach the truth as N grows.
#include <bits/stdc++.h>
using namespace std;

// Compute posterior P(Disease | TestPositive) given prior, sensitivity, and specificity
// prior = P(D), sens = P(T+ | D), spec = P(T- | ~D)
double bayes_posterior(double prior, double sens, double spec) {
    double pD = prior;
    double pNotD = 1.0 - pD;
    double pTpos_given_D = sens;
    double pTpos_given_notD = 1.0 - spec; // false positive rate
    double pTpos = pTpos_given_D * pD + pTpos_given_notD * pNotD; // law of total probability
    return (pTpos_given_D * pD) / pTpos;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    double prior = 0.01; // 1% prevalence
    double sens = 0.95;  // sensitivity: 95%
    double spec = 0.90;  // specificity: 90%

    double exact = bayes_posterior(prior, sens, spec);

    // Monte Carlo validation
    random_device rd; mt19937 gen(rd());
    bernoulli_distribution disease(prior);
    bernoulli_distribution tp(sens); // true positive when disease present
    bernoulli_distribution tn(spec); // true negative when disease absent

    const size_t N = 500000; // trials
    size_t test_pos = 0;
    size_t dise_and_pos = 0;

    for (size_t i = 0; i < N; ++i) {
        bool D = disease(gen);
        // with disease, positive with prob sens; without, positive
        // exactly when the test is not a true negative, prob (1 - specificity)
        bool Tpos = D ? tp(gen) : !tn(gen);
        if (Tpos) {
            ++test_pos;
            if (D) ++dise_and_pos;
        }
    }

    double mc = (test_pos > 0) ? (double)dise_and_pos / test_pos : 0.0;

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Exact P(D|T+): " << exact << "\n";
    cout << "MC    P(D|T+): " << mc << " (N=" << N << ")\n";

    return 0;
}
We compute the posterior probability of disease after a positive test using Bayes' theorem from prior prevalence, sensitivity, and specificity. We then simulate many patients: draw disease status, generate test results based on sensitivity/specificity, and estimate P(D|T+) as dise_and_pos/test_pos. The Monte Carlo estimate converges to the exact posterior by the law of large numbers.
#include <bits/stdc++.h>
using namespace std;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    random_device rd; mt19937 gen(rd());
    exponential_distribution<double> exp1(1.0); // mean=1, var=1

    const int n = 50;         // samples per trial (increase to see stronger CLT)
    const int trials = 10000; // number of trials

    // We standardize: Z = sqrt(n) * (Xbar - mu) / sigma, with mu=1, sigma=1 for Exp(1)
    vector<double> Z; Z.reserve(trials);

    for (int t = 0; t < trials; ++t) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += exp1(gen);
        double xbar = s / n;
        double z = sqrt((double)n) * (xbar - 1.0) / 1.0;
        Z.push_back(z);
    }

    // Build histogram over [-4,4] with B bins
    const int B = 20;
    const double L = -4.0, R = 4.0;
    vector<int> hist(B, 0);
    for (double z : Z) {
        if (z < L || z >= R) continue; // ignore out of range for display
        int idx = (int)((z - L) / (R - L) * B);
        if (0 <= idx && idx < B) hist[idx]++;
    }

    // Print ASCII histogram
    int maxc = *max_element(hist.begin(), hist.end());
    cout << "Histogram of standardized means (approx N(0,1))\n";
    for (int b = 0; b < B; ++b) {
        double a = L + (R - L) * b / B;
        double c = L + (R - L) * (b + 1) / B;
        int bar = (maxc > 0) ? (50 * hist[b] / maxc) : 0;
        cout << fixed << setprecision(2) << "[" << setw(5) << a << ", " << setw(5) << c << ") ";
        cout << string(bar, '#') << "\n";
    }

    return 0;
}
We repeatedly take averages of n iid Exponential(1) samples, standardize them, and bin the results. By the central limit theorem, the histogram of standardized means approaches the bell curve N(0,1) as n and/or the number of trials grow. This demonstrates how non-Gaussian data (exponential) produce Gaussian-looking averages.