Law of Large Numbers
Key Points
- The Weak Law of Large Numbers (WLLN) says that the sample average of independent, identically distributed (i.i.d.) random variables with finite mean gets close to the true mean with high probability as the sample size grows.
- Formally, for i.i.d. X_1, X_2, … with mean μ, the sample mean \bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i converges in probability to μ, written \bar X_n \xrightarrow{P} \mu.
- A practical takeaway is that averaging reduces noise: the variance of the sample mean is σ²/n, so doubling the sample size halves the variance.
- Chebyshev’s inequality gives a simple bound: P(|\bar X_n − μ| ≥ ε) ≤ σ²/(nε²), which goes to 0 as n increases.
- For bounded data, sharper concentration results (e.g., Hoeffding’s inequality) yield error probabilities that shrink exponentially in n.
- The WLLN differs from the Strong Law, which guarantees almost sure convergence; the WLLN guarantees convergence in probability only.
- The law can fail for heavy-tailed distributions with infinite mean, such as the Cauchy distribution; there, averages may not stabilize.
- In algorithms and data science, the WLLN justifies Monte Carlo estimation and choosing sample sizes to achieve target accuracy with high probability.
Prerequisites
- Random variables and distributions — Understanding what i.i.d. samples, means, and variances are is essential to interpreting LLN statements.
- Expectation and variance — The LLN is about the convergence of sample averages to the expected value and uses variance to bound deviations.
- Modes of convergence — Distinguishing convergence in probability from almost sure convergence clarifies the WLLN versus the SLLN.
- Markov and Chebyshev inequalities — Provide simple, distribution-free bounds used in elementary proofs and sample-size calculations.
- Limits and epsilon-delta reasoning — Convergence in probability is defined via limits with ε thresholds.
Detailed Explanation
01 Overview
The Law of Large Numbers (LLN) captures a fundamental stabilization effect of averages. When we observe many independent repetitions of the same random experiment—say, rolling a fair die or measuring sensor noise—and compute their average, this average tends to approach the true underlying expected value. The Weak Law of Large Numbers (WLLN) formalizes this by stating that the sample mean converges in probability to the true mean μ. In practical terms, for any tolerance ε you care about, the chance that the sample mean differs from μ by more than ε can be made arbitrarily small by taking enough samples.

This matters across statistics, algorithms, and engineering. In Monte Carlo methods, we approximate expectations by averaging random samples. In quality control, we aggregate measurements to estimate a production line’s true defect rate. In machine learning, we often treat empirical risk (an average loss over data) as a proxy for expected risk.

The WLLN requires that the random variables are independent and identically distributed and have a finite mean. A common sufficient condition used in elementary proofs is finite variance, enabling a direct application of Chebyshev’s inequality. While the WLLN ensures probabilistic convergence, it does not say that every single long run must be close (that’s the Strong Law); it guarantees that the probability of a large deviation goes to zero as the sample size grows.
02 Intuition & Analogies
Imagine listening to a noisy radio station. Any single moment’s sound may be distorted, but if you record and average many short clips, the persistent signal emerges while random crackles cancel out. That’s the LLN: random ups and downs balance out when we gather enough independent glimpses of the same phenomenon.

Consider coin flips. After 10 flips, getting 8 heads isn’t shocking. After 10,000 flips, getting 8,000 heads would be astonishing. The proportion of heads gravitates toward 0.5 as you flip more, because extreme imbalances become rarer. Averaging acts like a stabilizer: each new observation nudges the mean slightly, and because nudges are as likely to push up as down, they tend to cancel.

Think of a classroom where each student estimates the number of candies in a jar. Individual guesses vary widely, but the class average is usually quite close to the truth. Or consider measuring room temperature with a cheap thermometer that’s jittery: one reading is unreliable, but the average of many settles near the true temperature.

A mechanical analogy: picture a mass attached to a spring at the true mean μ. Random forces jostle the mass, but friction (averaging) damps the motion; with more time (more samples), the mass lingers near μ. The WLLN’s promise is not that the average marches monotonically to μ, but that the probability of it straying far from μ becomes tiny as the sample count grows. For bounded-noise systems, the pull toward μ strengthens swiftly (exponential concentration), while for general finite-variance noise, the pull is slower but still inevitable.
03 Formal Definition
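A standard statement of the WLLN, consistent with the formulas used elsewhere in this article:

```latex
\textbf{Weak Law of Large Numbers.} Let $X_1, X_2, \dots$ be i.i.d. random variables
with finite mean $\mu = \mathbb{E}[X_1]$, and let
$\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then for every $\varepsilon > 0$,
\[
\lim_{n \to \infty} P\left(\left|\bar X_n - \mu\right| \ge \varepsilon\right) = 0,
\]
written $\bar X_n \xrightarrow{P} \mu$.
```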
04 When to Use
- Monte Carlo estimation: To approximate an expectation \mathbb{E}[f(Z)], generate i.i.d. samples Z_i and average f(Z_i). The WLLN ensures the estimator \bar Y_n = \frac{1}{n}\sum f(Z_i) becomes accurate with high probability as n grows.
- Setting sample sizes: With variance estimates or boundedness assumptions, use Chebyshev’s or Hoeffding’s bounds to choose n for a target accuracy ε and confidence 1−δ.
- Quality control and A/B testing: Proportions (e.g., defect rate, click-through rate) are sample means of Bernoulli variables. The WLLN justifies that observed rates stabilize near the true rate with enough data.
- Streaming analytics: When data arrive continuously, maintain a running mean to reduce noise and estimate a stable baseline. The WLLN supports trusting the running average as it accumulates samples.
- Algorithm analysis: In randomized algorithms, empirical performance metrics often average random costs; the WLLN supports using long-run averages as consistent performance estimates. Avoid relying on the WLLN when data are dependent (strong autocorrelation), not identically distributed (drifting distributions), or heavy-tailed with infinite mean (e.g., Cauchy). In such cases, convergence can fail or be misleading without additional conditions (mixing, stationarity, truncation, or robust estimators).
⚠️ Common Mistakes
- Confusing WLLN with the Strong Law: WLLN gives convergence in probability, not almost sure convergence. It does not guarantee that one particular long run must stabilize, only that large deviations become unlikely for large n.
- Ignoring assumptions: Independence and identical distribution matter. Time series with autocorrelation or concept drift can violate LLN conditions; sample means may converge to biased limits or fail to stabilize.
- Overlooking finite-mean requirement: Heavy-tailed distributions with infinite mean (e.g., Cauchy) can defeat LLN. Averages of Cauchy samples need not converge and can be wildly unstable even for large n.
- Expecting deterministic error rates: Chebyshev’s bound is loose and decays as 1/n, while bounded-variable inequalities (Hoeffding) decay exponentially. Using the wrong bound leads to poor sample-size planning.
- Misapplying CLT: The Central Limit Theorem refines fluctuation shape, not convergence itself. For small n or heavy tails, normal approximations can be inaccurate; LLN remains valid (under its conditions) but does not prescribe a normal error profile.
- Numerical accumulation errors: Storing all samples and summing naively can cause overflow or precision loss. Use streaming updates and double precision; for variance estimates, prefer numerically stable algorithms (e.g., Welford).
Key Formulas
Sample Mean
\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i
Explanation: The average of n i.i.d. observations. It is the basic estimator of the population mean and is central to the LLN.
Weak LLN (Convergence in Probability)
\lim_{n\to\infty} P\left(\left|\bar X_n - \mu\right| \ge \varepsilon\right) = 0 \quad \text{for every } \varepsilon > 0
Explanation: This states that for any fixed tolerance ε, the probability that the sample mean deviates from the true mean by more than ε goes to zero as the sample size grows.
Variance of the Sample Mean
\operatorname{Var}(\bar X_n) = \frac{\sigma^2}{n}
Explanation: If X_1, …, X_n are i.i.d. with variance σ², then averaging n of them reduces the variance by a factor of n. This quantifies the noise reduction achieved by averaging.
Chebyshev’s Inequality
P\left(\left|X - \mu\right| \ge \varepsilon\right) \le \frac{\sigma^2}{\varepsilon^2}
Explanation: For any random variable with finite variance, the chance of deviating from the mean by ε or more is bounded by the variance divided by ε². It is distribution-free but often loose.
Chebyshev for the Mean
P\left(\left|\bar X_n - \mu\right| \ge \varepsilon\right) \le \frac{\sigma^2}{n\varepsilon^2}
Explanation: Applying Chebyshev’s inequality to the sample mean, whose variance is σ²/n, yields a 1/n decay of the large-deviation probability.
Hoeffding’s Inequality (Bounded Variables)
P\left(\left|\bar X_n - \mathbb{E}[\bar X_n]\right| \ge \varepsilon\right) \le 2\exp\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right)
Explanation: If each X_i lies in [a,b], then deviations of the sample mean from its mean decay exponentially in n. This is tighter than Chebyshev when boundedness holds.
Sample Size via Hoeffding
n \ge \frac{(b-a)^2 \ln(2/\delta)}{2\varepsilon^2}
Explanation: To ensure P(|\bar X_n − μ| ≥ ε) ≤ δ for bounded data in [a,b], choose n at least this large. It provides practical sizing for experiments.
Convergence in Probability (Definition)
X_n \xrightarrow{P} X \iff \lim_{n\to\infty} P\left(\left|X_n - X\right| \ge \varepsilon\right) = 0 \ \text{ for every } \varepsilon > 0
Explanation: This is the general definition used in the WLLN: the sequence X_n converges in probability to X if the probability of a deviation larger than any fixed ε vanishes as n grows.
Central Limit Theorem (Context)
\sqrt{n}\left(\bar X_n - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)
Explanation: Although separate from the LLN, the CLT explains how the fluctuations around the mean behave asymptotically: they approach a normal distribution when scaled by √n.
Markov’s Inequality
P(X \ge t) \le \frac{\mathbb{E}[X]}{t} \quad \text{for } X \ge 0,\ t > 0
Explanation: A basic tail bound for nonnegative variables, often a starting point to derive other inequalities like Chebyshev when applied to squared deviations.
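As a one-line illustration of that last remark, applying Markov’s inequality to the nonnegative variable (X − μ)² recovers Chebyshev’s inequality:

```latex
P\left(|X - \mu| \ge \varepsilon\right)
= P\left((X - \mu)^2 \ge \varepsilon^2\right)
\le \frac{\mathbb{E}\left[(X - \mu)^2\right]}{\varepsilon^2}
= \frac{\sigma^2}{\varepsilon^2}.
```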
Complexity Analysis
Computing the sample mean of n observations costs O(n) time. A streaming (running-mean) update needs only O(1) extra memory and O(1) work per new sample, so the estimators in the examples below scale linearly with the number of samples drawn.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Empirically estimate P(|\bar X_n - p| >= eps) for Bernoulli(p) over M trials
pair<double,double> empirical_error_and_mean(size_t n, size_t M, double p, double eps, mt19937_64 &rng) {
    bernoulli_distribution bern(p);
    size_t count_bad = 0;
    double mean_of_means = 0.0;
    for (size_t t = 0; t < M; ++t) {
        // Streaming sum to avoid storing samples
        size_t ones = 0;
        for (size_t i = 0; i < n; ++i) {
            ones += bern(rng) ? 1 : 0;
        }
        double xbar = static_cast<double>(ones) / static_cast<double>(n);
        mean_of_means += xbar;
        if (fabs(xbar - p) >= eps) ++count_bad;
    }
    mean_of_means /= static_cast<double>(M);
    double empirical_prob = static_cast<double>(count_bad) / static_cast<double>(M);
    return {empirical_prob, mean_of_means};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Parameters
    double p = 0.6;    // Bernoulli mean
    double eps = 0.05; // Deviation threshold
    vector<size_t> ns = {200, 1000, 5000};
    size_t M = 2000;   // Number of Monte Carlo replications per n

    random_device rd;
    mt19937_64 rng(rd());

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Bernoulli(p) WLLN validation with p=" << p << ", eps=" << eps << ", M=" << M << "\n";
    cout << "n, empirical P(|xbar-p|>=eps), Chebyshev, Hoeffding, mean(xbar)\n";

    double sigma2 = p * (1.0 - p); // Variance of Bernoulli(p)
    double a = 0.0, b = 1.0;       // Bounds for Hoeffding

    for (size_t n : ns) {
        auto [emp_prob, mean_xbar] = empirical_error_and_mean(n, M, p, eps, rng);
        double chebyshev = sigma2 / (static_cast<double>(n) * eps * eps);
        double hoeffding = 2.0 * exp(-2.0 * static_cast<double>(n) * eps * eps / ((b - a) * (b - a)));
        cout << n << ", " << emp_prob << ", " << min(1.0, chebyshev) << ", " << min(1.0, hoeffding) << ", " << mean_xbar << "\n";
    }

    return 0;
}
We flip a biased coin (Bernoulli(p)) n times and compute the sample mean over M independent replications. The empirical frequency of large deviations |\bar X_n − p| ≥ ε illustrates convergence in probability as n increases. We compare this to Chebyshev’s 1/n bound and Hoeffding’s exponential bound for bounded variables. The mean of the sample means should be close to p, reflecting unbiasedness.
#include <bits/stdc++.h>
using namespace std;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // True parameters
    const double mu = 5.0;
    const double sigma = 2.0;

    random_device rd;
    mt19937_64 rng(rd());
    normal_distribution<double> norm(mu, sigma);

    // Streaming mean update: mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k
    const size_t N = 200000; // total samples
    double mean = 0.0;

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "k, running_mean, abs_error\n";
    for (size_t k = 1; k <= N; ++k) {
        double x = norm(rng);
        mean += (x - mean) / static_cast<double>(k);

        if (k <= 10 || k == 100 || k == 1000 || k == 10000 || k == 100000 || k == N) {
            cout << k << ", " << mean << ", " << fabs(mean - mu) << "\n";
        }
    }

    // Note: The absolute error generally shrinks as k grows, consistent with WLLN.
    return 0;
}
This program generates Normal(μ, σ²) samples and maintains a running average using the O(1) online update formula. Without storing data, we observe that the running mean tends to get closer to the true μ as k grows, demonstrating the WLLN in a streaming setting. Selected checkpoints show the decreasing absolute error.
#include <bits/stdc++.h>
using namespace std;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    random_device rd;
    mt19937_64 rng(rd());
    cauchy_distribution<double> cauchy(0.0, 1.0); // location=0, scale=1

    vector<size_t> ns = {100, 1000, 10000, 100000};

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "n, sample_mean\n";
    for (size_t n : ns) {
        long double sum = 0.0L;
        for (size_t i = 0; i < n; ++i) {
            double x = cauchy(rng);
            sum += x;
        }
        double xbar = static_cast<double>(sum / static_cast<long double>(n));
        cout << n << ", " << xbar << "\n";
    }

    // Expect sample means to be erratic, even for large n, since the Cauchy distribution has no finite mean.
    return 0;
}
The standard Cauchy distribution has no finite mean or variance. This violates the LLN’s assumptions, and sample averages need not stabilize as n increases. The code demonstrates that \bar X_n can remain highly unstable, warning against naive averaging under heavy tails.