Math · Intermediate

Law of Large Numbers

Key Points

  • The Weak Law of Large Numbers (WLLN) says that the sample average of independent, identically distributed (i.i.d.) random variables with finite mean gets close to the true mean with high probability as the sample size grows.
  • Formally, for i.i.d. X_1, X_2, … with mean μ, the sample mean \bar{X}_n converges in probability to μ, written \bar{X}_n \xrightarrow{P} \mu.
  • A practical takeaway is that averaging reduces noise: the variance of the sample mean is σ²/n, so doubling the sample size halves the variance.
  • Chebyshev’s inequality gives a simple bound: P(|\bar{X}_n − μ| ≥ ε) ≤ σ²/(nε²), which goes to 0 as n increases.
  • For bounded data, sharper concentration results (e.g., Hoeffding’s inequality) yield error probabilities that are exponentially small in n.
  • The WLLN differs from the Strong Law, which guarantees almost sure convergence; the WLLN guarantees convergence in probability only.
  • The law can fail for heavy-tailed distributions with infinite mean, such as the Cauchy distribution; there, averages may not stabilize.
  • In algorithms and data science, the WLLN justifies Monte Carlo estimation and choosing sample sizes to achieve target accuracy with high probability.

Prerequisites

  • Random variables and distributions — Understanding what i.i.d. samples, means, and variances are is essential to interpreting LLN statements.
  • Expectation and variance — The LLN is about the convergence of sample averages to the expected value and uses variance to bound deviations.
  • Modes of convergence — Distinguishing convergence in probability from almost sure convergence clarifies the WLLN versus the SLLN.
  • Markov and Chebyshev inequalities — These provide simple, distribution-free bounds used in elementary proofs and sample-size calculations.
  • Limits and epsilon-delta reasoning — Convergence in probability is defined via limits with ε thresholds.

Detailed Explanation


01 Overview

The Law of Large Numbers (LLN) captures a fundamental stabilization effect of averages. When we observe many independent repetitions of the same random experiment—say, rolling a fair die or measuring sensor noise—and compute their average, this average tends to approach the true underlying expected value. The Weak Law of Large Numbers (WLLN) formalizes this by stating that the sample mean converges in probability to the true mean μ. In practical terms, for any tolerance ε you care about, the chance that the sample mean differs from μ by more than ε can be made arbitrarily small by taking enough samples.

This matters across statistics, algorithms, and engineering. In Monte Carlo methods, we approximate expectations by averaging random samples. In quality control, we aggregate measurements to estimate a production line’s true defect rate. In machine learning, we often treat empirical risk (an average loss over data) as a proxy for expected risk.

The WLLN requires that the random variables are independent and identically distributed and have a finite mean. A common sufficient condition used in elementary proofs is finite variance, enabling a direct application of Chebyshev’s inequality. While the WLLN ensures probabilistic convergence, it does not say that every single long run must be close (that’s the Strong Law); it guarantees that the probability of a large deviation goes to zero as the sample size grows.

02 Intuition & Analogies

Imagine listening to a noisy radio station. Any single moment’s sound may be distorted, but if you record and average many short clips, the persistent signal emerges while random crackles cancel out. That’s the LLN: random ups and downs balance out when we gather enough independent glimpses of the same phenomenon.

Consider coin flips. After 10 flips, getting 8 heads isn’t shocking. After 10,000 flips, getting 8,000 heads would be astonishing. The proportion of heads gravitates toward 0.5 as you flip more, because extreme imbalances become rarer. Averaging acts like a stabilizer: each new observation nudges the mean slightly, and because nudges are as likely to push up as down, they tend to cancel.

Think of a classroom where each student estimates the number of candies in a jar. Individual guesses vary widely, but the class average is usually quite close to the truth. Or consider measuring room temperature with a cheap thermometer that’s jittery: one reading is unreliable, but the average of many settles near the true temperature.

A mechanical analogy: picture a mass attached to a spring at the true mean μ. Random forces jostle the mass, but friction (averaging) damps motion; with more time (more samples), the mass lingers near μ. The WLLN’s promise is not that the average marches monotonically to μ, but that the probability of it straying far from μ becomes tiny as the sample count grows. For bounded-noise systems, the pull toward μ strengthens swiftly (exponential concentration), while for general finite-variance noise, the pull is slower but still inevitable.

03 Formal Definition

Let \{X_n\}_{n \ge 1} be i.i.d. random variables with finite mean μ = E[X_1]. Define the sample mean \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i. The Weak Law of Large Numbers states that for every ε > 0,

\[ \lim_{n\to\infty} P(|\bar{X}_n - \mu| > \varepsilon) = 0. \]

This is written compactly as \bar{X}_n \xrightarrow{P} \mu, meaning convergence in probability. A common sufficient condition enabling a short proof is finite variance: if Var(X_1) = σ² < ∞, then Var(\bar{X}_n) = σ²/n and Chebyshev’s inequality implies

\[ P(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2} \to 0. \]

More generally, the WLLN holds under weaker assumptions (e.g., E|X_1| < ∞) without requiring finite variance, but the proof is more involved. Convergence in probability differs from almost sure (a.s.) convergence: the WLLN guarantees that for any fixed error tolerance, the chance of large deviations eventually becomes negligible, but it does not assert that the realized sequence \bar{X}_n(ω) converges to μ for almost every outcome ω (that is the Strong Law).

04 When to Use

  • Monte Carlo estimation: To approximate an expectation E[f(Z)], generate i.i.d. samples Z_i and average f(Z_i). The WLLN ensures the estimator \bar{Y}_n = \frac{1}{n}\sum_i f(Z_i) becomes accurate with high probability as n grows.
  • Setting sample sizes: With variance estimates or boundedness assumptions, use Chebyshev’s or Hoeffding’s bounds to choose n for a target accuracy ε and confidence 1 − δ.
  • Quality control and A/B testing: Proportions (e.g., defect rate, click-through rate) are sample means of Bernoulli variables. The WLLN justifies that observed rates stabilize near the true rate with enough data.
  • Streaming analytics: When data arrive continuously, maintain a running mean to reduce noise and estimate a stable baseline. The WLLN supports trusting the running average as it accumulates samples.
  • Algorithm analysis: In randomized algorithms, empirical performance metrics often average random costs; the WLLN supports using long-run averages as consistent performance estimates.

Avoid relying on the WLLN when data are dependent (strong autocorrelation), not identically distributed (drifting distributions), or heavy-tailed with infinite mean (e.g., Cauchy). In such cases, convergence can fail or be misleading without additional conditions (mixing, stationarity, truncation, or robust estimators).

⚠️ Common Mistakes

  • Confusing WLLN with the Strong Law: WLLN gives convergence in probability, not almost sure convergence. It does not guarantee that one particular long run must stabilize, only that large deviations become unlikely for large n.
  • Ignoring assumptions: Independence and identical distribution matter. Time series with autocorrelation or concept drift can violate LLN conditions; sample means may converge to biased limits or fail to stabilize.
  • Overlooking finite-mean requirement: Heavy-tailed distributions with infinite mean (e.g., Cauchy) can defeat LLN. Averages of Cauchy samples need not converge and can be wildly unstable even for large n.
  • Expecting deterministic error rates: Chebyshev’s bound is loose and decays as 1/n, while bounded-variable inequalities (Hoeffding) decay exponentially. Using the wrong bound leads to poor sample-size planning.
  • Misapplying CLT: The Central Limit Theorem refines fluctuation shape, not convergence itself. For small n or heavy tails, normal approximations can be inaccurate; LLN remains valid (under its conditions) but does not prescribe a normal error profile.
  • Numerical accumulation errors: Storing all samples and summing naively can cause overflow or precision loss. Use streaming updates and double precision; for variance estimates, prefer numerically stable algorithms (e.g., Welford).

Key Formulas

Sample Mean

\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i

Explanation: The average of n i.i.d. observations. It is the basic estimator of the population mean and is central to the LLN.

Weak LLN (Convergence in Probability)

\lim_{n\to\infty} P(|\bar{X}_n - \mu| > \varepsilon) = 0

Explanation: This states that for any fixed tolerance ε, the probability that the sample mean deviates from the true mean by more than ε goes to zero as the sample size grows.

Variance of the Sample Mean

\mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n}

Explanation: If the X_i are i.i.d. with variance σ², then averaging n of them reduces the variance by a factor of n. This quantifies noise reduction by averaging.
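The factor-of-n reduction follows directly from independence; a short derivation:

```latex
\mathrm{Var}(\bar{X}_n)
  = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i)  % independence: cross covariances vanish
  = \frac{1}{n^2}\, n\sigma^2
  = \frac{\sigma^2}{n}.
```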

Chebyshev’s Inequality

P(|X - \mu| \ge \varepsilon) \le \frac{\mathrm{Var}(X)}{\varepsilon^2}

Explanation: For any random variable with finite variance, the chance of deviating from the mean by ε or more is bounded by the variance divided by ε². It is distribution-free but often loose.

Chebyshev for the Mean

P(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\varepsilon^2}

Explanation: Applying Chebyshev’s inequality to the sample mean, whose variance is σ²/n, yields a 1/n decay of the large-deviation probability.

Hoeffding’s Inequality (Bounded Variables)

P(|\bar{X}_n - \mu| \ge \varepsilon) \le 2\exp\!\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right)

Explanation: If each Xi​ lies in [a,b], then deviations of the sample mean from its mean decay exponentially in n. This is tighter than Chebyshev when boundedness holds.

Sample Size via Hoeffding

n \ge \frac{(b-a)^2}{2\varepsilon^2}\ln\!\left(\frac{2}{\delta}\right)

Explanation: To ensure P(|\bar{X}_n − μ| ≥ ε) ≤ δ for bounded data in [a, b], choose n at least this large. It provides practical sizing for experiments.

Convergence in Probability (Definition)

\lim_{n\to\infty} P(|X_n - X| > \varepsilon) = 0

Explanation: This is the general definition used in WLLN: the sequence Xn​ converges in probability to X if large deviations vanish in probability for every ε > 0.

Central Limit Theorem (Context)

\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{d} \mathcal{N}(0,1)

Explanation: Although separate from the LLN, the CLT explains how the fluctuations around the mean behave asymptotically: they approach a normal distribution when scaled by √n.

Markov’s Inequality

P(X \ge t) \le \frac{E[X]}{t}, \quad X \ge 0,\ t > 0

Explanation: A basic tail bound for nonnegative variables, often a starting point to derive other inequalities like Chebyshev when applied to squared deviations.
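The step from Markov to Chebyshev mentioned here is a one-line substitution:

```latex
P(|X - \mu| \ge \varepsilon)
  = P\big((X - \mu)^2 \ge \varepsilon^2\big)
  \le \frac{E\big[(X - \mu)^2\big]}{\varepsilon^2}  % Markov applied to (X - \mu)^2 \ge 0
  = \frac{\mathrm{Var}(X)}{\varepsilon^2}.
```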

Complexity Analysis

Simulating and verifying the Weak Law of Large Numbers computationally primarily involves generating random samples and maintaining running sums. For a single experiment that computes a sample mean from n i.i.d. observations, the time complexity is O(n): each sample requires a constant-time generation step and an O(1) update of the running total. The space complexity is O(1) if we maintain only the running sum (and perhaps a running mean/variance), because we do not need to store all samples.

When estimating probabilities such as P(|\bar{X}_n − μ| ≥ ε) empirically, we typically repeat the experiment M times. This multiplies the time cost by M, yielding O(Mn) time and O(1) additional space per replicate (again, storing only aggregates). If we track results across multiple n values (say, k different sample sizes), the total time is O(M \sum_{j=1}^{k} n_j), with negligible space overhead beyond simple counters. For bounded variables, evaluating theoretical bounds like Chebyshev (σ²/(nε²)) and Hoeffding (2e^{−2nε²/(b−a)²}) is O(1) per n.

Numerical concerns include floating-point precision when summing many values; using double precision and a streaming update mitigates round-off issues. If variance is also estimated (e.g., Welford’s algorithm), time remains O(n) and space O(1). In streaming contexts, where data arrive continuously and we maintain a running mean, each new sample updates the estimate in O(1) time with O(1) memory. Overall, verifying LLN behavior scales linearly with the number of samples processed, and memory usage is constant, making such simulations efficient even for large n (subject to random number generator throughput).

Code Examples

Validate WLLN on Bernoulli(p): empirical error vs Chebyshev and Hoeffding bounds
#include <bits/stdc++.h>
using namespace std;

// Empirically estimate P(|xbar_n - p| >= eps) for Bernoulli(p) over M trials
pair<double,double> empirical_error_and_mean(size_t n, size_t M, double p, double eps, mt19937_64 &rng) {
    bernoulli_distribution bern(p);
    size_t count_bad = 0;
    double mean_of_means = 0.0;
    for (size_t t = 0; t < M; ++t) {
        // Streaming sum to avoid storing samples
        size_t ones = 0;
        for (size_t i = 0; i < n; ++i) {
            ones += bern(rng) ? 1 : 0;
        }
        double xbar = static_cast<double>(ones) / static_cast<double>(n);
        mean_of_means += xbar;
        if (fabs(xbar - p) >= eps) ++count_bad;
    }
    mean_of_means /= static_cast<double>(M);
    double empirical_prob = static_cast<double>(count_bad) / static_cast<double>(M);
    return {empirical_prob, mean_of_means};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Parameters
    double p = 0.6;    // Bernoulli mean
    double eps = 0.05; // Deviation threshold
    vector<size_t> ns = {200, 1000, 5000};
    size_t M = 2000;   // Number of Monte Carlo replications per n

    random_device rd;
    mt19937_64 rng(rd());

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Bernoulli(p) WLLN validation with p=" << p << ", eps=" << eps << ", M=" << M << "\n";
    cout << "n, empirical P(|xbar-p|>=eps), Chebyshev, Hoeffding, mean(xbar)\n";

    double sigma2 = p * (1.0 - p); // Variance of Bernoulli(p)
    double a = 0.0, b = 1.0;       // Bounds for Hoeffding

    for (size_t n : ns) {
        auto [emp_prob, mean_xbar] = empirical_error_and_mean(n, M, p, eps, rng);
        double chebyshev = sigma2 / (static_cast<double>(n) * eps * eps);
        double hoeffding = 2.0 * exp(-2.0 * static_cast<double>(n) * eps * eps / ((b - a) * (b - a)));
        cout << n << ", " << emp_prob << ", " << min(1.0, chebyshev) << ", " << min(1.0, hoeffding) << ", " << mean_xbar << "\n";
    }

    return 0;
}

We flip a biased coin (Bernoulli(p)) n times and compute the sample mean over M independent replications. The empirical frequency of large deviations |\bar X_n − p| ≥ ε illustrates convergence in probability as n increases. We compare this to Chebyshev’s 1/n bound and Hoeffding’s exponential bound for bounded variables. The mean of the sample means should be close to p, reflecting unbiasedness.

Time: O(Mn) · Space: O(1)
Streaming mean converges for Normal(μ, σ²): online update without storing data
#include <bits/stdc++.h>
using namespace std;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // True parameters
    const double mu = 5.0;
    const double sigma = 2.0;

    random_device rd;
    mt19937_64 rng(rd());
    normal_distribution<double> norm(mu, sigma);

    // Streaming mean update: mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k
    const size_t N = 200000; // total samples
    double mean = 0.0;

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "k, running_mean, abs_error\n";
    for (size_t k = 1; k <= N; ++k) {
        double x = norm(rng);
        mean += (x - mean) / static_cast<double>(k);

        if (k <= 10 || k == 100 || k == 1000 || k == 10000 || k == 100000 || k == N) {
            cout << k << ", " << mean << ", " << fabs(mean - mu) << "\n";
        }
    }

    // Note: The absolute error generally shrinks as k grows, consistent with the WLLN.
    return 0;
}

This program generates Normal(μ, σ²) samples and maintains a running average using the O(1) online update formula. Without storing data, we observe that the running mean tends to get closer to the true μ as k grows, demonstrating the WLLN in a streaming setting. Selected checkpoints show the decreasing absolute error.

Time: O(N) · Space: O(1)
Pitfall: Cauchy samples defy LLN (no finite mean)
#include <bits/stdc++.h>
using namespace std;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    random_device rd;
    mt19937_64 rng(rd());
    cauchy_distribution<double> cauchy(0.0, 1.0); // location=0, scale=1

    vector<size_t> ns = {100, 1000, 10000, 100000};

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "n, sample_mean\n";
    for (size_t n : ns) {
        long double sum = 0.0L;
        for (size_t i = 0; i < n; ++i) {
            double x = cauchy(rng);
            sum += x;
        }
        double xbar = static_cast<double>(sum / static_cast<long double>(n));
        cout << n << ", " << xbar << "\n";
    }

    // Expect sample means to be erratic, even for large n, since the Cauchy
    // distribution has no finite mean.
    return 0;
}

The standard Cauchy distribution has no finite mean or variance. This violates the LLN’s assumptions, and sample averages need not stabilize as n increases. The code demonstrates that \bar X_n can remain highly unstable, warning against naive averaging under heavy tails.

Time: O(n) per listed n (overall O(∑n)) · Space: O(1)
Tags: law of large numbers · weak law · sample mean · convergence in probability · Chebyshev inequality · Hoeffding inequality · Monte Carlo · i.i.d. · heavy-tailed · Cauchy distribution · central limit theorem · variance reduction · streaming mean · bounded variables · probability bounds