How I Study AI - Learn AI Papers & Lectures the Easy Way
Math · Intermediate

Hypothesis Testing

Key Points

  • Hypothesis testing is a decision-making process to evaluate claims about a population using sample data.
  • You start with a null hypothesis (H0) that represents the status quo and an alternative hypothesis (H1) that represents a meaningful change or effect.
  • A test statistic summarizes the evidence in your data, and a p-value measures how surprising your data would be if H0 were true.
  • If the p-value is below a pre-chosen significance level α (like 0.05), you reject H0; otherwise, you fail to reject H0.
  • Type I error (false positive) happens with probability α, while Type II error (false negative) has probability β; the test’s power is 1 − β.
  • Different tests fit different situations: z-test for means with known variance, t-test for unknown variance, binomial tests for proportions, and permutation tests when assumptions fail.
  • Two-sided tests look for differences in either direction; one-sided tests look for a difference in a specific direction.
  • In code, hypothesis tests are mainly O(n) for scanning data, with exact combinatorial tests or Monte Carlo resampling trading time for fewer assumptions.

Prerequisites

  • Basic probability — Understanding events, independence, and distributions is necessary to interpret p-values and sampling distributions.
  • Descriptive statistics — Means, variances, and standard errors are core ingredients of test statistics.
  • Normal distribution and CLT — z-tests and many approximations rely on normality or the Central Limit Theorem.
  • Combinatorics (binomial coefficients) — Exact tests for proportions use binomial probabilities and combinations.
  • Numerical methods in C++ — Stable computation of CDFs and probabilities requires functions like lgamma, erfc, and careful floating-point handling.

Detailed Explanation


01 Overview

Hypothesis testing is a statistical framework for deciding whether observed data provide enough evidence to challenge a default belief about a population. The default belief is called the null hypothesis (H0), such as “the mean equals 10” or “two groups have the same average.” The competing claim is the alternative hypothesis (H1), which states that the parameter differs (two-sided) or is greater/less (one-sided). We compress our data into a test statistic—like a z-score or t-score—that measures how far the data are from what H0 predicts. Using the sampling distribution of that statistic under H0, we compute a p-value: the probability of seeing data as extreme as ours if H0 were true. If this probability is small (below a threshold α, often 0.05), we reject H0 in favor of H1; otherwise, we do not reject H0. This process helps control the frequency of false alarms (Type I errors) while aiming to detect real effects. Different tests apply in different scenarios, depending on assumptions like known variance, normality, independence, or sample size. When assumptions are doubtful, nonparametric or resampling-based tests (like permutation tests) provide robust alternatives. Hypothesis testing is widely used in A/B testing, quality control, medicine, and scientific research to make principled, quantifiable decisions from data.

02 Intuition & Analogies

Imagine a courtroom. The defendant is presumed innocent (this is H0: no effect, no difference). The prosecutor presents evidence (your sample data). The jury asks: if the defendant were truly innocent (if H0 were true), how likely is it to see evidence this strong? That likelihood is the p-value. If it is very small, the jury concludes the evidence is too strong to be mere coincidence and returns a guilty verdict (reject H0). If it’s not small, the jury does not convict (fail to reject H0), though that doesn’t prove innocence—it only means the evidence isn’t strong enough. The significance level α is like the legal standard; a stricter standard (smaller α) makes convictions rarer but reduces false convictions (Type I errors). Conversely, there’s a risk of letting the guilty go free (Type II error) when real effects exist but the evidence appears weak, especially with small sample sizes.

In a lab or business setting, H0 could be “our current process has a defect rate of 2%,” and H1 might be “the defect rate changed.” You take a sample and measure the defect count. A test statistic converts your data into a standardized score comparing what you saw to what H0 predicts. For a mean with known variability, that’s a z-score showing how many standard errors away your sample mean is from the claimed mean. If it’s far, observing it under H0 is unlikely, and you may reject H0. If assumptions like normality don’t hold, you can still compare groups by repeatedly shuffling labels (a permutation test) to see how unusual your observed difference is compared to what random chance would yield. This keeps the spirit of the courtroom analogy while reducing reliance on fragile assumptions.

03 Formal Definition

Let X be data sampled from a distribution indexed by parameter θ. We partition the parameter space Θ into Θ0 (null) and Θ1 (alternative). We define a test statistic T(X) and a rejection region R such that the test rejects H0 when T(X) ∈ R. The test has size (significance level) α if sup_{θ∈Θ0} P_θ(T(X) ∈ R) ≤ α. For a realized sample x, the p-value is p(x) = inf{ α : x would fall in a size-α rejection region }, equivalently p(x) = P_θ( T(X) is at least as extreme as T(x) ) for θ ∈ Θ0, with the direction of “extreme” depending on H1 (right-, left-, or two-sided). The power function is π(θ) = P_θ(T(X) ∈ R) for θ ∈ Θ1, and the Type II error probability at θ is β(θ) = 1 − π(θ). Classical parametric tests specify T and its null distribution using model assumptions (e.g., normality and known variance for z-tests, unknown variance for t-tests). Nonparametric and resampling procedures, like permutation tests, approximate the null distribution of T by randomization, often requiring only exchangeability under H0. In simple-vs-simple hypotheses, the Neyman–Pearson lemma provides the most powerful test at level α using likelihood ratios; for composite hypotheses, uniformly most powerful tests exist only in special families.
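As a concrete instance of the power function, consider the one-sided z-test of H0: μ = μ0 against H1: μ > μ0 with known σ, which rejects when the z-statistic exceeds the normal quantile z₁₋α. A short derivation (a standard textbook result, stated here for illustration) gives a closed form:

```latex
\pi(\mu) \;=\; P_\mu\!\left(\frac{\bar X - \mu_0}{\sigma/\sqrt{n}} \ge z_{1-\alpha}\right)
\;=\; \Phi\!\left(\frac{\mu - \mu_0}{\sigma/\sqrt{n}} - z_{1-\alpha}\right),
\qquad \beta(\mu) \;=\; 1 - \pi(\mu).
```

Power grows with the effect size μ − μ0, with the sample size n, and with α, which is why small, underpowered studies tend to miss real effects.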

04 When to Use

Use hypothesis testing when you need a yes/no decision about a claim backed by quantified uncertainty. Typical cases include A/B testing in software (does variant B improve conversion?), manufacturing quality control (has the defect rate changed?), scientific experiments (is a treatment effective?), and cybersecurity (has the mean response time anomalously increased?). Choose the test based on data type and assumptions: use a one-sample z-test for means when σ is known and samples are independent and approximately normal or large (by the Central Limit Theorem); use a one-sample t-test when σ is unknown but normality is plausible; use binomial or proportion tests for counts of successes; use chi-square for categorical independence or goodness-of-fit; and use permutation tests when distributional assumptions are questionable or sample sizes are small. Prefer two-sided tests if any difference (higher or lower) matters; use one-sided only with a clear directional hypothesis specified before seeing the data. If running many tests (multiple metrics or many variants), adjust for multiplicity (e.g., Bonferroni or Holm) to control the overall false positive rate. When estimation matters as much as detection, report confidence intervals and effect sizes alongside p-values to convey practical significance.

⚠️ Common Mistakes

  1. Misinterpreting p-values: a p-value is not the probability H0 is true; it is the probability of observing data as extreme as yours if H0 were true.
  2. Equating “fail to reject” with “accepting H0”: not rejecting simply means insufficient evidence, not proof of no effect.
  3. Post-hoc switching between one- and two-sided tests: deciding tail direction after seeing the data inflates false positives.
  4. Ignoring assumptions: using a z-test or t-test without checking independence, normality (for small n), or equal variances can yield misleading conclusions.
  5. Multiple comparisons: testing many hypotheses at α = 0.05 virtually guarantees some false positives unless you adjust α.
  6. Overreliance on statistical significance: a tiny p-value with a huge sample can correspond to a trivial effect size—always consider practical impact.
  7. P-hacking and data snooping: repeated peeking, selective reporting, or trying many analyses until something “works” invalidates p-values.
  8. Numerical pitfalls: computing exact binomial probabilities via naive factorials causes overflow/underflow; use log-gamma and log-sum-exp tricks.
  9. Confusing confidence intervals and tests: for two-sided tests, rejecting at level α is equivalent to the null value lying outside the (1 − α) confidence interval; inconsistency often signals a mistake.
  10. Using permutation tests incorrectly: ensure exchangeability under H0; maintain group sizes and use enough permutations or Monte Carlo replicates for stable p-values.

Key Formulas

One-sample z-statistic

z = (x̄ − μ₀) / (σ/√n)

Explanation: This standardizes the distance between the sample mean and the hypothesized mean when the population standard deviation is known. Under H0 and normality (or large n), z follows a standard normal distribution.

One-sample t-statistic

t = (x̄ − μ₀) / (s/√n),  df = n − 1

Explanation: When the population standard deviation is unknown, replace it with the sample standard deviation s. Under H0 and normality, t follows a t-distribution with n − 1 degrees of freedom.

Chi-square statistic

χ² = Σ_{i=1}^{k} (Oᵢ − Eᵢ)² / Eᵢ

Explanation: Compares observed counts Oᵢ to expected counts Eᵢ across categories. Under H0 and large-sample conditions, this statistic follows a chi-square distribution with k − 1 degrees of freedom (adjusted for estimated parameters).

Right-tailed p-value

p-value = P_{H₀}( T(X) ≥ T(x_obs) )

Explanation: For a right-tailed test, the p-value is the probability under the null of observing a test statistic at least as large as the observed one. For left- or two-tailed tests, the tail definition changes accordingly.

Binomial PMF

P(X = k) = C(n, k) · pᵏ (1 − p)ⁿ⁻ᵏ

Explanation: Gives the probability of exactly k successes in n independent Bernoulli trials with success probability p. This underlies the exact binomial test for proportions.

Exact binomial two-sided p-value

p_two-sided = Σ_{i : P(X=i) ≤ P(X=k_obs)} P(X = i)

Explanation: The conservative two-sided p-value sums probabilities of outcomes with probability at most that of the observed count under H0. This ensures the test does not exceed the nominal α level.

Normal CDF via erfc

Φ(z) = (1/2) · erfc(−z/√2)

Explanation: The standard normal cumulative distribution function can be computed from the complementary error function. This relation allows numerically stable p-value computation in C++ using std::erfc.

Two-sided z-test p-value

p_two-sided = 2(1 − Φ(|z|)) = erfc(|z|/√2)

Explanation: For a symmetric normal distribution, the two-sided p-value doubles the upper-tail probability beyond |z|. Using erfc avoids subtractive cancellation for large |z|.

Central Limit Theorem (standardized mean)

(X̄ − μ) / (σ/√n) → N(0, 1) in distribution, as n → ∞

Explanation: As sample size grows, the standardized sample mean approaches a standard normal distribution. This justifies z-based inference even when underlying data are not exactly normal.

Bonferroni correction

α′ = α / m

Explanation: When testing m hypotheses, using α′ per test controls the family-wise error rate at α. It is simple and conservative, guarding against multiple-comparison inflation.

Complexity Analysis

Computationally, most parametric hypothesis tests reduce to linear-time passes over the data to compute sufficient statistics (means, variances, counts). For a one-sample z-test or t-test, computing the sample mean and (optionally) variance requires O(n) time and O(1) extra space using streaming formulas; evaluating the p-value via a closed-form CDF (normal or t) is O(1). Chi-square tests for k categories take O(k) to accumulate counts and compute the statistic. Exact binomial tests require summing probabilities across 0..n; with a numerically stable implementation using log-gamma, this is O(n) time and O(1) space, though large n may need careful floating-point handling to avoid underflow.

Resampling methods trade assumptions for computation. A permutation test with B random shuffles over two groups of sizes n1 and n2 typically costs O(B·(n1 + n2)) time and O(n1 + n2) space, since each shuffle recomputes the test statistic from scratch. Variance-reduction tricks (e.g., precomputing the combined sum and updating partial sums) keep the per-iteration cost linear. The Monte Carlo p-value’s standard error is approximately √(p(1 − p)/B), so achieving tighter precision requires larger B, increasing runtime proportionally. Parallelization is often straightforward because each shuffle is independent.

Memory usage across all these methods is small—usually dominated by storing the input samples—since statistics can be computed in a single pass. Numerical stability is a practical consideration: prefer log-domain computations for combinatorial probabilities (binomial, hypergeometric) and use functions like erfc for normal tails to avoid catastrophic cancellation for extreme z-scores.

Code Examples

One-sample two-sided z-test for the mean (known σ)
#include <bits/stdc++.h>
using namespace std;

// Compute standard normal CDF using erfc (numerically stable)
// Phi(z) = 0.5 * erfc(-z / sqrt(2))
static inline double normal_cdf(double z) {
    return 0.5 * erfc(-z / M_SQRT2); // M_SQRT2 = sqrt(2)
}

struct ZTestResult {
    double mean;
    double z;
    double p_two_sided;
};

ZTestResult z_test_mean_known_sigma(const vector<double>& x, double mu0, double sigma) {
    if (x.empty()) throw invalid_argument("Sample must be non-empty");
    if (!(sigma > 0)) throw invalid_argument("Sigma must be positive");

    // Compute sample mean in O(n)
    double sum = accumulate(x.begin(), x.end(), 0.0);
    double mean = sum / x.size();

    // Standard error and z-statistic
    double se = sigma / sqrt((double)x.size());
    double z = (mean - mu0) / se;

    // Two-sided p-value: p = 2*(1 - Phi(|z|)) = erfc(|z|/sqrt(2))
    double p_two = erfc(fabs(z) / M_SQRT2);

    return {mean, z, p_two};
}

int main() {
    // Example: test H0: mu = 10 with known sigma = 2
    vector<double> data = {9.5, 10.4, 9.8, 10.1, 10.7, 9.9, 10.3, 9.6};
    double mu0 = 10.0;
    double sigma = 2.0;

    auto res = z_test_mean_known_sigma(data, mu0, sigma);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Sample mean = " << res.mean << "\n";
    cout << "z-statistic = " << res.z << "\n";
    cout << "Two-sided p-value = " << res.p_two_sided << "\n";

    double alpha = 0.05;
    cout << (res.p_two_sided < alpha ? "Reject H0 at alpha=0.05" : "Fail to reject H0 at alpha=0.05") << "\n";
    return 0;
}

Computes a two-sided z-test for the mean when the population standard deviation σ is known. It calculates the sample mean, standardizes it as a z-score, and uses the complementary error function to obtain a numerically stable p-value. This is appropriate for large samples or when σ is known from prior knowledge.

Time: O(n) · Space: O(1) extra space (O(n) to store input)
Exact two-sided binomial test for a proportion
#include <bits/stdc++.h>
using namespace std;

// Log of binomial PMF using lgamma for stability
static inline long double log_binom_pmf(int n, int k, long double p) {
    if (k < 0 || k > n) return -numeric_limits<long double>::infinity();
    if (p == 0.0L) return (k == 0) ? 0.0L : -numeric_limits<long double>::infinity();
    if (p == 1.0L) return (k == n) ? 0.0L : -numeric_limits<long double>::infinity();
    long double lnC = lgammal(n + 1.0L) - lgammal(k + 1.0L) - lgammal((n - k) + 1.0L);
    return lnC + k * logl(p) + (n - k) * logl(1.0L - p);
}

struct BinomialTestResult {
    int n, k;
    long double p0;
    long double p_value_two_sided;
};

BinomialTestResult exact_binomial_test_two_sided(int n, int k_obs, long double p0) {
    if (n <= 0) throw invalid_argument("n must be positive");
    if (k_obs < 0 || k_obs > n) throw invalid_argument("k must be in [0,n]");
    if (!(p0 > 0.0L && p0 < 1.0L)) throw invalid_argument("p0 must be in (0,1)");

    long double logp_obs = log_binom_pmf(n, k_obs, p0);

    // Conservative two-sided p-value: sum probabilities with pmf <= pmf_observed
    long double p_sum = 0.0L;
    for (int k = 0; k <= n; ++k) {
        long double lp = log_binom_pmf(n, k, p0);
        if (lp <= logp_obs + 1e-18L) { // tiny tolerance
            p_sum += expl(lp);
        }
    }

    // Clamp due to numerical rounding
    p_sum = min<long double>(1.0L, max<long double>(0.0L, p_sum));
    return {n, k_obs, p0, p_sum};
}

int main() {
    // Example: 20 trials, 7 successes, test H0: p = 0.25
    int n = 20, k = 7; long double p0 = 0.25L;
    auto res = exact_binomial_test_two_sided(n, k, p0);

    cout.setf(ios::fixed); cout << setprecision(10);
    cout << "n = " << res.n << ", k = " << res.k << ", p0 = " << (double)res.p0 << "\n";
    cout << "Two-sided exact p-value = " << (double)res.p_value_two_sided << "\n";

    double alpha = 0.05;
    cout << (res.p_value_two_sided < alpha ? "Reject H0" : "Fail to reject H0") << " at alpha=0.05\n";
    return 0;
}

Implements the exact binomial test for a proportion by summing probabilities of outcomes at least as unlikely as the observed count under H0. It uses log-gamma for numerical stability, then exponentiates to sum in probability space. This is accurate for small and moderate n and does not rely on normal approximations.

Time: O(n) to scan all k from 0 to n · Space: O(1)
Permutation test for difference in means (two-sided, Monte Carlo)
#include <bits/stdc++.h>
using namespace std;

struct PermTestResult {
    double observed_abs_diff;
    double p_value_two_sided;
};

static double mean_of(const vector<double>& v) {
    return accumulate(v.begin(), v.end(), 0.0) / max<size_t>(1, v.size());
}

PermTestResult permutation_test_diff_means(const vector<double>& A, const vector<double>& B, int B_iters, uint64_t seed = 1234567ULL) {
    if (A.empty() || B.empty()) throw invalid_argument("Both groups must be non-empty");
    if (B_iters <= 0) throw invalid_argument("Number of permutations must be positive");

    // Observed statistic: absolute difference in means
    double obs = fabs(mean_of(A) - mean_of(B));

    // Combine data
    vector<double> all; all.reserve(A.size() + B.size());
    all.insert(all.end(), A.begin(), A.end());
    all.insert(all.end(), B.begin(), B.end());

    size_t n1 = A.size(); size_t n2 = B.size();

    // Random generator
    mt19937_64 rng(seed);

    int extreme = 0;
    vector<double> tmp = all; // will be shuffled in-place

    for (int it = 0; it < B_iters; ++it) {
        shuffle(tmp.begin(), tmp.end(), rng);

        // Compute means of the first n1 and remaining n2 elements
        double sum1 = 0.0; for (size_t i = 0; i < n1; ++i) sum1 += tmp[i];
        double sum2 = 0.0; for (size_t i = n1; i < n1 + n2; ++i) sum2 += tmp[i];
        double diff = fabs((sum1 / n1) - (sum2 / n2));

        if (diff >= obs - 1e-15) ++extreme; // count as or more extreme
    }

    // Monte Carlo p-value with +1 correction for stability
    double p = (extreme + 1.0) / (B_iters + 1.0);
    return {obs, p};
}

int main() {
    vector<double> A = {10.1, 9.9, 10.3, 10.0, 9.8};
    vector<double> B = {10.7, 10.5, 10.6, 10.4};

    int permutations = 20000; // increase for more precision
    auto res = permutation_test_diff_means(A, B, permutations, 42);

    cout.setf(ios::fixed); cout << setprecision(6);
    cout << "Observed |mean(A) - mean(B)| = " << res.observed_abs_diff << "\n";
    cout << "Two-sided permutation p-value (Monte Carlo) = " << res.p_value_two_sided << "\n";

    double alpha = 0.05;
    cout << (res.p_value_two_sided < alpha ? "Reject H0 (equal means)" : "Fail to reject H0 (equal means)") << " at alpha=0.05\n";
    return 0;
}

Performs a two-sided permutation test comparing means of two groups by shuffling labels and recomputing the difference in means. This approach requires only exchangeability under H0 and is robust when normality or equal-variance assumptions are doubtful. The Monte Carlo p-value converges to the true permutation p-value as the number of shuffles increases.

Time: O(B · (n1 + n2)), where B is the number of permutations · Space: O(n1 + n2)
Tags: hypothesis testing · null hypothesis · alternative hypothesis · p-value · significance level · z-test · t-test · binomial test · permutation test · type I error · type II error · power · confidence interval · multiple testing · A/B testing