Bayes' Theorem
Key Points
- Bayes' Theorem tells you how to update the probability of a hypothesis after seeing new evidence.
- It combines prior belief P(A) with the likelihood of the evidence P(B|A) and normalizes by the total probability of the evidence P(B).
- In practice you often use the law of total probability to compute P(B) when multiple hypotheses are possible.
- Naive Bayes assumes features are conditionally independent given the class, making P(x1,...,xd|C) factor into a product that is easy to compute.
- Numerical stability is crucial; compute in log-space and use log-sum-exp to avoid underflow with many features or tiny probabilities.
- Base rates matter: even accurate tests can yield many false positives when the condition is rare.
- Bayesian updating can be done iteratively as new evidence arrives, which is useful in streaming and online settings.
- In competitive programming, careful normalization and precision handling often decide correctness on Bayesian inference tasks.
Prerequisites
- Basic probability and conditional probability — Bayes' Theorem directly relates conditional probabilities and requires comfort with P(A), P(B|A), and complements.
- Summations and combinatorics — Total probability and Beta-Binomial formulas involve sums and combinations.
- Logarithms and exponentials — Stable computation relies on log-space and the log-sum-exp identity.
- C++ vectors, maps, and strings — Implementing Naive Bayes and hypothesis enumeration uses these data structures.
- Floating-point precision — Understanding underflow/overflow and numerical stability is essential for probability computations.
Detailed Explanation
01 Overview
Bayes' Theorem is a rule for updating beliefs in light of new evidence. Imagine you have a hypothesis A (e.g., an email is spam), and you observe evidence B (e.g., the email contains the word "prize"). Bayes' Theorem tells you how to compute the updated probability P(A|B) by blending how plausible A was before seeing B (the prior, P(A)) with how compatible B is with A (the likelihood, P(B|A)), and normalizing by how likely B is overall (the evidence, P(B)). This single equation underlies a wide range of applications: medical test interpretation, spam filtering, recommender systems, A/B testing, and probabilistic reasoning problems in programming contests. The theorem connects conditional probabilities in both directions (A given B and B given A) and clarifies why base rates matter. In computation, we often handle multiple hypotheses A_i and compute a full posterior distribution P(A_i|B) across them. Practical implementations must guard against floating-point underflow; therefore, probabilities are frequently manipulated in logarithmic form, and normalization uses the log-sum-exp trick. Bayes' Theorem is not just a formula; it is a thinking framework: start with a prior, gather evidence via a likelihood model, and produce a posterior that becomes the new prior for future updates.
02 Intuition & Analogies
Think of a detective updating a suspect list as clues arrive. Initially, each suspect has a prior level of suspicion. When a new clue B (like a fingerprint) appears, it might be very likely if a certain suspect A is guilty (high P(B|A)), but perhaps less likely if someone else did it. The detective weighs the plausibility of the clue under each suspect and updates their beliefs accordingly. The total number of people who could leave that fingerprint matters too; this is the base rate, encoded in P(B). Another analogy: weather and umbrellas. Suppose you believe there’s a 10% chance of rain (prior). You look outside and see wet streets (evidence). Wet streets are more likely if it rains (high P(wet|rain)) but can also occur due to sprinklers (non-zero P(wet|no rain)). Bayes says: multiply how much the evidence favors rain by how much you initially thought it would rain, then normalize by how often you’d see wet streets regardless. In medical testing, a very accurate test for a rare disease can still produce many false alarms, because most positives come from the large pool of healthy people—this is the base-rate effect. Computationally, imagine slicing a pie of prior belief among hypotheses, then stretching each slice by how well the evidence supports it; finally, you rescale so the slices sum to 1. That rescaling is the normalization by P(B). This mental model—prior slice × evidence stretch → renormalize—captures the essence of Bayesian updating.
03 Formal Definition
For events A and B with P(B) > 0, Bayes' Theorem states P(A|B) = P(B|A) · P(A) / P(B). More generally, for mutually exclusive and exhaustive hypotheses A_1, ..., A_n, the posterior for each hypothesis is P(A_i|B) = P(B|A_i) · P(A_i) / Σ_j P(B|A_j) · P(A_j), where the denominator is the law of total probability applied to the evidence B.
04 When to Use
Use Bayes’ Theorem whenever you want to update a probability based on new information and you can model how likely that information is under different hypotheses. Typical cases include: interpreting diagnostic tests (compute disease probability given a positive test), spam or sentiment classification with Naive Bayes (combine word-based evidence across many features), online learning or streaming updates (treat each observation as new evidence and update incrementally), and decision-making under uncertainty (compare posterior odds and act when they exceed a threshold). In competitive programming, turn problems framed as “given priors and conditional probabilities, find P(A|B)” into structured calculations: compute likelihoods, apply the total probability for normalization, and return the posterior with careful floating-point handling. When multiple mutually exclusive hypotheses exist (e.g., which biased die generated the sequence), Bayes provides a clean posterior over them. Bayesian conjugate priors (like Beta-Bernoulli) are ideal for coin-flip style problems, providing closed-form updates and predictions. Prefer Bayes when prior knowledge meaningfully shapes decisions, evidence arrives sequentially, or you need calibrated probabilities rather than hard classifications.
⚠️ Common Mistakes
Common pitfalls include:
1. Ignoring base rates: people often conflate P(A|B) with P(B|A), overestimating the posterior when the condition is rare. Always compute or approximate P(B).
2. Double counting evidence: using the same observation twice in a product inflates confidence; ensure features are distinct, or account for dependence.
3. Violating Naive Bayes independence without caution: correlated features can overweight evidence. Mitigate by feature selection, decorrelation, or using models that capture dependence.
4. Numerical underflow: multiplying many small probabilities can collapse to zero. Work in log-space and use log-sum-exp for normalization.
5. Zero probabilities: unseen features give zero likelihood, killing a hypothesis. Use Laplace/additive smoothing.
6. Misinterpreting posterior odds as accuracy: a high P(A|B) doesn't mean your classifier is accurate overall; evaluate with proper metrics and consider priors.
7. Overconfident priors: very strong priors can swamp data; test sensitivity to priors.
8. Rounding early: keep double precision and normalize at the end to preserve accuracy.
9. Forgetting to normalize across all hypotheses in multi-hypothesis problems: always divide by the total evidence sum.
10. Mixing counts and probabilities inconsistently: convert counts to probabilities correctly, and keep track of whether values are in probability or log-probability space.
Key Formulas
Bayes' Theorem
P(A|B) = P(B|A) · P(A) / P(B)
Explanation: Posterior equals prior times likelihood divided by evidence. Use this to update a hypothesis after observing evidence.
Law of Total Probability
P(B) = Σ_i P(B|A_i) · P(A_i)
Explanation: The probability of the evidence is the sum over all mutually exclusive hypotheses of likelihood times prior. This normalizes the posterior.
Discrete Posterior
P(A_i|B) = P(B|A_i) · P(A_i) / Σ_j P(B|A_j) · P(A_j)
Explanation: For multiple hypotheses, compute the unnormalized weight of each and divide by the total to get a proper distribution.
Odds-Likelihood Form
O(A|B) = O(A) · P(B|A) / P(B|¬A), where O(A) = P(A) / (1 − P(A))
Explanation: Posterior odds equal prior odds times the likelihood ratio (Bayes factor). Useful for sequential updates and threshold decisions.
Naive Bayes Posterior (proportional)
P(C|x_1, ..., x_d) ∝ P(C) · ∏_{i=1}^{d} P(x_i|C)
Explanation: Assuming conditional independence, the joint likelihood factors into a product. Normalize over classes to get final probabilities.
Log-Sum-Exp Trick
log Σ_i e^{x_i} = m + log Σ_i e^{x_i − m}, where m = max_i x_i
Explanation: Recenter exponents by subtracting the maximum to avoid overflow/underflow, then add it back after summation in log-space.
Two-Hypothesis Update
O(H_1|B) = K · O(H_1), where K = P(B|H_1) / P(B|H_2); then P(H_1|B) = O(H_1|B) / (1 + O(H_1|B))
Explanation: When comparing two hypotheses, the Bayes factor K scales the prior odds to get posterior odds, which convert to a probability.
Beta-Bernoulli Update
Beta(α, β) prior + s successes, f failures → posterior Beta(α + s, β + f)
Explanation: With a Beta(α, β) prior and s successes, f failures, the posterior is Beta(α + s, β + f). This gives closed-form sequential updates.
Posterior Predictive (One-Step)
P(next success | data) = α′ / (α′ + β′) for posterior Beta(α′, β′)
Explanation: Given a Beta posterior, the probability the next Bernoulli trial is a success is the posterior mean.
Beta-Binomial Predictive
P(s successes in n trials) = C(n, s) · B(α + s, β + f) / B(α, β), where f = n − s
Explanation: The marginal likelihood of observing s successes in n trials under a Beta prior. Useful for evidence and Bayes factor calculations.
Complexity Analysis
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Computes posterior probabilities for disease given a positive/negative test.
// Demonstrates Bayes' Theorem and the law of total probability.
int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example parameters
    // Prevalence (prior): P(Disease)
    double P_A = 0.01;                         // 1% prevalence
    // Sensitivity: P(Test+ | Disease)
    double P_B_given_A = 0.95;                 // 95%
    // Specificity: P(Test- | No Disease) = 0.98
    double specificity = 0.98;
    // Therefore, P(Test+ | No Disease) = 1 - specificity
    double P_B_given_notA = 1.0 - specificity; // 0.02

    // Evidence (law of total probability): P(Test+)
    double P_B = P_B_given_A * P_A + P_B_given_notA * (1.0 - P_A);

    // Posterior: P(Disease | Test+)
    double P_A_given_B = (P_B_given_A * P_A) / P_B;

    // For completeness, compute P(Disease | Test-)
    double P_notB_given_A = 1.0 - P_B_given_A; // P(Test- | Disease)
    double P_notB_given_notA = specificity;    // P(Test- | No Disease)
    double P_notB = P_notB_given_A * P_A + P_notB_given_notA * (1.0 - P_A);
    double P_A_given_notB = (P_notB_given_A * P_A) / P_notB;

    cout.setf(std::ios::fixed);
    cout << setprecision(6);
    cout << "P(Test+) = " << P_B << "\n";
    cout << "P(Disease | Test+) = " << P_A_given_B << "\n";
    cout << "P(Disease | Test-) = " << P_A_given_notB << "\n";

    // Odds form demonstration for Test+
    double prior_odds = P_A / (1.0 - P_A);
    double likelihood_ratio = P_B_given_A / P_B_given_notA;
    double posterior_odds = prior_odds * likelihood_ratio;
    double posterior_prob_via_odds = posterior_odds / (1.0 + posterior_odds);
    cout << "Posterior via odds (sanity check) = " << posterior_prob_via_odds << "\n";

    return 0;
}
```
Computes the posterior probability of disease given a positive (and negative) test using Bayes' Theorem. It shows how to get P(B) using the law of total probability and cross-checks the posterior using the odds-likelihood form.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Simple tokenizer: splits on non-letters and lowercases.
vector<string> tokenize(const string &s) {
    vector<string> tokens;
    string w;
    for (char ch : s) {
        if (isalpha(static_cast<unsigned char>(ch))) {
            w.push_back(static_cast<char>(tolower(static_cast<unsigned char>(ch))));
        } else if (!w.empty()) {
            tokens.push_back(w);
            w.clear();
        }
    }
    if (!w.empty()) tokens.push_back(w);
    return tokens;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Toy training data: pair<label, message>
    vector<pair<string,string>> train = {
        {"spam", "Win a PRIZE now! Click here to claim"},
        {"spam", "Limited offer! Winner gets cash prize"},
        {"ham",  "Lunch at noon? Meet at the cafe"},
        {"ham",  "Project meeting moved to next week"},
        {"spam", "Claim your reward, limited time only"},
        {"ham",  "Don't forget to bring the report"}
    };

    // Count words per class and total words per class
    unordered_map<string, int> classDocCount;         // number of documents per class
    unordered_map<string, long long> classTokenCount; // total tokens per class
    unordered_map<string, unordered_map<string, long long>> wordCount; // class -> word -> count
    unordered_set<string> vocabulary;

    for (auto &ex : train) {
        const string &label = ex.first;
        const string &msg = ex.second;
        classDocCount[label]++;
        vector<string> toks = tokenize(msg);
        classTokenCount[label] += (long long)toks.size();
        for (auto &w : toks) {
            wordCount[label][w]++;
            vocabulary.insert(w);
        }
    }

    // Compute class priors
    int totalDocs = (int)train.size();
    vector<string> classes;
    for (auto &kv : classDocCount) classes.push_back(kv.first);

    // Laplace smoothing parameter
    const double alpha = 1.0; // add-one smoothing
    const int V = (int)vocabulary.size();

    auto logP_class = [&](const string &cls) {
        return log((double)classDocCount[cls]) - log((double)totalDocs);
    };

    auto logP_word_given_class = [&](const string &w, const string &cls) {
        long long cnt_wc = 0;
        auto it = wordCount[cls].find(w);
        if (it != wordCount[cls].end()) cnt_wc = it->second;
        double num = cnt_wc + alpha;
        double den = (double)classTokenCount[cls] + alpha * V;
        return log(num) - log(den); // same as log(num/den), but stable
    };

    auto classify = [&](const string &msg) {
        vector<string> toks = tokenize(msg);
        // Count tokens (multinomial NB)
        unordered_map<string,int> tf;
        for (auto &w : toks) tf[w]++;

        double bestLog = -1e300;
        string best = "";
        for (auto &cls : classes) {
            double lp = logP_class(cls);
            for (auto &kv : tf) {
                const string &w = kv.first;
                int c = kv.second;
                lp += c * logP_word_given_class(w, cls);
            }
            if (lp > bestLog) {
                bestLog = lp;
                best = cls;
            }
        }
        return best;
    };

    vector<string> tests = {
        "You are a winner! Claim your cash prize now",
        "Are we still on for lunch tomorrow?",
        "The report deadline is extended",
        "Click to win limited time reward"
    };

    for (auto &msg : tests) {
        cout << "Message: '" << msg << "' => predicted: " << classify(msg) << "\n";
    }

    return 0;
}
```
Implements a small Naive Bayes text classifier. It learns word counts per class with Laplace smoothing, computes class priors, and scores messages in log-space to avoid underflow. It predicts the class with the highest posterior.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute log-sum-exp of a vector of log-values for numerical stability.
double logsumexp(const vector<double> &x) {
    double m = -numeric_limits<double>::infinity();
    for (double v : x) m = max(m, v);
    double s = 0.0;
    for (double v : x) s += exp(v - m);
    return m + log(s);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Hypotheses: different dice models explaining observed rolls
    // H0: fair die (1/6 each)
    // H1: loaded die favoring 6 (p6=0.4, others equal among remaining 0.6)
    // H2: loaded die favoring 1 (p1=0.35, others equal among remaining 0.65)
    vector<vector<double>> models = {
        {1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0},
        {0.12, 0.12, 0.12, 0.12, 0.12, 0.40},
        {0.35, 0.13, 0.13, 0.13, 0.13, 0.13}
    };
    vector<double> prior = {0.6, 0.2, 0.2}; // priors over hypotheses H0, H1, H2

    // Observed evidence: sequence of die rolls (1..6)
    vector<int> rolls = {6,6,2,6,5,6,1,6,6,3,6};

    int m = (int)models.size();
    vector<double> log_prior(m), log_lik(m), log_post(m);

    for (int i = 0; i < m; ++i) log_prior[i] = log(prior[i]);

    // Compute log-likelihood for each hypothesis: sum of log p(roll|Hi)
    for (int i = 0; i < m; ++i) {
        double s = 0.0;
        for (int r : rolls) {
            double p = models[i][r-1];
            s += log(p);
        }
        log_lik[i] = s;
    }

    // Unnormalized log-posterior = log prior + log likelihood
    vector<double> log_unnorm(m);
    for (int i = 0; i < m; ++i) log_unnorm[i] = log_prior[i] + log_lik[i];

    // Normalize using log-sum-exp
    double lse = logsumexp(log_unnorm);
    for (int i = 0; i < m; ++i) log_post[i] = log_unnorm[i] - lse;

    cout.setf(std::ios::fixed);
    cout << setprecision(6);
    for (int i = 0; i < m; ++i) {
        cout << "Posterior P(H" << i << "|data) = " << exp(log_post[i]) << "\n";
    }

    return 0;
}
```
Computes posteriors over multiple hypotheses (three dice models) using log-probabilities and log-sum-exp for stable normalization. This pattern generalizes to many Bayesian inference tasks in contests.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Log of Beta function via log-gamma: log B(a,b) = lgamma(a) + lgamma(b) - lgamma(a+b)
double logBeta(double a, double b) {
    return lgamma(a) + lgamma(b) - lgamma(a + b);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Prior: Beta(alpha, beta) over Bernoulli success probability p
    double alpha = 2.0, beta = 2.0; // mildly informative prior centered at 0.5

    // Observations: s successes, f failures
    vector<int> data = {1,0,1,1,0,1,1,1,0,1}; // 1=success, 0=failure
    int s = 0, f = 0;
    for (int x : data) (x ? s : f)++;
    int n = s + f;

    // Posterior parameters
    double alpha_post = alpha + s;
    double beta_post = beta + f;

    // Posterior mean and MAP (if alpha_post, beta_post > 1)
    double posterior_mean = alpha_post / (alpha_post + beta_post);
    double posterior_map = (alpha_post > 1.0 && beta_post > 1.0) ?
        (alpha_post - 1.0) / (alpha_post + beta_post - 2.0) :
        numeric_limits<double>::quiet_NaN();

    // Marginal likelihood of observing s successes in n trials under the Beta prior (Beta-Binomial)
    // P(data) = C(n,s) * B(alpha+s, beta+f) / B(alpha, beta)
    auto logChoose = [&](int nn, int kk) {
        return lgamma(nn + 1.0) - lgamma(kk + 1.0) - lgamma(nn - kk + 1.0);
    };
    double log_marginal = logChoose(n, s) + logBeta(alpha + s, beta + f) - logBeta(alpha, beta);

    // Posterior predictive probability that next trial is a success
    double p_next_success = alpha_post / (alpha_post + beta_post);

    cout.setf(std::ios::fixed);
    cout << setprecision(6);
    cout << "s = " << s << ", f = " << f << "\n";
    cout << "Posterior Beta(alpha', beta') = Beta(" << alpha_post << ", " << beta_post << ")\n";
    cout << "Posterior mean p = " << posterior_mean << "\n";
    cout << "Posterior MAP p = " << posterior_map << "\n";
    cout << "log P(data) (marginal likelihood) = " << log_marginal << "\n";
    cout << "P(next success) = " << p_next_success << "\n";

    return 0;
}
```
Shows closed-form Bayesian updating for Bernoulli data with a Beta prior. It computes the posterior parameters, posterior mean/MAP, the marginal likelihood via the Beta-Binomial formula, and the posterior predictive probability of the next success.