Math · Intermediate

Bayes' Theorem

Key Points

  • Bayes' Theorem tells you how to update the probability of a hypothesis after seeing new evidence.
  • It combines prior belief P(A) with the likelihood of the evidence P(B|A) and normalizes by the total probability of the evidence P(B).
  • In practice you often use the law of total probability to compute P(B) when multiple hypotheses are possible.
  • Naive Bayes assumes features are conditionally independent given the class, making P(x1,...,xd|C) factor into a product that is easy to compute.
  • Numerical stability is crucial; compute in log-space and use log-sum-exp to avoid underflow with many features or tiny probabilities.
  • Base rates matter: even accurate tests can yield many false positives when the condition is rare.
  • Bayesian updating can be done iteratively as new evidence arrives, which is useful in streaming and online settings.
  • In competitive programming, careful normalization and precision handling often decide correctness on Bayesian inference tasks.

Prerequisites

  • Basic probability and conditional probability: Bayes' Theorem directly relates conditional probabilities and requires comfort with P(A), P(B|A), and complements.
  • Summations and combinatorics: total probability and Beta-Binomial formulas involve sums and combinations.
  • Logarithms and exponentials: stable computation relies on log-space and the log-sum-exp identity.
  • C++ vectors, maps, and strings: implementing Naive Bayes and hypothesis enumeration uses these data structures.
  • Floating-point precision: understanding underflow/overflow and numerical stability is essential for probability computations.

Detailed Explanation


01. Overview

Bayes' Theorem is a rule for updating beliefs in light of new evidence. Imagine you have a hypothesis A (e.g., an email is spam), and you observe evidence B (e.g., the email contains the word "prize"). Bayes' Theorem tells you how to compute the updated probability P(A|B) by blending how plausible A was before seeing B (the prior, P(A)) with how compatible B is with A (the likelihood, P(B|A)), and normalizing by how likely B is overall (the evidence, P(B)). This single equation underlies a wide range of applications: medical test interpretation, spam filtering, recommender systems, A/B testing, and probabilistic reasoning problems in programming contests. The theorem connects conditional probabilities in both directions (A given B and B given A) and clarifies why base rates matter. In computation, we often handle multiple hypotheses A_i and compute a full posterior distribution P(A_i|B) across them. Practical implementations must guard against floating-point underflow; therefore, probabilities are frequently manipulated in logarithmic form, and normalization uses the log-sum-exp trick. Bayes' Theorem is not just a formula; it is a thinking framework: start with a prior, gather evidence via a likelihood model, and produce a posterior that becomes the new prior for future updates.

02. Intuition & Analogies

Think of a detective updating a suspect list as clues arrive. Initially, each suspect has a prior level of suspicion. When a new clue B (like a fingerprint) appears, it might be very likely if a certain suspect A is guilty (high P(B|A)), but perhaps less likely if someone else did it. The detective weighs the plausibility of the clue under each suspect and updates their beliefs accordingly. The total number of people who could leave that fingerprint matters too; this is the base rate, encoded in P(B). Another analogy: weather and umbrellas. Suppose you believe there’s a 10% chance of rain (prior). You look outside and see wet streets (evidence). Wet streets are more likely if it rains (high P(wet|rain)) but can also occur due to sprinklers (non-zero P(wet|no rain)). Bayes says: multiply how much the evidence favors rain by how much you initially thought it would rain, then normalize by how often you’d see wet streets regardless. In medical testing, a very accurate test for a rare disease can still produce many false alarms, because most positives come from the large pool of healthy people—this is the base-rate effect. Computationally, imagine slicing a pie of prior belief among hypotheses, then stretching each slice by how well the evidence supports it; finally, you rescale so the slices sum to 1. That rescaling is the normalization by P(B). This mental model—prior slice × evidence stretch → renormalize—captures the essence of Bayesian updating.

03. Formal Definition

For events A and B with P(B) > 0, Bayes’ Theorem states:

P(A|B) = P(B|A) · P(A) / P(B)

Here P(A) is the prior, P(B|A) is the likelihood of observing B if A is true, and P(B) is the evidence, computed as P(B) = P(B|A)P(A) + P(B|¬A)P(¬A) in the binary case, or more generally by the law of total probability over a partition {A1, ..., Am}: P(B) = Σi P(B|Ai)P(Ai). In continuous settings with densities, the theorem becomes f(a|b) = f(b|a)f(a) / ∫ f(b|a′)f(a′) da′, with integrals replacing sums. A useful odds form is P(A|B)/P(¬A|B) = [P(A)/P(¬A)] · [P(B|A)/P(B|¬A)], where the second factor is the likelihood ratio (Bayes factor). Under the Naive Bayes assumption of conditional independence of features x1, ..., xd given class C, the joint likelihood factors as P(x1, ..., xd | C) = P(x1|C) · P(x2|C) · ... · P(xd|C), enabling efficient computation.

04. When to Use

Use Bayes’ Theorem whenever you want to update a probability based on new information and you can model how likely that information is under different hypotheses. Typical cases include: interpreting diagnostic tests (compute disease probability given a positive test), spam or sentiment classification with Naive Bayes (combine word-based evidence across many features), online learning or streaming updates (treat each observation as new evidence and update incrementally), and decision-making under uncertainty (compare posterior odds and act when they exceed a threshold). In competitive programming, turn problems framed as “given priors and conditional probabilities, find P(A|B)” into structured calculations: compute likelihoods, apply the total probability for normalization, and return the posterior with careful floating-point handling. When multiple mutually exclusive hypotheses exist (e.g., which biased die generated the sequence), Bayes provides a clean posterior over them. Bayesian conjugate priors (like Beta-Bernoulli) are ideal for coin-flip style problems, providing closed-form updates and predictions. Prefer Bayes when prior knowledge meaningfully shapes decisions, evidence arrives sequentially, or you need calibrated probabilities rather than hard classifications.

⚠️ Common Mistakes

Common pitfalls include:

  1. Ignoring base rates: people often conflate P(A|B) with P(B|A), overestimating the posterior when the condition is rare. Always compute or approximate P(B).
  2. Double counting evidence: using the same observation twice in a product inflates confidence; ensure features are distinct, or account for dependence.
  3. Violating Naive Bayes independence without caution: correlated features can overweight evidence. Mitigate by feature selection, decorrelation, or using models that capture dependence.
  4. Numerical underflow: multiplying many small probabilities can collapse to zero. Work in log-space and use log-sum-exp for normalization.
  5. Zero probabilities: unseen features give zero likelihood, killing a hypothesis. Use Laplace/additive smoothing.
  6. Misinterpreting posterior odds as accuracy: a high P(A|B) doesn’t mean your classifier is accurate overall; evaluate with proper metrics and consider priors.
  7. Overconfident priors: very strong priors can swamp data; test sensitivity to priors.
  8. Rounding early: keep double precision and normalize at the end to preserve accuracy.
  9. Forgetting to normalize across all hypotheses in multi-hypothesis problems: always divide by the total evidence sum.
  10. Mixing counts and probabilities inconsistently: ensure you convert counts to probabilities correctly, and keep track of whether values are in probability or log-probability space.

Key Formulas

Bayes' Theorem

P(A|B) = P(B|A) · P(A) / P(B)

Explanation: Posterior equals prior times likelihood divided by evidence. Use this to update a hypothesis after observing evidence.

Law of Total Probability

P(B) = Σi P(B|Ai) · P(Ai)

Explanation: The probability of the evidence is the sum over all mutually exclusive hypotheses of likelihood times prior. This normalizes the posterior.

Discrete Posterior

P(Ak|B) = P(B|Ak) P(Ak) / Σi P(B|Ai) P(Ai)

Explanation: For multiple hypotheses, compute the unnormalized weight of each and divide by the total to get a proper distribution.

Odds-Likelihood Form

P(A|B) / P(¬A|B) = [P(A) / P(¬A)] · [P(B|A) / P(B|¬A)]

Explanation: Posterior odds equal prior odds times the likelihood ratio (Bayes factor). Useful for sequential updates and threshold decisions.

Naive Bayes Posterior (proportional)

P(C | x1, ..., xd) ∝ P(C) · Πj P(xj|C)

Explanation: Assuming conditional independence, the joint likelihood factors into a product. Normalize over classes to get final probabilities.

Log-Sum-Exp Trick

log Σi exp(xi) = m + log Σi exp(xi − m), where m = maxi xi

Explanation: Recenter exponents by subtracting the maximum to avoid overflow/underflow, then add it back after summation in log-space.

Two-Hypothesis Update

posterior odds = K · prior odds, where K = P(B|H1) / P(B|H2); P(H1|B) = posterior odds / (1 + posterior odds)

Explanation: When comparing two hypotheses, the Bayes factor K scales the prior odds to get posterior odds, which convert to a probability.

Beta-Bernoulli Update

Beta(α, β) → Beta(α + s, β + f)

Explanation: With a Beta(α, β) prior and s successes, f failures, the posterior is Beta(α + s, β + f). This gives closed-form sequential updates.

Posterior Predictive (One-Step)

P(next success | data) = (α + s) / (α + s + β + f)

Explanation: Given a Beta posterior, the probability the next Bernoulli trial is a success is the posterior mean.

Beta-Binomial Predictive

P(s successes in n trials) = C(n, s) · B(α + s, β + f) / B(α, β)

Explanation: The marginal likelihood of observing s successes in n trials under a Beta prior. Useful for evidence and Bayes factor calculations.

Complexity Analysis

Computing a single binary Bayes update is O(1) time and O(1) space, since it involves a few arithmetic operations. For multiple hypotheses {A1, ..., Am}, a straightforward posterior computation requires evaluating each unnormalized weight P(B|Ai)P(Ai) and normalizing: this costs O(m) time and O(m) space for m hypotheses. In practice, when evidence is composed of many independent features (Naive Bayes), evaluating P(B|Ai) becomes a product over d features, leading to O(m·d) time per update. To prevent numerical underflow from multiplying small probabilities, implementations switch to log-space: summations of log-likelihoods stay stable and maintain the same asymptotic cost. Normalization via log-sum-exp also runs in O(m). For text classification with Naive Bayes, training requires building counts for a vocabulary of size V across C classes from N tokens, which is O(N) time and O(V·C) space; prediction for a document with L tokens is O(L·C) time and O(1) additional space beyond the model. In conjugate-prior models like Beta-Bernoulli, updates are O(1) per observation (just increment α or β) and O(1) space to store the parameters; computing marginal likelihoods using log-gamma functions is O(1) as well. Under competitive programming constraints, m and d are often up to 1e5; using logs, careful memory layout (vectors), and avoiding maps in tight loops can meet time limits. The main performance pitfalls are hash-map overhead in tokenization, repeated normalization inside loops (normalize once per evidence set), and excessive precision conversions.

Code Examples

Medical test interpretation with Bayes' Theorem (binary case)
#include <bits/stdc++.h>
using namespace std;

// Computes posterior probabilities for disease given a positive/negative test.
// Demonstrates Bayes' Theorem and the law of total probability.
int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example parameters
    // Prevalence (prior): P(Disease)
    double P_A = 0.01; // 1% prevalence
    // Sensitivity: P(Test+ | Disease)
    double P_B_given_A = 0.95; // 95%
    // Specificity: P(Test- | No Disease) = 0.98
    double specificity = 0.98;
    // Therefore, P(Test+ | No Disease) = 1 - specificity
    double P_B_given_notA = 1.0 - specificity; // 0.02

    // Evidence (law of total probability): P(Test+)
    double P_B = P_B_given_A * P_A + P_B_given_notA * (1.0 - P_A);

    // Posterior: P(Disease | Test+)
    double P_A_given_B = (P_B_given_A * P_A) / P_B;

    // For completeness, compute P(Disease | Test-)
    double P_notB_given_A = 1.0 - P_B_given_A;  // P(Test- | Disease)
    double P_notB_given_notA = specificity;     // P(Test- | No Disease)
    double P_notB = P_notB_given_A * P_A + P_notB_given_notA * (1.0 - P_A);
    double P_A_given_notB = (P_notB_given_A * P_A) / P_notB;

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "P(Test+) = " << P_B << "\n";
    cout << "P(Disease | Test+) = " << P_A_given_B << "\n";
    cout << "P(Disease | Test-) = " << P_A_given_notB << "\n";

    // Odds form demonstration for Test+
    double prior_odds = P_A / (1.0 - P_A);
    double likelihood_ratio = P_B_given_A / P_B_given_notA;
    double posterior_odds = prior_odds * likelihood_ratio;
    double posterior_prob_via_odds = posterior_odds / (1.0 + posterior_odds);
    cout << "Posterior via odds (sanity check) = " << posterior_prob_via_odds << "\n";

    return 0;
}

Computes the posterior probability of disease given a positive (and negative) test using Bayes' Theorem. It shows how to get P(B) using the law of total probability and cross-checks the posterior using the odds-likelihood form.

Time: O(1). Space: O(1).
Naive Bayes text classifier with Laplace smoothing (toy spam filter)
#include <bits/stdc++.h>
using namespace std;

// Simple tokenizer: splits on non-letters and lowercases.
vector<string> tokenize(const string &s) {
    vector<string> tokens;
    string w;
    for (char ch : s) {
        if (isalpha(static_cast<unsigned char>(ch))) {
            w.push_back(static_cast<char>(tolower(static_cast<unsigned char>(ch))));
        } else if (!w.empty()) {
            tokens.push_back(w);
            w.clear();
        }
    }
    if (!w.empty()) tokens.push_back(w);
    return tokens;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Toy training data: pair<label, message>
    vector<pair<string,string>> train = {
        {"spam", "Win a PRIZE now! Click here to claim"},
        {"spam", "Limited offer! Winner gets cash prize"},
        {"ham", "Lunch at noon? Meet at the cafe"},
        {"ham", "Project meeting moved to next week"},
        {"spam", "Claim your reward, limited time only"},
        {"ham", "Don't forget to bring the report"}
    };

    // Count words per class and total words per class
    unordered_map<string, int> classDocCount;         // number of documents per class
    unordered_map<string, long long> classTokenCount; // total tokens per class
    unordered_map<string, unordered_map<string, long long>> wordCount; // class -> word -> count
    unordered_set<string> vocabulary;

    for (auto &ex : train) {
        const string &label = ex.first;
        const string &msg = ex.second;
        classDocCount[label]++;
        vector<string> toks = tokenize(msg);
        classTokenCount[label] += (long long)toks.size();
        for (auto &w : toks) {
            wordCount[label][w]++;
            vocabulary.insert(w);
        }
    }

    // Compute class priors
    int totalDocs = (int)train.size();
    vector<string> classes;
    for (auto &kv : classDocCount) classes.push_back(kv.first);

    // Laplace smoothing parameter
    const double alpha = 1.0; // add-one smoothing
    const int V = (int)vocabulary.size();

    auto logP_class = [&](const string &cls) {
        return log((double)classDocCount[cls]) - log((double)totalDocs);
    };

    auto logP_word_given_class = [&](const string &w, const string &cls) {
        long long cnt_wc = 0;
        auto it = wordCount[cls].find(w);
        if (it != wordCount[cls].end()) cnt_wc = it->second;
        double num = cnt_wc + alpha;
        double den = (double)classTokenCount[cls] + alpha * V;
        return log(num) - log(den); // same as log(num/den)
    };

    auto classify = [&](const string &msg) {
        vector<string> toks = tokenize(msg);
        // Count tokens (multinomial NB)
        unordered_map<string,int> tf;
        for (auto &w : toks) tf[w]++;

        double bestLog = -1e300;
        string best = "";
        for (auto &cls : classes) {
            double lp = logP_class(cls);
            for (auto &kv : tf) {
                const string &w = kv.first;
                int c = kv.second;
                lp += c * logP_word_given_class(w, cls);
            }
            if (lp > bestLog) {
                bestLog = lp;
                best = cls;
            }
        }
        return best;
    };

    vector<string> tests = {
        "You are a winner! Claim your cash prize now",
        "Are we still on for lunch tomorrow?",
        "The report deadline is extended",
        "Click to win limited time reward"
    };

    for (auto &msg : tests) {
        cout << "Message: '" << msg << "' => predicted: " << classify(msg) << "\n";
    }

    return 0;
}

Implements a small Naive Bayes text classifier. It learns word counts per class with Laplace smoothing, computes class priors, and scores messages in log-space to avoid underflow. It predicts the class with the highest posterior.

Time: training O(N) over total tokens; prediction per message O(L · C). Space: O(V · C) for word counts.
Multi-hypothesis posterior with log-sum-exp normalization
#include <bits/stdc++.h>
using namespace std;

// Compute log-sum-exp of a vector of log-values for numerical stability.
double logsumexp(const vector<double> &x) {
    double m = -numeric_limits<double>::infinity();
    for (double v : x) m = max(m, v);
    double s = 0.0;
    for (double v : x) s += exp(v - m);
    return m + log(s);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Hypotheses: different dice models explaining observed rolls
    // H0: fair die (1/6 each)
    // H1: loaded die favoring 6 (p6=0.4, others equal among remaining 0.6)
    // H2: loaded die favoring 1 (p1=0.35, others equal among remaining 0.65)

    vector<vector<double>> models = {
        {1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0},
        {0.12, 0.12, 0.12, 0.12, 0.12, 0.40},
        {0.35, 0.13, 0.13, 0.13, 0.13, 0.13}
    };
    vector<double> prior = {0.6, 0.2, 0.2}; // priors over hypotheses H0, H1, H2

    // Observed evidence: sequence of die rolls (1..6)
    vector<int> rolls = {6,6,2,6,5,6,1,6,6,3,6};

    int m = (int)models.size();
    vector<double> log_prior(m), log_lik(m), log_post(m);

    for (int i = 0; i < m; ++i) log_prior[i] = log(prior[i]);

    // Compute log-likelihood for each hypothesis: sum of log p(roll|Hi)
    for (int i = 0; i < m; ++i) {
        double s = 0.0;
        for (int r : rolls) {
            double p = models[i][r-1];
            s += log(p);
        }
        log_lik[i] = s;
    }

    // Unnormalized log-posterior = log prior + log likelihood
    vector<double> log_unnorm(m);
    for (int i = 0; i < m; ++i) log_unnorm[i] = log_prior[i] + log_lik[i];

    // Normalize using log-sum-exp
    double lse = logsumexp(log_unnorm);
    for (int i = 0; i < m; ++i) log_post[i] = log_unnorm[i] - lse;

    cout.setf(std::ios::fixed); cout << setprecision(6);
    for (int i = 0; i < m; ++i) {
        cout << "Posterior P(H" << i << "|data) = " << exp(log_post[i]) << "\n";
    }

    return 0;
}

Computes posteriors over multiple hypotheses (three dice models) using log-probabilities and log-sum-exp for stable normalization. This pattern generalizes to many Bayesian inference tasks in contests.

Time: O(m · n), where m is the number of hypotheses and n the number of rolls. Space: O(m).
Beta-Bernoulli conjugate update and marginal likelihood
#include <bits/stdc++.h>
using namespace std;

// Log of Beta function via log-gamma: log B(a,b) = lgamma(a) + lgamma(b) - lgamma(a+b)
double logBeta(double a, double b) {
    return lgamma(a) + lgamma(b) - lgamma(a + b);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Prior: Beta(alpha, beta) over Bernoulli success probability p
    double alpha = 2.0, beta = 2.0; // mildly informative prior centered at 0.5

    // Observations: s successes, f failures
    vector<int> data = {1,0,1,1,0,1,1,1,0,1}; // 1=success, 0=failure
    int s = 0, f = 0;
    for (int x : data) (x ? s : f)++;
    int n = s + f;

    // Posterior parameters
    double alpha_post = alpha + s;
    double beta_post = beta + f;

    // Posterior mean and MAP (if alpha_post,beta_post > 1)
    double posterior_mean = alpha_post / (alpha_post + beta_post);
    double posterior_map = (alpha_post > 1.0 && beta_post > 1.0) ?
        (alpha_post - 1.0) / (alpha_post + beta_post - 2.0) :
        numeric_limits<double>::quiet_NaN();

    // Marginal likelihood of observing s successes in n trials under the Beta prior (Beta-Binomial)
    // P(data) = C(n,s) * B(alpha+s, beta+f) / B(alpha, beta)
    auto logChoose = [&](int nn, int kk){ return lgamma(nn + 1.0) - lgamma(kk + 1.0) - lgamma(nn - kk + 1.0); };
    double log_marginal = logChoose(n, s) + logBeta(alpha + s, beta + f) - logBeta(alpha, beta);

    // Posterior predictive probability that next trial is a success
    double p_next_success = alpha_post / (alpha_post + beta_post);

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "s = " << s << ", f = " << f << "\n";
    cout << "Posterior Beta(alpha', beta') = Beta(" << alpha_post << ", " << beta_post << ")\n";
    cout << "Posterior mean p = " << posterior_mean << "\n";
    cout << "Posterior MAP p = " << posterior_map << "\n";
    cout << "log P(data) (marginal likelihood) = " << log_marginal << "\n";
    cout << "P(next success) = " << p_next_success << "\n";

    return 0;
}

Shows closed-form Bayesian updating for Bernoulli data with a Beta prior. It computes the posterior parameters, posterior mean/MAP, the marginal likelihood via the Beta-Binomial formula, and the posterior predictive probability of the next success.

Time: O(n) to scan observations, O(1) for updates and closed-form computations. Space: O(1).