Math · Intermediate

Probability Axioms & Rules

Key Points

  • Kolmogorov’s axioms define probability as a measure on events: non-negativity, normalization, and countable additivity.
  • From these axioms you can derive key rules like complements, inclusion–exclusion, conditional probability, and Bayes’ theorem.
  • Events are subsets of a sample space, and probabilities add for disjoint (mutually exclusive) events.
  • Conditional probability reweights the sample space to focus on a given condition, enabling the chain rule and Bayesian inference.
  • Independence means intersection probabilities factor into products; it is not the same as being mutually exclusive.
  • In code, finite probability spaces can be represented with maps and sets; Monte Carlo can approximate probabilities via simulation.
  • Always check normalization (probabilities sum to 1) and guard against floating-point error when verifying axioms numerically.
  • Use these rules to model games of chance, reliability, classification, and risk analysis, and to reason from data to causes.

Prerequisites

  • Set theory basics — Events are subsets; unions, intersections, complements are central.
  • Functions and measures — Probability is a measure mapping sets to [0,1].
  • Series and summations — Countable additivity involves infinite sums.
  • Basic combinatorics — Counting outcomes to build discrete probability models.
  • Conditional statements and logic — Understanding implications and case splits for total probability and Bayes.
  • C++ STL: vectors, sets, maps, RNG — Implementing finite spaces and Monte Carlo simulations.
  • Floating-point arithmetic — Handling rounding errors and tolerances in probability computations.

Detailed Explanation


01 Overview

Hook: Imagine running a fair game night—rolling dice, shuffling cards, or flipping coins—and needing a consistent way to assign chances to outcomes. You want rules that don’t contradict each other, regardless of the game or how many events you combine.

Concept: Probability axioms, introduced by Kolmogorov, give a rigorous foundation for assigning and combining probabilities. They treat probability as a measure on a collection of events (subsets of outcomes). With only three axioms—non-negativity, normalization, and countable additivity—you can derive most of the everyday rules students use: complement rule, inclusion–exclusion, conditional probability, chain rule, independence, and Bayes’ theorem.

Example: For a fair die, each face has probability 1/6. The event “even number” has probability 1/6 + 1/6 + 1/6 = 1/2, and the event “at least 5” has probability 1/3. The probability of their union can be computed by inclusion–exclusion: 1/2 + 1/3 − 1/6 = 2/3.

These axioms matter beyond games: they underpin machine learning (Bayesian updating), risk assessment (union bounds), quality control (independence assumptions), and randomized algorithms. In programming, we encode finite probability spaces with sets and maps or approximate them by Monte Carlo simulation. The axioms act as invariants we can test in code (sum to 1, non-negative, add over disjoint events), ensuring our models are coherent.

02 Intuition & Analogies

Hook: Think of probability as spreading a fixed amount of paint—exactly one bucket—over all possible outcomes. You can’t have negative paint, the total paint is one bucket, and if two patches don’t overlap, the paint on their union is just the sum on each patch.

Concept: The sample space Ω is the canvas of all outcomes. Events are regions on the canvas (subsets of Ω). Probability P is the amount of paint on an event. Non-negativity means you never owe paint: P(A) ≄ 0. Normalization says all paint is used: P(Ω) = 1. Countable additivity says if events don’t overlap, then the paint on their union equals the sum of their paints. From this, intuition about complements and overlaps follows naturally: if A takes up some paint, what’s left over (A’s complement) must be 1 − P(A). If two events overlap, adding their paints double-counts the overlap, so we subtract it once (inclusion–exclusion).

Example: Picture two translucent stickers on a window: A (even roll) and B (≄ 4) for a die. The combined tint from sticking both equals the tint of A plus the tint of B minus the area they overlap. If the stickers never overlap (mutually exclusive events), their tints just add. Conditioning is like putting on polarized glasses: you restrict your view to B and then re-measure A’s tint relative to what you can still see—this is P(A | B). Independence is when two stickers’ shades are arranged so that the overlap’s tint is exactly the product of their individual tints, reflecting no informative alignment between them. This mental model helps explain why the rules look the way they do.

03 Formal Definition

Let (Ω, F) be a measurable space, where Ω is a sample space and F is a σ-algebra of subsets of Ω (the events). A probability measure is a function P : F → [0,1] such that:

1) Non-negativity: for all A ∈ F, P(A) ≄ 0.
2) Normalization: P(Ω) = 1.
3) Countable additivity (σ-additivity): for any countable collection of pairwise disjoint events A₁, A₂, 
 in F, P(A₁ âˆȘ A₂ âˆȘ ⋯) = P(A₁) + P(A₂) + ⋯.

From these axioms one can derive: the complement rule P(Aᶜ) = 1 − P(A); monotonicity A ⊆ B ⇒ P(A) ≀ P(B); inclusion–exclusion for finite unions; and continuity from above/below. Conditional probability is defined, for B with P(B) > 0, by P(A ∣ B) = P(A ∩ B) / P(B). Independence for events A, B means P(A ∩ B) = P(A)P(B). A partition {Bᔹ} with ⋃ᔹ Bᔹ = Ω and pairwise disjoint Bᔹ yields the law of total probability: P(A) = ∑ᔹ P(A ∣ Bᔹ)P(Bᔹ). Bayes’ theorem inverts conditioning: P(Bⱌ ∣ A) = P(A ∣ Bⱌ)P(Bⱌ) / ∑ᔹ P(A ∣ Bᔹ)P(Bᔹ). These constructions remain grounded in the three axioms, ensuring internal consistency of the derived rules.
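As a worked illustration of how derived rules fall out of the three axioms, here are short derivations of the complement rule and monotonicity:

```latex
% Complement rule: A and A^c are disjoint and cover Omega, so additivity applies.
1 = P(\Omega) = P(A \cup A^c) = P(A) + P(A^c)
\;\Rightarrow\; P(A^c) = 1 - P(A).

% Monotonicity: for A \subseteq B, write B as a disjoint union.
B = A \cup (B \setminus A), \quad A \cap (B \setminus A) = \emptyset
\;\Rightarrow\; P(B) = P(A) + P(B \setminus A) \ge P(A).
```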

04 When to Use

Use the axioms whenever you model uncertainty in a principled way. They define the legal moves for building complex probabilities from simpler ones.

  • Finite models: Cards, dice, lotteries, or any discrete system with countable outcomes. Represent Ω explicitly, assign a probability mass function (PMF), and use additivity for unions and intersections.
  • Risk aggregation: Estimating the chance of at least one failure among many components. Apply the union bound or inclusion–exclusion; assume independence only when justified.
  • Classification and diagnostics: Apply conditional probability, the law of total probability, and Bayes’ theorem to fuse evidence with prior knowledge (e.g., medical tests with known sensitivity/specificity and disease prevalence).
  • Randomized algorithms and simulations: Use Monte Carlo to approximate probabilities that are hard to compute exactly; check normalization and consistency numerically to detect implementation errors.
  • Probabilistic reasoning pipelines: Use the chain rule to factor joint probabilities as products of conditionals (the basis of Bayesian networks and hidden Markov models). The axioms ensure the factors combine into a coherent joint distribution.

Example: To compute the probability a password is compromised by either phishing or brute force, model each event, decide whether they are independent, and compute P(phishing âˆȘ brute) by inclusion–exclusion. To interpret a positive spam filter result, use Bayes’ theorem to compute the posterior probability of spam given the classifier’s scores.

⚠Common Mistakes

  1. Confusing independence with mutual exclusivity: Mutually exclusive non-trivial events cannot be independent because P(A ∩ B) = 0 but P(A)P(B) > 0 if both have positive probability.
  2. Forgetting to subtract overlaps: Using P(A âˆȘ B) = P(A) + P(B) even when A and B are not disjoint. Always apply inclusion–exclusion or compute intersections explicitly.
  3. Ignoring base rates in Bayes’ theorem: Overestimating P(disease | positive) by focusing only on sensitivity/specificity and neglecting low prevalence.
  4. Misusing conditional probability when P(B) = 0: P(A | B) is undefined if P(B) = 0; handle such cases explicitly in code.
  5. Over-trusting small-sample simulations: Monte Carlo estimates with small N can be noisy; report confidence intervals and use enough samples.
  6. Floating-point pitfalls: Sums like P(A) + P(A^c) may be 0.9999999 due to rounding; use tolerances. Normalize PMFs to sum to 1.
  7. Hidden assumptions of independence: Applying P(A ∩ B) = P(A)P(B) without justification can severely bias results. Validate with data or domain knowledge.

Example fixes: For unions, compute P(A) + P(B) − P(A ∩ B). For Bayes, include the prior P(disease). In code, check non-negativity, sum-to-one within an epsilon (e.g., 1e−12), and guard P(B) > 0 before dividing.

Key Formulas

Non-negativity Axiom

P(A)≄0

Explanation: Every event has a probability that is zero or positive. Negative probabilities are not allowed.

Normalization Axiom

P(Ω)=1

Explanation: The total probability over the entire sample space is 1, meaning something in the sample space must occur.

Countable Additivity Axiom

P(A₁ âˆȘ A₂ âˆȘ ⋯) = P(A₁) + P(A₂) + ⋯  for pairwise disjoint A₁, A₂, 


Explanation: If events do not overlap, the probability of their union is the sum of their probabilities, even for countably many events.

Complement Rule

P(Ac)=1−P(A)

Explanation: The probability of not A is the leftover mass after accounting for A. Useful when A is complicated but its complement is simple.

Monotonicity

A⊆B⇒P(A)≀P(B)

Explanation: B includes at least all outcomes in A, so its probability cannot be smaller. This follows from the axioms.

Inclusion–Exclusion (Two Sets)

P(AâˆȘB)=P(A)+P(B)−P(A∩B)

Explanation: Adds probabilities of A and B but corrects for double-counting the intersection. Essential for overlapping events.

Union Bound (Boole)

P(A₁ âˆȘ ⋯ âˆȘ Aₙ) ≀ P(A₁) + ⋯ + P(Aₙ)

Explanation: Provides a simple upper bound on the probability of at least one event happening. Useful when intersections are hard to compute.

Conditional Probability

P(A ∣ B) = P(A ∩ B) / P(B)  (P(B) > 0)

Explanation: Reweights the sample space to the condition B. Foundational for reasoning with partial information.

Multiplication Rule

P(A∩B)=P(A∣B)P(B)=P(B∣A)P(A)

Explanation: Expresses joint probability via a conditional and a marginal. Basis for the chain rule.

Law of Total Probability

P(A) = ∑ᔹ P(A ∣ Bᔹ)P(Bᔹ)  for a partition {Bᔹ}

Explanation: Decomposes P(A) across disjoint cases that cover the sample space. Useful for case analysis.

Bayes’ Theorem (Finite Partition)

P(Bⱌ ∣ A) = P(A ∣ Bⱌ)P(Bⱌ) / ∑ᔹ P(A ∣ Bᔹ)P(Bᔹ)

Explanation: Updates prior beliefs P(Bj​) after seeing evidence A using likelihoods P(A | Bj​). Core to Bayesian inference.

Independence Criterion

P(A ∩ B) = P(A)P(B)  iff A and B are independent

Explanation: Defines independence: knowing one event doesn’t change the probability of the other.

Chain Rule

P(A₁ ∩ ⋯ ∩ Aₙ) = P(A₁) P(A₂ ∣ A₁) P(A₃ ∣ A₁ ∩ A₂) ⋯ P(Aₙ ∣ A₁ ∩ ⋯ ∩ Aₙ₋₁)

Explanation: Breaks a joint probability into a product of conditional probabilities. Fundamental in graphical models and HMMs.

Inclusion–Exclusion (Three Sets)

P(A âˆȘ B âˆȘ C) = P(A) + P(B) + P(C) − P(A∩B) − P(A∩C) − P(B∩C) + P(A∩B∩C)

Explanation: Adds singles, subtracts pairwise overlaps, and adds back the triple overlap. Extends to larger finite unions.

Complexity Analysis

For finite probability spaces represented explicitly, let n = |Ω| and let events be stored as hash sets. Computing P(A) by summing PMF entries over an event of size k runs in O(k) time and O(1) auxiliary space beyond the event storage. Computing intersections or unions of two events of sizes a and b using hash sets takes O(min(a, b)) expected time for intersection (probing membership in the larger set) and up to O(a + b) for unions, with O(a + b) space to store the resulting set if materialized.

Checking normalization involves summing over all outcomes once: O(n) time and O(1) space. Verifying additivity for two disjoint events requires computing P(A), P(B), and P(A âˆȘ B); constructing A âˆȘ B costs O(a + b) time, or you can compute P(A âˆȘ B) directly as P(A) + P(B) once disjointness has been checked in O(min(a, b)). Inclusion–exclusion for overlapping events adds one intersection computation, O(min(a, b)).

Monte Carlo simulation draws N samples and evaluates indicator functions for events, running in O(N) time and O(1) additional space when streaming counts. The standard error of an estimated probability p̂ scales like O(√(p(1−p)/N)), so halving the error typically requires quadrupling N. Random number generation costs O(1) per sample on average with modern engines (e.g., mt19937_64). In all examples, memory is dominated by storing the PMF (O(n)) and any explicit event sets (O(k)). For large or continuous spaces, explicit enumeration is infeasible; one then uses factored representations (e.g., graphical models) or sampling methods, but the axioms and derived rules still govern correctness.

Code Examples

Finite Probability Space: Verify Axioms and Inclusion–Exclusion on a Fair Die
#include <bits/stdc++.h>
using namespace std;

// A simple finite probability space with outcomes of type int
struct ProbabilitySpace {
    unordered_map<int, double> pmf; // P({w}) for each outcome w

    // Check normalization and non-negativity within a tolerance
    bool validate(double eps = 1e-12) const {
        double sum = 0.0;
        for (auto &kv : pmf) {
            if (kv.second < -eps) return false; // negative mass
            sum += kv.second;
        }
        return fabs(sum - 1.0) <= 1e-10; // allow small floating error
    }

    // Compute probability of an event represented as a set of outcomes
    double prob(const unordered_set<int> &event) const {
        double p = 0.0;
        for (int w : event) {
            auto it = pmf.find(w);
            if (it != pmf.end()) p += it->second;
        }
        return p;
    }
};

// Set utilities
unordered_set<int> set_union_us(const unordered_set<int>& A, const unordered_set<int>& B) {
    unordered_set<int> U; U.reserve(A.size() + B.size());
    for (int x : A) U.insert(x);
    for (int x : B) U.insert(x);
    return U;
}

unordered_set<int> set_intersection_us(const unordered_set<int>& A, const unordered_set<int>& B) {
    const unordered_set<int> *S = &A, *L = &B;
    if (A.size() > B.size()) { S = &B; L = &A; }
    unordered_set<int> I; I.reserve(S->size());
    for (int x : *S) if (L->count(x)) I.insert(x);
    return I;
}

bool disjoint(const unordered_set<int>& A, const unordered_set<int>& B) {
    const unordered_set<int> *S = &A, *L = &B;
    if (A.size() > B.size()) { S = &B; L = &A; }
    for (int x : *S) if (L->count(x)) return false;
    return true;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Build a fair die probability space: outcomes 1..6 each with 1/6
    ProbabilitySpace ps;
    for (int w = 1; w <= 6; ++w) ps.pmf[w] = 1.0/6.0;

    cout << boolalpha;
    cout << "Valid PMF? " << ps.validate() << "\n"; // should be true

    // Define events
    unordered_set<int> A = {2,4,6}; // even
    unordered_set<int> B = {4,5,6}; // >= 4
    unordered_set<int> C = {1};     // singleton
    unordered_set<int> D = {2,3};   // disjoint with C

    // Compute basic probabilities
    double pA = ps.prob(A);
    double pB = ps.prob(B);
    auto AUB = set_union_us(A, B);
    auto AIB = set_intersection_us(A, B);
    double pAUB = ps.prob(AUB);
    double pAIB = ps.prob(AIB);

    cout << fixed << setprecision(6);
    cout << "P(A)=" << pA << ", P(B)=" << pB << "\n";
    cout << "P(A∩B)=" << pAIB << ", P(AâˆȘB)=" << pAUB << "\n";

    // Verify inclusion–exclusion: P(AâˆȘB) = P(A)+P(B)-P(A∩B)
    double lhs = pAUB;
    double rhs = pA + pB - pAIB;
    cout << "Inclusion–Exclusion holds? diff=" << fabs(lhs - rhs) << "\n";

    // Verify additivity for disjoint events C and D
    cout << "C and D disjoint? " << disjoint(C, D) << "\n";
    auto CUD = set_union_us(C, D);
    double pC = ps.prob(C), pD = ps.prob(D), pCUD = ps.prob(CUD);
    cout << "P(CâˆȘD)=" << pCUD << ", P(C)+P(D)=" << (pC + pD) << ", diff=" << fabs(pCUD - (pC + pD)) << "\n";

    // Complement rule for A
    unordered_set<int> Omega = {1,2,3,4,5,6};
    unordered_set<int> Ac;
    for (int w : Omega) if (!A.count(w)) Ac.insert(w);
    double pAc = ps.prob(Ac);
    cout << "P(A^c)=" << pAc << ", 1-P(A)=" << (1.0 - pA) << ", diff=" << fabs(pAc - (1.0 - pA)) << "\n";

    return 0;
}

We construct a finite probability space for a fair die and implement utilities to compute probabilities of events, unions, intersections, and complements. We validate the PMF (non-negativity and normalization), then check inclusion–exclusion for overlapping events and additivity for disjoint events. Small differences may arise only from floating-point rounding.

Time: O(n) to validate the PMF; O(a+b) to form unions; O(min(a,b)) to form intersections; O(k) to sum an event’s probability.
Space: O(n) for the PMF, O(a+b) for materialized unions, and O(k) for event storage.
Monte Carlo: Estimate P(A), P(B), P(A∩B), and Verify Inclusion–Exclusion
#include <bits/stdc++.h>
using namespace std;

struct Estimator {
    uint64_t countA = 0, countB = 0, countAB = 0, trials = 0;
    void add(bool A, bool B) {
        trials++;
        if (A) countA++;
        if (B) countB++;
        if (A && B) countAB++;
    }
    tuple<double,double,double> probs() const {
        double N = (double)trials;
        return {countA/N, countB/N, countAB/N};
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // RNG setup
    std::random_device rd;
    std::mt19937_64 gen(rd());
    std::uniform_int_distribution<int> die(1,6);

    const uint64_t N = 5'000'000ULL; // 5 million trials
    Estimator est;

    for (uint64_t i = 0; i < N; ++i) {
        int x = die(gen);      // roll a fair die
        bool A = (x % 2 == 0); // even
        bool B = (x >= 4);     // at least 4
        est.add(A, B);
    }

    auto [pA, pB, pAB] = est.probs();
    double pAUB_via_counts = pA + pB - pAB; // inclusion–exclusion

    cout << fixed << setprecision(6);
    cout << "Estimated P(A)=" << pA << ", P(B)=" << pB << ", P(A∩B)=" << pAB << "\n";
    cout << "Estimated P(AâˆȘB) via inclusion–exclusion = " << pAUB_via_counts << "\n";

    // Theoretical values for comparison
    cout << "Theoretical: P(A)=1/2, P(B)=1/2, P(A∩B)=1/3, P(AâˆȘB)=2/3\n";

    return 0;
}

This Monte Carlo simulation estimates probabilities by relative frequencies over 5 million die rolls. It verifies inclusion–exclusion numerically by computing P(A) + P(B) − P(A ∩ B) and compares with theoretical values. Increasing N tightens the estimates due to the law of large numbers.

Time: O(N) for N trials, constant work per trial.
Space: O(1) additional space, storing only counters.
Bayes’ Theorem: Exact Computation and Simulation Check (Medical Test)
#include <bits/stdc++.h>
using namespace std;

struct BayesResult {
    double prior;    // P(D)
    double sens;     // P(+ | D)
    double spec;     // P(- | D^c)
    double post_pos; // P(D | +)
    double post_neg; // P(D | -)
};

BayesResult compute_exact(double prior, double sens, double spec) {
    double pD = prior;
    double pDc = 1.0 - pD;
    double pPosGivenD = sens;
    double pPosGivenDc = 1.0 - spec;

    // Law of total probability for P(+)
    double pPos = pPosGivenD * pD + pPosGivenDc * pDc;
    double pNeg = 1.0 - pPos;

    // Bayes’ theorem (pPos > 0 and pNeg > 0 for these parameters)
    double pDgivenPos = (pPosGivenD * pD) / pPos;
    double pDgivenNeg = ((1.0 - pPosGivenD) * pD) / pNeg; // P(D | -)

    return {pD, sens, spec, pDgivenPos, pDgivenNeg};
}

pair<double,double> simulate(uint64_t N, double prior, double sens, double spec, std::mt19937_64 &gen) {
    bernoulli_distribution hasDisease(prior);
    bernoulli_distribution pos_if_D(sens);
    // If not diseased, the test is positive with probability (1 - specificity)
    bernoulli_distribution pos_if_notD(1.0 - spec);

    uint64_t countPos = 0, countDandPos = 0;
    uint64_t countNeg = 0, countDandNeg = 0;

    for (uint64_t i = 0; i < N; ++i) {
        bool D = hasDisease(gen);
        bool testPos = D ? pos_if_D(gen) : pos_if_notD(gen);
        if (testPos) {
            countPos++;
            if (D) countDandPos++;
        } else {
            countNeg++;
            if (D) countDandNeg++;
        }
    }

    double pDgivenPos_hat = countPos ? (double)countDandPos / (double)countPos : numeric_limits<double>::quiet_NaN();
    double pDgivenNeg_hat = countNeg ? (double)countDandNeg / (double)countNeg : numeric_limits<double>::quiet_NaN();

    return {pDgivenPos_hat, pDgivenNeg_hat};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Example parameters: low prevalence disease
    double prior = 0.01; // 1% prevalence
    double sens = 0.95;  // sensitivity
    double spec = 0.90;  // specificity

    auto exact = compute_exact(prior, sens, spec);

    cout << fixed << setprecision(6);
    cout << "Exact P(D|+)= " << exact.post_pos << ", P(D|-)= " << exact.post_neg << "\n";

    std::random_device rd; std::mt19937_64 gen(rd());
    uint64_t N = 10'000'000ULL; // 10 million patients
    auto sim = simulate(N, prior, sens, spec, gen);
    cout << "Simulated P(D|+)= " << sim.first << ", P(D|-)= " << sim.second << "\n";

    return 0;
}

We compute posteriors P(D | +) and P(D | −) exactly using Bayes’ theorem and then validate via Monte Carlo simulation. With low prevalence, even accurate tests yield modest P(D | +), demonstrating the base-rate effect. Large N reduces simulation noise.

Time: Exact: O(1). Simulation: O(N) for N sampled patients.
Space: O(1) additional space.
Tags: kolmogorov axioms, probability measure, sample space, sigma-algebra, inclusion–exclusion, conditional probability, Bayes theorem, law of total probability, independence, union bound, Monte Carlo, PMF, finite probability space, chain rule, complement rule