Earlier chapters discussed how the mathematical theory of probability is connected to the world through philosophical theories of probability, and reviewed the basic tool needed to discuss probability mathematically, Set Theory. This chapter introduces the mathematical theory of probability, in which probability is a function that assigns numbers between 0 and 100% to events, subsets of outcome space. Starting with just three axioms and a few definitions, the mathematical theory develops powerful and beautiful consequences. The chapter presents the axioms of probability and some consequences of the axioms. Conditional probability is then defined, which leads to two useful formulae (the Multiplication Rule and Bayes' Rule) and to the definition of independence. All these concepts and formulae play important roles in the sequel.

The Axioms of Probability

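The axioms of probability are mathematical rules that probability must satisfy. Let A and B be events. Let P(A) denote the probability of the event A. The axioms of probability are these three conditions on the function P:

1) The probability of every event is at least zero: P(A) ≥ 0 for every event A.
2) The probability of the entire outcome space is 100%: P(S) = 100%.
3) If two events are disjoint, the probability that either occurs is the sum of their probabilities: if AB = {}, then P(A∪B) = P(A) + P(B).

A more general version of the third axiom, which covers countable collections of events, is:

3') If {A₁, A₂, A₃, … } is a partition of the set A, then P(A) = P(A₁) + P(A₂) + P(A₃) + …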

Both axiom 3 and axiom 3' hold for every probability function used in this book. Any function P that assigns numbers to subsets of the outcome space S and satisfies the Axioms of Probability is called a probability distribution on S.

Let S be a set containing n>0 elements, for example, S= {1, 2, … , n}. For any subset A of S, define #A to be the number of elements of A. For example, #{} = 0, #{1, 2} = 2, and #{n, n−1, n−2} = 3. The function # is called the cardinality function and #A is called the cardinality of A.

The cardinality of a finite set is the number of elements it contains, so in this example, where S = {1, 2, 3, … , n}, #S = n.

Let P(A) = #A/n, the number of elements in the subset A, divided by the total number of elements in S. Then the function P is called the uniform probability distribution on S. The function P satisfies the axioms of probability. Let us see why.

The number of elements in any subset A of S is at least zero (#A≥0), so P(A) ≥ 0/n = 0. Thus P satisfies Axiom 1.
P(S) = #S/n = n/n = 100%. Thus P satisfies Axiom 2.
If A and B are disjoint, then the number of elements in the union A∪B is the number of elements in A plus the number of elements in B:

#(A∪B) = #A + #B.

Therefore,

P(A∪B) = #(A∪B)/n = (#A + #B)/n = #A/n + #B/n = P(A) + P(B).

Thus P satisfies Axiom 3.
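
The verification above can also be checked by brute force for a small outcome space. The following is a minimal sketch, assuming n = 4; the names `uniform_prob` and `all_subsets` are illustrative helpers, not part of the text.

```python
from fractions import Fraction
from itertools import combinations

n = 4                                   # assumed size of S, for illustration only
S = frozenset(range(1, n + 1))

def uniform_prob(A):
    """P(A) = #A / n for a subset A of S."""
    return Fraction(len(A), n)

def all_subsets(s):
    """Every subset of s, as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

subsets = all_subsets(S)

# Axiom 1: no event has negative probability.
axiom1 = all(uniform_prob(A) >= 0 for A in subsets)
# Axiom 2: the outcome space has probability 1 (that is, 100%).
axiom2 = uniform_prob(S) == 1
# Axiom 3: probabilities of disjoint events add.
axiom3 = all(uniform_prob(A | B) == uniform_prob(A) + uniform_prob(B)
             for A in subsets for B in subsets if not (A & B))

print(axiom1, axiom2, axiom3)
```

Exact fractions are used instead of floating-point numbers so that the equality in Axiom 3 is tested without rounding error.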

We shall use the uniform probability distribution very often. For example, we shall use the uniform probability distribution on the outcome space S = {0, 1} to model the number of heads in a single toss of a fair coin. We shall use the uniform probability distribution on the outcome space S = {1, 2, … , 6} to model the number of spots that show on the top face of a fair die when it is rolled. We shall use the uniform probability distribution on the outcome space S of the 36 pairs {(i, j): i = 1, 2, … , 6 and j = 1, 2, … , 6}

to model rolls of a fair pair of dice. We shall use the uniform probability distribution on the outcome space S of all 52! permutations of a deck of cards to model shuffling the deck well. We shall use the uniform probability distribution to model drawing a ticket from a well-stirred box of numbered tickets; in that case, the outcome space S is the collection of numbers written on the tickets (including duplicates as often as they occur on the tickets). The uniform probability distribution is the same as the distribution postulated by the Theory of Equally Likely Outcomes (if the outcomes are defined suitably).

Consider a random trial that can result in failure or success. Let 0 stand for failure, and let 1 stand for success. Then we can consider the outcome space to be S = {0, 1}. For any number p between 0 and 100%, define the function P as follows:

P({1}) = p,
P({0}) = 100% − p,
P(S) = 100%,
P({}) = 0.

Then P is a probability distribution on S, as we can verify by checking that it satisfies the axioms:

Because p is between 0 and 100%, so is 100% − p. The outcome space S has but four subsets: {}, {0}, {1}, and {0, 1}. The values assigned to them by P are 0, 100% − p, p, and 100%, respectively. All these numbers are zero or larger, so P satisfies Axiom 1.
By definition, P(S) = 100%, so P satisfies Axiom 2.
The empty set and any other set are disjoint, and it is easy to see that

P({}∪A) = P({}) + P(A) for any subset A of S.

The only other pair of disjoint events in S is {0} and {1}. We can calculate

P({0}∪{1}) = P(S) = 100% = (100% − p) + p = P({0}) + P({1}).

Thus P satisfies Axiom 3.

In later chapters this probability distribution will be the building block for more complex distributions involving sequences of trials.

Consequences of the Axioms of Probability

Everything that is mathematically true of probability is a consequence of the Axioms of Probability, and of further definitions. For example, if S is countable—that is, if its elements can be put into 1:1 correspondence with a subset of the integers—the sum of the probabilities of the elements of S must be 100%. This follows from Axioms 2 and 3': Axiom 3' tells us that because the elements of S partition S, the probability of S is the sum of the probabilities of the elements of S. Axiom 2 tells us that that sum must be 100%.

The Complement Rule

Another consequence of the axioms is the Complement Rule: The probability that an event occurs is always equal to 100% minus the probability that the event does not occur:

P(A) = 100% − P(A^c).

The Complement Rule is extremely useful, because in many problems it is much easier to calculate the probability that A does not occur than to calculate the probability that A does occur. The Complement Rule can be derived from the axioms: the union of A and its complement A^c is S (either A happens or it does not, and there is no other possibility), so

P(A∪A^c) = P(S) = 100%,

by axiom 2. The event A and its complement are disjoint (if "A does not happen" happens, A does not happen; if A happens, "A does not happen" does not happen), so

P(A∪A^c) = P(A) + P(A^c),

by axiom 3. Combining the two equations gives P(A) + P(A^c) = 100%; subtracting P(A^c) from both sides yields the Complement Rule.

Consider tossing a fair coin 10 times in such a manner that every sequence of 10 heads and/or tails is equally likely. What is the probability that the coin lands heads at least once?

This would be quite difficult to calculate directly, because there are very many ways in which the coin can land heads at least once. However, there is only one way the coin can fail to land heads at least once: All the tosses must yield tails. That makes it easy to calculate the probability that the coin lands heads at least once, using the Complement Rule.

Every sequence of heads and tails is equally likely, by assumption: The probability distribution is the uniform distribution on sequences of 10 heads and/or tails, so the probability of any particular sequence is 100%/(total number of sequences). By the Fundamental Rule of Counting, there are

2×2× … ×2 = 2¹⁰ = 1,024

sequences of 10 heads and tails.

One of those sequences is (tails, tails, … , tails), so the probability that the coin lands tails in all 10 tosses is

100%/2¹⁰ = 0.0977%.

By the complement rule, the probability that the coin lands heads at least once is therefore

100% − 0.0977% = 99.902%.
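
The coin-tossing calculation can be confirmed by listing every sequence. The sketch below counts the sequences that contain at least one head directly and compares the result with the Complement Rule; the variable names are illustrative.

```python
from itertools import product

tosses = 10
sequences = list(product("HT", repeat=tosses))   # all equally likely sequences
total = len(sequences)

# Direct count: sequences that contain at least one head.
at_least_one_head = sum(1 for seq in sequences if "H" in seq)
p_direct = at_least_one_head / total

# Complement Rule: 1 minus the chance of the single all-tails sequence.
p_complement = 1 - 1 / total

print(total, at_least_one_head)
print(f"{100 * p_direct:.3f}%  {100 * p_complement:.3f}%")
```

Both routes give the same probability; the Complement Rule simply avoids counting the 1,023 sequences that contain a head.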

A special case of the Complement Rule is that the probability of the empty set is always zero (P({}) = 0%), because P(S) = 100%, and S^c= {}.

An event A whose probability is 100% is said to be certain or sure. S is certain.

The Probability of the Union of Two Events

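The third Axiom of Probability tells us how to find the probability of a union of disjoint events in terms of their individual probabilities. The Axioms can be used together to find a formula for the probability of a union of two events that are not necessarily disjoint in terms of the probability of each of the events and the probability of their intersection.

Every outcome in A∪B is in exactly one of three sets: AB^c (outcomes in A but not in B), AB (outcomes in both A and B), and A^cB (outcomes in B but not in A). That is, the three sets partition A∪B. The third axiom implies that the chance that either A or B occurs is

P(A∪B) = P(AB^c) + P(AB) + P(A^cB).

Similarly, A is the union of the disjoint sets AB^c and AB, and B is the union of the disjoint sets AB and A^cB, so

P(A) + P(B) = P(AB^c) + P(AB) + P(AB) + P(A^cB).

This would be equal to P(A∪B), but for the fact that P(AB) is counted twice, not once. It follows that in general

P(A∪B) = P(A) + P(B) − P(AB).

This is a true statement, but it is not one of the axioms of probability. In the special case that AB = {}, this result is equivalent to the third axiom, because P({}) = 0%.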

Bounds on Probabilities

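Because P(AB) ≥ 0, the formula P(A∪B) = P(A) + P(B) − P(AB) implies that P(A∪B) ≤ P(A) + P(B). If A is a subset of B, then B is the union of the disjoint sets A and A^cB, so P(B) = P(A) + P(A^cB) ≥ P(A). Because AB is a subset of A and A is a subset of A∪B, it follows that for any events A and B,

0 ≤ P(AB) ≤ P(A) ≤ P(A∪B) ≤ P(A) + P(B).

More generally, if {A₁, A₂, A₃, … } is a countable collection of events, then for k = 1, 2, 3, … ,

0 ≤ P(A₁A₂A₃ …) ≤ P(A_k) ≤ P(A₁ ∪ A₂ ∪ A₃ ∪ …) ≤ P(A₁) + P(A₂) + P(A₃) + … .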

Useful Consequences of the Axioms of Probability

P({}) = 0.
For any event A, P(A^c) = 100% − P(A).
If S = { A₁, A₂, A₃, … , A_n }, then P(A₁) + P(A₂) + P(A₃) + … + P(A_n) = 100%.
If S = { A₁, A₂, A₃, … }, then P(A₁) + P(A₂) + P(A₃) + … = 100%.
For any events A and B,
- P(A∪B) = P(A) + P(B) − P(AB).
- 0 ≤ P(AB) ≤ P(A) ≤ P(A∪B) ≤ P(A) + P(B).
If {A₁, A₂, A₃, … } is a countable collection of events, then for k = 1, 2, 3, … ,
0 ≤ P(A₁A₂ A₃ …) ≤ P(A_k) ≤ P(A₁ ∪ A₂ ∪ A₃ ∪ …) ≤ P(A₁) + P(A₂) + P(A₃) + … .
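
Several of the consequences listed above can be checked on a concrete case. The sketch below uses the uniform distribution on one roll of a fair die; the events A and B are assumed examples chosen only for illustration.

```python
from fractions import Fraction

S = set(range(1, 7))          # one roll of a fair die
A = {2, 4, 6}                 # assumed example event: the roll is even
B = {4, 5, 6}                 # assumed example event: the roll is at least 4

def P(E):
    """Uniform probability of the event E, a subset of S."""
    return Fraction(len(E), len(S))

empty_set_rule = P(set()) == 0
complement_rule = P(S - A) == 1 - P(A)
union_rule = P(A | B) == P(A) + P(B) - P(A & B)
bounds = 0 <= P(A & B) <= P(A) <= P(A | B) <= P(A) + P(B)

print(empty_set_rule, complement_rule, union_rule, bounds)
```

Here `A | B` and `A & B` are Python's set union and intersection, which play the roles of A∪B and AB in the text.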

Probability is analogous to area or volume or mass. Consider the unit square, each of whose sides has length 1. Its total area is 1×1 = 1 = 100%. Let's call the square S, just like outcome space. Now consider regions inside the square S (subsets of S). The area of any such region is at least zero, the area of S is 100%, and the area of the union of two regions is the sum of their areas, if they do not overlap (i.e., if they are disjoint). These facts are direct analogues of the axioms of probability, and we shall often use this model to get intuition about probability.

It might help your intuition to consider the square S to be a dartboard. The experiment consists of throwing a dart at the board once. The event A occurs if the dart sticks in the set A. The event AB occurs if the dart sticks in both A and B on that one toss. Clearly, AB cannot occur unless A and B overlap—the dart cannot stick in two places at once. A∪B occurs if the dart sticks in either A or B (or both) on that one throw. A and B need not overlap for A∪B to occur.

This analogy is also useful for thinking about the connection between Set Theory and logical implication. If A is a subset of B, the occurrence of A implies the occurrence of B; we shall sometimes say that A implies B. In the dartboard model, the dart cannot stick in A without sticking in B as well, so if A occurs, B must occur also. If A implies B, AB = A, so P(AB) = P(A). If AB = {}, A implies B^c and B implies A^c: If the dart sticks in A it did not stick in B, and vice versa. If A implies B, then if B does not occur A cannot occur either: B^c implies A^c, so B^c is a subset of A^c.

Conditioning

In probability, conditioning means incorporating new restrictions on the outcome of an experiment: updating probabilities to take into account new information. This section describes conditioning, and how conditional probability can be used to solve complicated problems.

Conditional Probability

The conditional probability of A given B, P(A | B), is the probability of the event A, updated on the basis of the knowledge that the event B occurred. Suppose that AB = {} (A and B are disjoint). Then if we learn that B occurred, we know A did not occur, so we should revise the probability of A to be zero (the conditional probability of A given B is zero). On the other hand, suppose that AB = B (B is a subset of A, so B implies A). Then if we learn that B occurred, we know A must have occurred as well, so we should revise the probability of A to be 100% (the conditional probability of A given B is 100%). For in-between cases, the conditional probability of A given B is defined to be

P(A | B) = P(AB)/P(B),

provided P(B) is not zero (division by zero is undefined). "P(A | B)" is pronounced "the (conditional) probability of A given B."

Why does this formula make sense? First of all, note that it does agree with the intuitive answers we found above: if AB = {}, then P(AB) = 0, so

P(A | B) = P(AB)/P(B) = 0/P(B) = 0.

If AB = B, then P(AB) = P(B), so

P(A | B) = P(AB)/P(B) = P(B)/P(B) = 100%.

Similarly, if we learned that S occurred, this is not really new information (by definition, S always occurs, because it contains all possible outcomes), so we would like P(A | S) to equal P(A). That is how it works out: AS = A, so

P(A | S) = P(AS)/P(S) = P(A)/100% = P(A).

Now suppose that A and B are not disjoint. Then if we learn that B occurred, we can restrict attention to just those outcomes that are in B, and disregard the rest of S, so we have a new outcome space that is just B. We need P(B) = 100% to consider B an outcome space; we can make this happen by dividing all probabilities by P(B). For A to have occurred in addition to B requires that AB occurred, so the conditional probability of A given B is P(AB)/P(B), just as we defined it above.

We shall deal two cards from a well-shuffled deck. What is the conditional probability that the second card is an Ace (event A), given that the first card is an Ace (event B)?

Solution. By definition, this is P(AB)/P(B). The (unconditional) chance that the first card is an Ace is 100%/13 = 7.7%, because there are 13 possible faces for the first card, and all are equally likely (this is what we mean by a well-shuffled deck).

The chance that both cards are Aces can be computed as follows: From the four suits, we need to pick two; there are ₄C₂ = 6 ways that can happen. The total number of ways of picking two cards from the deck is ₅₂C₂ = 52×51/2 = 1326, so the chance that the two cards are both Aces is (6/1326)×100% = 0.45%, or about 0.5%. The conditional probability that the second card is an Ace given that the first card is an Ace is thus (6/1326)/(1/13) = 1/17 = 5.9%. (Dividing the rounded percentages, 0.5%/7.7%, would give a slightly different answer; the exact fractions should be used.) As we might expect, it is somewhat lower than the chance that the first card is an Ace, because we know one of the Aces is gone.

We could approach this more intuitively as well: Given that the first card is an Ace, the second card is an Ace too if it is one of the three remaining Aces among the 51 remaining cards. These possibilities are equally likely if the deck was shuffled well, so the chance is 3/51 × 100% = 5.9%.
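
Both approaches can be checked by enumerating every ordered deal of two cards. The sketch below is one way to do so under the assumption that all ordered pairs of distinct cards are equally likely; the deck is represented only by whether each card is an Ace, and the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Represent the deck only by whether each card is an Ace: 4 Aces, 48 others.
deck = ["Ace"] * 4 + ["other"] * 48

# All ordered ways to deal two distinct cards, identified by deck position.
deals = list(permutations(range(len(deck)), 2))

first_ace = [d for d in deals if deck[d[0]] == "Ace"]
both_aces = [d for d in first_ace if deck[d[1]] == "Ace"]

p_B = Fraction(len(first_ace), len(deals))     # P(first card is an Ace)
p_AB = Fraction(len(both_aces), len(deals))    # P(both cards are Aces)
p_A_given_B = p_AB / p_B                       # definition of conditional probability

print(p_B, p_AB, p_A_given_B)
```

The conditional probability comes out as the exact fraction 1/17, which equals 3/51 and is about 5.9%.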

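Conditional probability behaves just like probability: It satisfies the axioms of probability and all their consequences. Thus, for example,

P(A^c | B) = 100% − P(A | B),

and

P(A∪B | C) = P(A | C) + P(B | C) − P(AB | C),

provided the probability of the conditioning event is not zero.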

Independence

Two events are independent if learning that one occurred gives us no information about whether the other occurred. That is, A and B are independent if P(A | B) = P(A) and P(B | A) = P(B). A slightly more general way to write this is that A and B are independent if P(AB) = P(A)×P(B). (This covers the cases that P(A), P(B) or both are equal to zero, while the definition of independence in terms of conditional probability requires the probability in the denominator to be different from zero.) To reiterate: Two events are independent if and only if the probability that both events happen simultaneously is the product of their unconditional probabilities. If two events are not independent, they are dependent.

Independence and Mutual Exclusivity Are Different! In fact, the only way two events can be both mutually exclusive and independent is if at least one of them has probability equal to zero. If A and B are mutually exclusive, learning that B happened tells us that A did not happen. This is clearly informative: The conditional probability of A given B is zero! This changes the (conditional) probability of A unless its (unconditional) probability was zero.

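Independent events bear a special relationship to each other. Independence is a very precise point between being disjoint (so that the occurrence of one event implies that the other did not occur), and one event being a subset of the other (so that the occurrence of one event implies the occurrence of the other). Here is a summary of the contrast between independent events and mutually exclusive events:

If A and B are mutually exclusive, P(AB) = 0 and P(A∪B) = P(A) + P(B); if in addition P(B) > 0, then P(A | B) = 0.
If A and B are independent, P(AB) = P(A)×P(B) and P(A∪B) = P(A) + P(B) − P(A)×P(B); if in addition P(B) > 0, then P(A | B) = P(A).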

Picture a Venn diagram that represents two events, A and B, as subsets of a rectangle S, with the probabilities of the events proportional to their areas. Suppose that the probability of A is 30% and the probability of B is 20%. For A and B to be independent, the area of their intersection must equal the product of their areas, so that P(AB) = P(A)×P(B) = 30%×20% = 6%. Any other amount of overlap makes A and B dependent: Independence is a very special relationship between events.

What kinds of events are (generally assumed to be) independent? The outcomes of successive fair tosses of a fair coin, the outcomes of random draws from a box with replacement, etc. Draws without replacement are dependent, because what can happen on a given draw depends on what happens on previous draws. The next two examples illustrate the contrast between independent and dependent events.

Suppose I have a box with four tickets in it, labeled 1, 2, 3, and 4. I stir the tickets and then draw one from the box, stir the remaining tickets again without returning the ticket I drew the first time, and draw another ticket. Consider the event A = {I get the ticket labeled 1 on the first draw} and the event B = {I get the ticket labeled 2 on the second draw}. Are A and B dependent or independent?

Solution: The chance that I get the 1 on the first draw is 25%. The chance that I get the 2 on the second draw is 25%. The chance that I get the 2 on the second draw given that I get the 1 on the first draw is 33%, which is much larger than the unconditional chance that I draw the 2 the second time. Thus A and B are dependent.

Now suppose that I replace the ticket I got on the first draw and stir the tickets again before drawing the second time. Then the chance that I get the 1 on the first draw is 25%, the chance that I get the 2 on the second draw is 25%, and the conditional chance that I get the 2 on the second draw given that I drew the 1 the first time is also 25%. A and B are thus independent if I draw with replacement.
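
The two ticket examples can be compared side by side by enumerating the equally likely pairs of draws. This is a minimal sketch; the helper `event_probs` is a hypothetical name introduced here, and independence is tested with the product form P(AB) = P(A)×P(B).

```python
from fractions import Fraction
from itertools import permutations, product

tickets = [1, 2, 3, 4]

def event_probs(outcomes):
    """Return P(A), P(B), P(AB) for equally likely (first, second) draws,
    where A = {first draw is 1} and B = {second draw is 2}."""
    n = len(outcomes)
    p_a = Fraction(sum(1 for f, s in outcomes if f == 1), n)
    p_b = Fraction(sum(1 for f, s in outcomes if s == 2), n)
    p_ab = Fraction(sum(1 for f, s in outcomes if f == 1 and s == 2), n)
    return p_a, p_b, p_ab

# Without replacement: ordered pairs of distinct tickets.
pa_wo, pb_wo, pab_wo = event_probs(list(permutations(tickets, 2)))
# With replacement: every ordered pair, repeats allowed.
pa_w, pb_w, pab_w = event_probs(list(product(tickets, repeat=2)))

independent_without = pab_wo == pa_wo * pb_wo
independent_with = pab_w == pa_w * pb_w

print(pa_wo, pb_wo, pab_wo, independent_without)
print(pa_w, pb_w, pab_w, independent_with)
```

Without replacement, P(AB) is 1/12 rather than 1/16, so the events are dependent; with replacement, P(AB) equals the product of the two 25% chances.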

Two fair dice are rolled independently; one is blue, the other is red. What is the chance that the number of spots that show on the red die is less than the number of spots that show on the blue die?

Solution: The event that the number of spots that show on the red die is less than the number that show on the blue die can be broken up into mutually exclusive events, according to the number of spots that show on the blue die. The chance that the number of spots that show on the red die is less than the number that show on the blue die is the sum of the chances of those simpler events. If only one spot shows on the blue die, the number that shows on the red die cannot be smaller, so the probability is zero. If two spots show on the blue die, the number that shows on the red die is smaller if the red die shows exactly one spot. Because the numbers of spots that show on the blue and red dice are independent, the chance that the blue die shows two spots and the red die shows one spot is (1/6)(1/6) = 1/36. If three spots show on the blue die, the number that shows on the red die is smaller if the red die shows one or two spots. The chance that the blue die shows three spots and the red die shows one or two spots is (1/6)(2/6) = 2/36. If four spots show on the blue die, the number that show on the red die is smaller if the red die shows one, two, or three spots; the chance that the blue die shows four spots and the red die shows one, two, or three spots is (1/6)(3/6) = 3/36.

Proceeding similarly for the cases that the blue die shows five or six spots gives the ultimate result:

P(red die shows fewer spots than the blue die) = 1/36 + 2/36 + 3/36 + 4/36 + 5/36 = 15/36.

Alternatively, one could just count the ways: There are 36 possibilities, which can be written in a square table as follows.

The 36 possible outcomes of rolling two dice (each entry is red, blue)
Red Die \ Blue Die	1	2	3	4	5	6
1	1,1	1,2	1,3	1,4	1,5	1,6
2	2,1	2,2	2,3	2,4	2,5	2,6
3	3,1	3,2	3,3	3,4	3,5	3,6
4	4,1	4,2	4,3	4,4	4,5	4,6
5	5,1	5,2	5,3	5,4	5,5	5,6
6	6,1	6,2	6,3	6,4	6,5	6,6

The outcomes above the diagonal comprise the event whose probability we seek. There are 36 outcomes in all, of which 6 are on the diagonal. Half of the remaining 36 − 6 = 30 are above the diagonal; half of 30 is 15. The 36 outcomes are equally likely, so the chance is 15/36. The outcomes (1,4), (2,4) and (3,4) comprise one of the mutually exclusive pieces used in the computation above: namely, the three ways the red die can show a smaller number of spots than the blue die when the blue die shows exactly 4 spots.
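
Both ways of reaching 15/36 can be reproduced by enumeration. The sketch below lists the 36 (red, blue) pairs, counts those with red less than blue, and also breaks the count down by the number of spots on the blue die; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))      # (red, blue) pairs
favorable = [(r, b) for r, b in outcomes if r < b]   # red shows fewer spots

# Mutually exclusive pieces: number of favorable outcomes for each blue value.
by_blue = {b: sum(1 for r, bb in favorable if bb == b) for b in range(1, 7)}

p = Fraction(len(favorable), len(outcomes))

print(by_blue)
print(len(favorable), p)
```

Note that `Fraction` reduces 15/36 to lowest terms, so the probability prints as 5/12.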

The Multiplication Rule

We can rearrange the definition of conditional probability to solve for the probability that both A and B occur (that AB occurs) in terms of the probability that B occurs and the conditional probability of A given B:

P(AB) = P(A | B) × P(B).

This is called the Multiplication Rule. The following two examples illustrate the Multiplication Rule.

A deck of cards is shuffled well, then two cards are drawn. What is the chance that both cards are aces?

Solution: Apply the Multiplication Rule.

P(card 1 is an Ace and card 2 is an Ace) = P(card 2 is an Ace | card 1 is an Ace)×P(card 1 is an Ace)

= 3/51 × 4/52 = 0.5%.

You can see that the Multiplication Rule can save you a lot of time!

Suppose there is a 50% chance that you catch the 8:00am bus. If you catch the bus, you will be on time. If you miss the bus, there is a 70% chance that you will be late. What is the chance that you will be late?

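Solution: Apply the Multiplication Rule. Because you will be on time whenever you catch the bus, you can be late only if you miss the bus, so

P(late) = P(miss the bus and late) = P(late | miss the bus) × P(miss the bus)

= 70% × 50% = 35%.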
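
The bus calculation can be written out so that the case of catching the bus appears explicitly. This is a small sketch using the numbers from the example; the zero chance of being late after catching the bus encodes the statement that you will then be on time, and the variable names are illustrative.

```python
from fractions import Fraction

p_catch = Fraction(1, 2)                 # chance of catching the 8:00am bus
p_miss = 1 - p_catch                     # Complement Rule
p_late_given_catch = Fraction(0)         # on time whenever the bus is caught
p_late_given_miss = Fraction(7, 10)      # chance of being late after missing it

# Multiplication Rule applied to each mutually exclusive case, then summed.
p_late = p_late_given_catch * p_catch + p_late_given_miss * p_miss

print(p_late, float(p_late))
```

Because the first term is zero, the sum reduces to the single product used in the solution above.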

Bayes' Rule

Bayes' Rule is a formula that expresses P(A | B) in terms of P(B | A), P(B | A^c), P(A), and P(A^c):

P(A | B) = P(B | A)×P(A)/( P(B | A)×P(A) + P(B | A^c)×P(A^c) ).

Bayes' Rule is useful to find the conditional probability of A given B in terms of the conditional probability of B given A, which is the more natural quantity to measure in some problems, and the easier quantity to compute in some problems. For example, in screening for a disease, the natural way to calibrate a test is to see how well it does at detecting the disease when the disease is present, and to see how often it raises false alarms when the disease is not present. These are, respectively, the conditional probability of detecting the disease given that the disease is present, and the conditional probability of incorrectly raising an alarm given that the disease is not present. However, the interesting quantity for an individual is the conditional chance that he or she has the disease, given that the test raised an alarm. An example will help.

Suppose that 10% of a given population has benign chronic flatulence. Suppose that there is a standard screening test for benign chronic flatulence that has a 90% chance of correctly detecting that one has the disease, and a 10% chance of a false positive (erroneously reporting that one has the disease when one does not). We pick a person at random from the population (so that everyone has the same chance of being picked) and test him/her. The test is positive. What is the chance that the person has the disease?

Solution: We shall combine several things we have learned. Let D be the event that the person has the disease, and let T be the event that the person tests positive for the disease. The problem statement told us that:

P(D) = 10%.
P(T | D) = 90%.
P(T | D^c) = 10%.

The problem asks us to find P(D | T) = P(DT)/P(T). We shall find P(T) by partitioning T into two mutually exclusive pieces, DT and D^cT, corresponding to testing positive and having the disease (DT) and testing positive falsely (D^cT). Then P(T) is the sum of P(DT) and P(D^cT). We will find those two probabilities using the Multiplication Rule. We need P(DT) for the numerator, and it will be one of the terms in the denominator as well. The probability of DT is, by the Multiplication Rule,

P(DT) = P(T | D) × P(D) = 90% × 10% = 9%.

The probability of D^cT is, by the multiplication rule and the complement rule,

P(D^cT) = P(T | D^c) × P(D^c) = P(T | D^c) × (100% − P(D) ) = 10% × 90% = 9%.

By the third axiom,

P(T) = P(DT) + P(D^cT) = 9% + 9% = 18%,

because DT and D^cT are mutually exclusive. Finally, plugging in the definition of P(D | T) gives:

P(D | T) = P(DT)/P(T) = 9%/18% = 50%.

Because only a small fraction of the population actually have benign chronic flatulence, the chance that a positive test result for someone selected at random from the population is a false positive is 50%, even though the test is 90% accurate. The computation we just made is equivalent to using Bayes' rule:

P(D | T) = P(T | D)×P(D)/(P(T | D)×P(D) + P(T | D^c)×P(D^c) )

= 90%×10%/( 90%×10% + 10%×90%)

= 50%.
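
The same computation can be packaged as a small function, which makes it easy to see how the answer depends on the base rate. This is a sketch, not a standard library routine: the function name `bayes` and its parameter names are introduced here for illustration, and the 1% base rate in the second call is a hypothetical value that does not come from the example.

```python
from fractions import Fraction

def bayes(prior, p_pos_given_d, p_pos_given_not_d):
    """P(D | T) from P(D), P(T | D), and P(T | D^c), by Bayes' Rule."""
    numerator = p_pos_given_d * prior                         # P(DT)
    denominator = numerator + p_pos_given_not_d * (1 - prior)  # P(T)
    return numerator / denominator

# Numbers from the example: P(D) = 10%, P(T | D) = 90%, P(T | D^c) = 10%.
p_example = bayes(Fraction(10, 100), Fraction(90, 100), Fraction(10, 100))

# Hypothetical variation: the same test with an assumed base rate of 1%.
p_rare = bayes(Fraction(1, 100), Fraction(90, 100), Fraction(10, 100))

print(p_example, p_rare)
```

With the assumed 1% base rate the chance of disease given a positive test falls to 1/12, which shows how strongly the result depends on P(D).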

The Base Rate Fallacy consists of ignoring P(A) or P(B) in computing P(B | A) from P(A | B) and P(A | B^c). For instance, in the example above, the base rate for benign chronic flatulence is 10%. The test is 90% accurate (both for false positives and for false negatives). The base rate fallacy is to conclude that since the test is 90% accurate, it must be true that 90% of people who test positive in fact have the disease, ignoring the base rate of the disease in the population and the frequency of false positive test results. We just saw that that conclusion is wrong: if people are tested at random, of those who test positive, only 50% have the disease, on average.

Summary

The Axioms of Probability are mathematical rules that must be followed in assigning probabilities to events: The probability of an event cannot be negative, the probability that something happens must be 100%, and if two events cannot both occur, the probability that either occurs is the sum of the probabilities that each occurs. A function that assigns numbers to events and satisfies the axioms is called a probability distribution.

The axioms have numerous consequences, including the following: The probability of the empty set is zero. The probability that a given event does not occur is 100% minus the probability that the event occurs. The probability that either of two events occurs is the sum of the probabilities that each occurs, minus the probability that both occur. The probability that either of two events occurs is at least as large as the probability that each occurs, and no larger than the sum of the probabilities that each occurs. The probability that two events both occur is no larger than either of their individual probabilities.

Conditioning describes updating probabilities to incorporate new knowledge. For example, how should we update the probability of the event A if we learn that the event B occurs? The updated probability is the conditional probability of A given B, which is equal to the probability that A and B both occur, divided by the probability that B occurs, provided that the probability that B occurs is not zero. Conditional probability satisfies the axioms of probability.

Rearranging the definition of conditional probability yields the Multiplication Rule: The probability that A and B both occur is the conditional probability of A given B, times the probability that B occurs. Two events are independent if the occurrence of one is uninformative with respect to the occurrence of the other: if P(A | B) = P(A). A slightly more general definition is that A and B are independent if P(AB) = P(A)×P(B).

Bayes' Rule expresses P(A | B) in terms of P(B | A), P(B | A^c), and P(A), which in some problems are easier to calculate than P(A | B). Bayes' Rule says that

P(A | B) = P(B | A)×P(A)/( P(B | A)×P(A) + P(B | A^c)×P(A^c) ).
