Learning Outcomes
By the end of this section, you should be able to:
- 3.4.1 Describe the basic concepts of probability and apply these concepts to real-world applications in data science.
- 3.4.2 Apply conditional probability and Bayes’ Theorem.
Probability is a numerical measure that assesses the likelihood of occurrence of an event. Probability applications are ubiquitous in data science since many decisions in business, science, and engineering are based on probability considerations. We all use probability calculations every day as we decide, for instance, whether to take an umbrella to work, the optimal route for a morning commute, or the choice of a college major.
Basic Concepts of Probability
We have all used probability in one way or another on a day-to-day basis. Before leaving the house, you might want to know the probability of rain. The probability of obtaining heads on one flip of a coin is one-half, or 0.5.
A data scientist is interested in expressing probability as a number between 0 and 1 (inclusive), where 0 indicates impossibility (the event will not occur) and 1 indicates certainty (the event will occur). A probability between 0 and 1 reflects the degree of uncertainty associated with the event.
Here is some terminology we will be using in probability-related analysis:
- An outcome is the result of a single trial in a probability experiment.
- The sample space is the set of all possible outcomes in a probability experiment.
- An event is some subset of the sample space. For example, an event could be rolling an even number on a six-sided die. This event corresponds to three outcomes, namely rolling a 2, 4, or 6 on the die.
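The terminology above maps naturally onto sets. The following is a minimal sketch, assuming a single roll of a fair six-sided die; the variable names `sample_space` and `even_event` are illustrative choices, not part of the text.

```python
# Sample space: every possible outcome of one roll of a six-sided die
sample_space = {1, 2, 3, 4, 5, 6}

# Event: rolling an even number, which is a subset of the sample space
even_event = {outcome for outcome in sample_space if outcome % 2 == 0}

print(sorted(even_event))          # [2, 4, 6]
print(even_event <= sample_space)  # True: every event is a subset of the sample space
```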
To calculate probabilities, we can use several approaches, including relative frequency probability, which is based on actual data, and theoretical probability, which is based on theoretical conditions.
Relative Frequency Probability
Relative frequency probability is a method of determining the likelihood of an event occurring based on the observed frequency of its occurrence in a given sample or population. A data scientist conducts or observes a procedure and determines the number of times a certain Event $A$ occurs. The probability of Event $A$, denoted as $P(A)$, is then calculated based on data that has been collected from the experiment, as follows:

$$P(A) = \frac{\text{number of times Event } A \text{ occurred}}{\text{number of times the procedure was repeated}}$$
Example 3.14
Problem
A polling organization asks a sample of 400 people if they are in favor of increased funding for local schools; 312 of the respondents indicate they are in favor of increased funding. Calculate the probability that a randomly selected person will be in favor of increased funding for local schools.
Solution
In this poll, a total of 400 people were asked the question, and 312 of them were in favor of increased school funding. The probability that a randomly selected person is in favor of increased funding can then be calculated as follows (notice that Event $A$ in this example corresponds to the event that a person is in favor of the increased funding):

$$P(A) = \frac{312}{400} = 0.78$$
Example 3.15
Problem
A medical patient is told they need knee surgery, and they ask the doctor for an estimate of the probability of success for the surgical procedure. The doctor reviews data from the past two years and determines there were 200 such knee surgeries performed and 188 of them were successful. Based on this past data, the doctor calculates the probability of success for the knee surgery (notice that Event $A$ in this example corresponds to the event that a patient has a successful knee surgery result).
Solution
Using the data collected from the past two years, there were 200 surgeries performed, with 188 successes. The probability can then be calculated as:

$$P(A) = \frac{188}{200} = 0.94$$
The doctor informs the patient that there is a 94% chance of success for the pending knee surgery.
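The two relative frequency calculations in Examples 3.14 and 3.15 can be sketched in a few lines of Python. The function name `relative_frequency` and its argument names are hypothetical, chosen only for illustration.

```python
def relative_frequency(event_count, total_trials):
    """Estimate P(A) as (times A occurred) / (times the procedure was repeated)."""
    if total_trials <= 0:
        raise ValueError("total_trials must be positive")
    if not 0 <= event_count <= total_trials:
        raise ValueError("event_count must be between 0 and total_trials")
    return event_count / total_trials

# Example 3.14: 312 of 400 respondents favor increased school funding
poll_estimate = relative_frequency(312, 400)

# Example 3.15: 188 of 200 knee surgeries were successful
surgery_estimate = relative_frequency(188, 200)

print(poll_estimate)     # 0.78
print(surgery_estimate)  # 0.94
```

Because these estimates come from observed data, a different sample would generally give a somewhat different value.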
Theoretical Probability
Theoretical probability is the method used when the outcomes in a probability experiment are equally likely—that is, under theoretical conditions.
The formula used for theoretical probability is similar to the formula used for relative frequency (empirical) probability. Theoretical probability considers all the possible outcomes for an experiment that are known ahead of time, so past data is not needed in the calculation:

$$P(A) = \frac{\text{number of outcomes in Event } A}{\text{total number of outcomes in the sample space}}$$
For example, the theoretical probability of rolling an even number when rolling a six-sided die is $\frac{3}{6}$ (which is $\frac{1}{2}$, or 0.5). There are 3 outcomes corresponding to rolling an even number, and there are 6 outcomes total in the sample space. Notice this calculation can be done without conducting any experiments since the outcomes are equally likely.
Example 3.16
Problem
A student is working on a multiple-choice question that has 5 possible answers. The student does not have any idea about the correct answer, so the student randomly guesses. What is the probability that the student selects the correct answer?
Solution
Since the student is guessing, each answer choice is equally likely to be selected. There is 1 correct answer out of 5 possible choices. The probability of selecting the correct answer can be calculated as:

$$P(\text{correct answer}) = \frac{1}{5} = 0.2 = 20\%$$
Notice in Example 3.16 that probabilities can be written as fractions, decimals, or percentages.
Also note that any probability must be between 0 and 1 inclusive. An event with a probability of zero will never occur, and an event with a probability of 1 is certain to occur. A probability greater than 1 is not possible, and a negative probability is not possible.
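As a rough sketch of the theoretical probability calculation, the snippet below counts outcomes and keeps the result as an exact fraction. It assumes all outcomes are equally likely. The function name `theoretical_probability` and the answer labels are made up for illustration, and the choice of "c" as the correct answer is arbitrary.

```python
from fractions import Fraction

def theoretical_probability(event, sample_space):
    """P(A) = (outcomes in A) / (outcomes in the sample space), equally likely outcomes assumed."""
    event = set(event)
    sample_space = set(sample_space)
    if not event <= sample_space:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

# Rolling an even number on a six-sided die
p_even = theoretical_probability({2, 4, 6}, range(1, 7))

# Example 3.16: guessing on a question with 5 answer choices (labels are hypothetical)
answers = ["a", "b", "c", "d", "e"]
p_guess = theoretical_probability({"c"}, answers)

print(p_even)                                             # 1/2
print(p_guess, float(p_guess), f"{float(p_guess):.0%}")   # 1/5 0.2 20%
```

The last line prints the same probability as a fraction, a decimal, and a percentage.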
Complement of an Event
The complement of an event is the set of all outcomes in the sample space that are not included in the event. The complement of Event $A$ is usually denoted by $A'$ (A prime). To find the probability of the complement of Event $A$, subtract the probability of Event $A$ from 1:

$$P(A') = 1 - P(A)$$
Example 3.17
Problem
A company estimates that the probability that an employee will provide confidential information to a hacker is 0.1%. Determine the probability that an employee will not provide any confidential information during a hacking attempt.
Solution
Let Event $A$ be the event that the employee will provide confidential information to a hacker, so $P(A) = 0.1\% = 0.001$. Then the complement of Event $A$, denoted $A'$, is the event that an employee will not provide any confidential information during a hacking attempt.

$$P(A') = 1 - P(A) = 1 - 0.001 = 0.999$$
There is a 99.9% probability that an employee will not provide any confidential information during a hacking attempt.
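The complement rule is a one-line calculation. The helper below is only a sketch, and the name `complement` is an illustrative choice. It also checks that its input is a valid probability.

```python
def complement(p):
    """Return P(A') = 1 - P(A) for a probability p between 0 and 1."""
    if not 0 <= p <= 1:
        raise ValueError("a probability must be between 0 and 1 inclusive")
    return 1 - p

# Example 3.17: P(employee provides confidential information) = 0.1% = 0.001
p_leak = 0.001
p_no_leak = complement(p_leak)

print(f"{p_no_leak:.1%}")  # 99.9%
```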
Conditional Probability and Bayes’ Theorem
Data scientists are often interested in determining conditional probabilities, or the occurrence of one event that is conditional or dependent on another event. For example, a medical researcher might be interested to know if an asthma diagnosis for a patient is dependent on the patient’s exposure to air pollutants. In addition, when calculating conditional probabilities, we can sometimes revise a probability estimate based on additional information that is obtained. As we’ll see in the following section, Bayes’ Theorem allows new information to be used to refine a probability estimate.
Conditional Probability
A conditional probability is the probability of an event given that another event has already occurred. The notation for conditional probability is $P(A|B)$, which denotes the probability of Event $A$, given that Event $B$ has occurred. The vertical line between $A$ and $B$ denotes the “given” condition. (In this notation, the vertical line does not denote division.)
For example, we might want to know the probability of a person getting a parking ticket given that a person did not put any money in a parking meter. Or a medical researcher might be interested in the probability of a patient developing heart disease given that the patient is a smoker.
If the occurrence of one event affects the probability of occurrence of another event, we say that the events are dependent. If the probability of occurrence of one event is not affected by the occurrence of another event, the events are independent. The dependence of events has important implications in many fields such as marketing, engineering, psychology, and medicine.
Example 3.18
Problem
Determine if the two events are dependent or independent:
- Rolling a 3 on one roll of a die, rolling a 4 on a second roll of a die
- Obtaining heads on one flip of a coin and obtaining tails on a second flip of a coin
- Selecting five players from a professional basketball team and a selected player’s height being greater than 6 feet
- Selecting an Ace from a deck of 52 cards, returning the card back to the original stack, and then selecting a King
- Selecting an Ace from a deck of 52 cards, not returning the card back to the original stack, and then selecting a King
Solution
- The result of one roll does not affect the result for the next roll, so these events are independent.
- The results of one flip of the coin do not affect the results for any other flip of the coin, so these events are independent.
- Typically, basketball players are tall individuals, and so they are more likely to have heights greater than 6 feet as opposed to the general public, so these events are dependent.
- Selecting an Ace from a deck of 52 cards and then replacing the card restores the deck to its original state, so the probability of selecting a King is not affected by the selection of the Ace. These events are independent.
- Selecting an Ace from a deck of 52 cards and then not replacing the card leaves only 51 cards in the deck. Thus, the probability of selecting a King is affected by the selection of the Ace, so these events are dependent.
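The difference between the last two card scenarios can be made concrete with exact fractions. This sketch assumes a standard 52-card deck containing 4 Kings, and that the first card drawn was an Ace; the constant and variable names are illustrative.

```python
from fractions import Fraction

DECK_SIZE = 52
KINGS = 4

# Ace drawn and returned: the deck is back to 52 cards with 4 Kings
p_king_with_replacement = Fraction(KINGS, DECK_SIZE)

# Ace drawn and not returned: 51 cards remain, still 4 Kings
p_king_without_replacement = Fraction(KINGS, DECK_SIZE - 1)

print(p_king_with_replacement)     # 1/13
print(p_king_without_replacement)  # 4/51
```

With replacement, the probability of a King is the same as on a fresh deck, which is consistent with independence. Without replacement, the probability changes, which is what dependence means here.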
There are several ways to use conditional probabilities in data science applications.
Conditional probability can be defined as follows:

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$

When assessing the conditional probability $P(A|B)$, if the two events are independent, this indicates that Event $A$ is not affected by the occurrence of Event $B$, so we can write that $P(A|B) = P(A)$ for independent events.
If we determine that $P(A|B)$ is not equal to $P(A)$, this indicates that the events are dependent.
$P(A|B) = P(A)$ implies independent events, where $P(A \text{ and } B) = P(A) \cdot P(B)$.
$P(A|B) \neq P(A)$ implies dependent events.
Example 3.19
Problem
Table 3.2 shows the number of nursing degrees and non-nursing degrees at a university for a specific year, and the data is broken out by age groups. Calculate the probability that a randomly chosen graduate obtained a nursing degree, given that the graduate is in the age group of 23 and older.
Age Group | Nursing Degrees | Non-Nursing Degrees | Total |
---|---|---|---|
22 and under | 1036 | 1287 | 2323 |
23 and older | 986 | 932 | 1918 |
Total | 2022 | 2219 | 4241 |
Solution
Since we are given that the group of interest is the graduates in the age group of 23 and older, focus only on the second row in the table.
Looking only at the second row in the table, we are interested in the probability that a randomly chosen graduate obtained a nursing degree. The reduced sample space consists of 1,918 graduates, and 986 of them received nursing degrees. So the probability can be calculated as:

$$P(\text{nursing degree} \mid \text{23 and older}) = \frac{986}{1918} \approx 0.514$$
Another method to analyze this example is to rewrite the conditional probability using the equation for $P(A|B)$, where Event $A$ is the event that a graduate obtained a nursing degree and Event $B$ is the event that a graduate is in the age group of 23 and older, as follows:

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$

We can now use this equation to calculate the probability that a randomly chosen graduate obtained a nursing degree, given that the graduate is in the age group of 23 and older. The probability of $A$ and $B$ is the probability that a graduate received a nursing degree and is also in the age group of 23 and older. From the table, there are 986 graduates who earned a nursing degree and are also in the age group of 23 and older. Since this number of graduates is out of the total sample size of 4,241, we can write the probability of Events $A$ and $B$ as:

$$P(A \text{ and } B) = \frac{986}{4241} \approx 0.2325$$

We can also calculate the probability that a graduate is in the age group of 23 and older. From the table, there are 1,918 graduates in this age group out of the total sample size of 4,241, so we can write the probability for Event $B$ as:

$$P(B) = \frac{1918}{4241} \approx 0.4523$$

Next, we can substitute these probabilities into the formula for $P(A|B)$, as follows:

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{986/4241}{1918/4241} = \frac{986}{1918} \approx 0.514$$
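Both methods from Example 3.19 can be checked against each other in code. The sketch below stores the counts from Table 3.2 in a nested dictionary; the dictionary layout and key names are assumptions made for illustration only.

```python
from fractions import Fraction

# Counts from Table 3.2 (key names are illustrative)
counts = {
    "22 and under": {"nursing": 1036, "non_nursing": 1287},
    "23 and older": {"nursing": 986, "non_nursing": 932},
}
total = sum(sum(row.values()) for row in counts.values())  # all graduates

older = counts["23 and older"]
older_total = sum(older.values())  # graduates aged 23 and older

# Method 1: restrict the sample space to the "23 and older" row
p_reduced = Fraction(older["nursing"], older_total)

# Method 2: P(nursing and 23+) divided by P(23+)
p_joint = Fraction(older["nursing"], total)
p_older = Fraction(older_total, total)
p_formula = p_joint / p_older

print(total)                       # 4241
print(p_reduced == p_formula)      # True
print(round(float(p_formula), 3))  # 0.514
```

Using `Fraction` keeps the arithmetic exact, so the two methods agree exactly rather than only up to rounding.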
Probability of At Least One
The probability of at least one occurrence of an event is often of interest in many data science applications. For example, a doctor might be interested to know the probability that at least one surgery to be performed this week will involve an infection of some type.
The phrase “at least one” implies the condition of one or more successes. From a sample space perspective, one or more successes is the complement of “no successes.” Using the complement rule discussed earlier, we can write the following probability formula:

$$P(\text{at least one}) = 1 - P(\text{none})$$

As an example, we can find the probability of rolling a die 3 times and obtaining at least one four on any of the rolls. This can be calculated by first finding the probability of not observing a four on any of the rolls and then subtracting this probability from 1. The probability of not observing a four on a roll of the die is 5/6, and the rolls are independent. Thus, the probability of rolling a die 3 times and obtaining at least one four on any of the rolls is $1 - \left(\frac{5}{6}\right)^3 = 1 - \frac{125}{216} = \frac{91}{216} \approx 0.421$.
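The die example can be verified two ways: with the complement formula, and by listing every possible sequence of three rolls and counting those that contain a four. This is a small sketch assuming a fair six-sided die; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Complement formula: 1 - P(no four on any of the 3 rolls)
p_no_four_single = Fraction(5, 6)
p_at_least_one_four = 1 - p_no_four_single ** 3

# Brute-force check: enumerate all 6 * 6 * 6 equally likely roll sequences
rolls = list(product(range(1, 7), repeat=3))
favorable = sum(1 for sequence in rolls if 4 in sequence)
p_enumerated = Fraction(favorable, len(rolls))

print(p_at_least_one_four)                  # 91/216
print(p_enumerated == p_at_least_one_four)  # True
```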
Example 3.20
Problem
From past data, hospital administrators determine the probability that a knee surgery will be successful is 0.89.
- During a certain day, the hospital schedules four knee surgeries to be performed. Calculate the probability that all four of these surgeries will be successful.
- Calculate the probability that none of these knee surgeries will be successful.
- Calculate the probability that at least one of the knee surgeries will be successful.
Solution
- For all four surgeries to be successful, we can interpret that as the first surgery will be successful, and the second surgery will be successful, and the third surgery will be successful, and the fourth surgery will be successful. Since the probability of success for one knee surgery does not affect the probability of success for another knee surgery, we can assume these events are independent. Based on this, the probability that all four surgeries will be successful can be calculated using the probability formula for $P(A \text{ and } B)$ for independent events, by multiplying the probabilities together:

$$P(\text{all four successful}) = (0.89)(0.89)(0.89)(0.89) = 0.89^4 \approx 0.627$$
There is about a 63% chance that all four knee surgeries will be successful.
- The probability that a knee surgery will be unsuccessful can be calculated using the complement rule. If the probability of a successful surgery is 0.89, then the probability that the surgery will be unsuccessful is 0.11:

$$P(\text{unsuccessful}) = 1 - 0.89 = 0.11$$

Based on this, the probability that all four surgeries will be unsuccessful can be calculated using the probability formula for $P(A \text{ and } B)$ for independent events, by multiplying the probabilities together:

$$P(\text{none successful}) = (0.11)(0.11)(0.11)(0.11) = 0.11^4 \approx 0.00015$$
Since this is a very small probability, it is very unlikely that none of the surgeries will be successful.
- To calculate the probability that at least one of the knee surgeries will be successful, use the probability formula for “at least one,” which is calculated as the complement of the event “none are successful.”

$$P(\text{at least one successful}) = 1 - P(\text{none successful}) = 1 - 0.11^4 \approx 0.9999$$
This indicates there is a very high probability that at least one of the knee surgeries will be successful.
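The three parts of Example 3.20 can be reproduced with a short script. As in the example, this sketch assumes the four surgeries are independent and that each has the same success probability of 0.89. The variable names are illustrative.

```python
p_success = 0.89   # probability that one knee surgery is successful
n_surgeries = 4    # surgeries scheduled for the day

# Part 1: all four succeed (multiply the probabilities of independent events)
p_all_successful = p_success ** n_surgeries

# Part 2: none succeed (complement rule for one surgery, then multiply)
p_none_successful = (1 - p_success) ** n_surgeries

# Part 3: at least one succeeds (complement of "none succeed")
p_at_least_one = 1 - p_none_successful

print(f"{p_all_successful:.4f}")   # 0.6274
print(f"{p_none_successful:.6f}")  # 0.000146
print(f"{p_at_least_one:.4f}")     # 0.9999
```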
Bayes’ Theorem
Bayes’ Theorem is a statistical technique for revising probability estimates based on new information or evidence, which supports more accurate decision-making in uncertain situations. Bayes’ Theorem is often used to help assess probabilities associated with medical diagnoses, such as the probability that a patient has cancer based on screening test results. This can be important in medical analysis to help assess the impact of a false positive, which is the scenario where the patient does not have the ailment but the screening test gives a false indication that the patient does have the ailment.
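Bayes’ Theorem allows the calculation of the conditional probability $P(A|B)$. There are several forms of Bayes’ Theorem, as shown:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B|A) \cdot P(A) + P(B|A') \cdot P(A')}$$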
Example 3.21
Problem
Assume that a certain type of cancer affects 3% of the population. Call the event that a person has cancer “Event $A$,” so:

$$P(A) = 0.03$$

A patient can undergo a screening test for this type of cancer. Assume the probability of a true positive from the screening test is 75%, which indicates that the probability that a person has a positive test result given that they actually have cancer is 0.75. Also assume the probability of a false positive from the screening test is 15%, which indicates that the probability that a person has a positive test result given that they do not have cancer is 0.15.
A medical researcher is interested in calculating the probability that a patient actually has cancer given that the screening test shows a positive result.
The researcher is interested in calculating $P(A|B)$, where Event $A$ is the event that the person actually has cancer and Event $B$ is the event that the person shows a positive result in the screening test. Use Bayes’ Theorem to calculate this conditional probability.
Solution
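From the example, the following probabilities are known:

$$P(A) = 0.03, \quad P(A') = 1 - 0.03 = 0.97, \quad P(B|A) = 0.75, \quad P(B|A') = 0.15$$

The conditional probabilities can be interpreted as follows:
- $P(B|A) = 0.75$ is the probability of a positive test result given that the person has cancer (a true positive).
- $P(B|A') = 0.15$ is the probability of a positive test result given that the person does not have cancer (a false positive).
Substituting these probabilities into the formula for Bayes’ Theorem results in the following:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B|A) \cdot P(A) + P(B|A') \cdot P(A')} = \frac{(0.75)(0.03)}{(0.75)(0.03) + (0.15)(0.97)} = \frac{0.0225}{0.168} \approx 0.134$$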
This result from Bayes’ Theorem indicates that even if a patient receives a positive test result from the screening test, this does not imply a high likelihood that the patient has cancer. There is only a 13% chance that the patient has cancer given a positive test result from the screening test.
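The calculation in Example 3.21 can be sketched as a small function. The name `bayes_posterior` and its parameter names are hypothetical. The function uses the expanded form of Bayes’ Theorem, in which the probability of a positive test is built from the true positive and false positive rates.

```python
def bayes_posterior(prior, p_positive_given_condition, p_positive_given_no_condition):
    """P(condition | positive test) from the prior, true positive rate, and false positive rate."""
    # Numerator: P(positive | condition) * P(condition)
    numerator = p_positive_given_condition * prior
    # Denominator: total probability of a positive test result
    evidence = numerator + p_positive_given_no_condition * (1 - prior)
    return numerator / evidence

# Example 3.21: 3% prevalence, 75% true positive rate, 15% false positive rate
posterior = bayes_posterior(0.03, 0.75, 0.15)

print(f"{posterior:.4f}")  # 0.1339
```

Trying a larger prior in this sketch, with the same test, shows how strongly the result depends on how common the condition is in the population.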