Principles of Data Science

3.4 Probability Theory

Learning Outcomes

By the end of this section, you should be able to:

  • 3.4.1 Describe the basic concepts of probability and apply these concepts to real-world applications in data science.
  • 3.4.2 Apply conditional probability and Bayes’ Theorem.

Probability is a numerical measure that assesses the likelihood of occurrence of an event. Probability applications are ubiquitous in data science since many decisions in business, science, and engineering are based on probability considerations. We all use probability calculations every day as we decide, for instance, whether to take an umbrella to work, the optimal route for a morning commute, or the choice of a college major.

Basic Concepts of Probability

We have all used probability in one way or another on a day-to-day basis. Before leaving the house, you might want to know the probability of rain. The probability of obtaining heads on one flip of a coin is one-half, or 0.5.

A data scientist is interested in expressing probability as a number between 0 and 1 (inclusive), where 0 indicates impossibility (the event will not occur) and 1 indicates certainty (the event will occur). A probability strictly between 0 and 1 reflects the degree of uncertainty associated with the event.

Here is some terminology we will be using in probability-related analysis:

  • An outcome is the result of a single trial in a probability experiment.
  • The sample space is the set of all possible outcomes in a probability experiment.
  • An event is some subset of the sample space. For example, an event could be rolling an even number on a six-sided die. This event corresponds to three outcomes, namely rolling a 2, 4, or 6 on the die.

To calculate probabilities, we can use several approaches, including relative frequency probability, which is based on actual data, and theoretical probability, which is based on theoretical conditions.

Relative Frequency Probability

Relative frequency probability is a method of determining the likelihood of an event occurring based on the observed frequency of its occurrence in a given sample or population. A data scientist conducts or observes a procedure and determines the number of times a certain Event A occurs. The probability of Event A, denoted as P(A), is then calculated based on data that has been collected from the experiment, as follows:

$$P(A) = \frac{\text{number of times Event } A \text{ has occurred}}{\text{number of times the procedure was repeated}}$$

Example 3.14

Problem

A polling organization asks a sample of 400 people if they are in favor of increased funding for local schools; 312 of the respondents indicate they are in favor of increased funding. Calculate the probability that a randomly selected person will be in favor of increased funding for local schools.
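
One way to work Example 3.14 is to apply the relative frequency formula directly to the poll counts. The Python sketch below is illustrative only; the function name `relative_frequency` is a hypothetical helper introduced here, not something defined by the text.

```python
def relative_frequency(event_count, total_trials):
    """Estimate P(A) as (times Event A occurred) / (times the procedure was repeated)."""
    if total_trials <= 0:
        raise ValueError("total_trials must be positive")
    return event_count / total_trials

# Example 3.14: 312 of 400 respondents favor increased funding.
p_favor = relative_frequency(312, 400)
print(f"P(in favor) = {p_favor:.2f}")  # prints P(in favor) = 0.78
```

Under this estimate, the probability that a randomly selected person favors increased funding is 0.78, or 78%.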

Example 3.15

Problem

A medical patient is told they need knee surgery, and they ask the doctor for an estimate of the probability of success for the surgical procedure. The doctor reviews data from the past two years and determines there were 200 such knee surgeries performed and 188 of them were successful. Based on this past data, the doctor calculates the probability of success for the knee surgery (notice that Event A in this example corresponds to the event that a patient has a successful knee surgery result).
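
Applying the relative frequency formula to the doctor's data gives the following worked calculation, which uses only the counts stated in the problem:

```latex
P(A) = \frac{188}{200} = 0.94
```

Based on this data, the estimated probability of a successful knee surgery is 0.94, or 94%.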

Theoretical Probability

Theoretical probability is the method used when the outcomes in a probability experiment are equally likely—that is, under theoretical conditions.

The formula used for theoretical probability is similar to the formula used for relative frequency (empirical) probability. Theoretical probability considers all the possible outcomes for an experiment that are known ahead of time, so past data is not needed in the calculation.

$$\text{Theoretical Probability} = \frac{\text{number of outcomes for the event of interest}}{\text{total number of outcomes in the sample space}}$$

For example, the theoretical probability of rolling an even number when rolling a six-sided die is 3/6 (which is 1/2, or 0.5). There are 3 outcomes corresponding to rolling an even number, and there are 6 outcomes total in the sample space. Notice this calculation can be done without conducting any experiments since the outcomes are equally likely.
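
The die calculation can be sketched in Python by representing the sample space and the event as sets and counting outcomes. This is a minimal illustration that assumes a fair die; the variable names are chosen here for readability and do not come from the text.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# Event: rolling an even number (a subset of the sample space).
even_event = {outcome for outcome in sample_space if outcome % 2 == 0}

# Theoretical probability = outcomes in the event / outcomes in the sample space.
# This is valid only because all six outcomes are equally likely.
p_even = Fraction(len(even_event), len(sample_space))

print(sorted(even_event))  # [2, 4, 6]
print(p_even)              # 1/2
print(float(p_even))       # 0.5
```

Using `Fraction` keeps the result exact, which makes it easy to see that 3/6 reduces to 1/2.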

Example 3.16

Problem

A student is working on a multiple-choice question that has 5 possible answers. The student does not have any idea about the correct answer, so the student randomly guesses. What is the probability that the student selects the correct answer?
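
A worked calculation for this problem, assuming that exactly one of the 5 answers is correct and that each answer is equally likely to be chosen by a random guess:

```latex
P(\text{correct answer}) = \frac{1}{5} = 0.2 = 20\%
```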

Notice in Example 3.16 that probabilities can be written as fractions, decimals, or percentages.

Also note that any probability must be between 0 and 1 inclusive. An event with a probability of zero will never occur, and an event with a probability of 1 is certain to occur. A probability greater than 1 is not possible, and a negative probability is not possible.

Complement of an Event

The complement of an event is the set of all outcomes in the sample space that are not included in the event. The complement of Event A is usually denoted by A' (A prime). To find the probability of the complement of Event A, subtract the probability of Event A from 1.

$$P(A') = 1 - P(A)$$

Example 3.17

Problem

A company estimates that the probability that an employee will provide confidential information to a hacker is 0.1%. Determine the probability that an employee will not provide any confidential information during a hacking attempt.
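
A worked calculation using the complement rule, where Event A is taken to be the event that an employee provides confidential information and 0.1% is written as the decimal 0.001:

```latex
P(A') = 1 - P(A) = 1 - 0.001 = 0.999
```

So the probability that an employee will not provide confidential information is 0.999, or 99.9%.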

Conditional Probability and Bayes’ Theorem

Data scientists are often interested in determining conditional probabilities, that is, the probability that one event occurs given that another event has occurred. For example, a medical researcher might be interested to know if an asthma diagnosis for a patient is dependent on the patient’s exposure to air pollutants. In addition, when calculating conditional probabilities, we can sometimes revise a probability estimate based on additional information that is obtained. As we’ll see in the following section, Bayes’ Theorem allows new information to be used to refine a probability estimate.

Conditional Probability

A conditional probability is the probability of an event given that another event has already occurred. The notation for conditional probability is P(A|B), which denotes the probability of Event A, given that Event B has occurred. The vertical line between A and B denotes the “given” condition. (In this notation, the vertical line does not denote division.)

For example, we might want to know the probability of a person getting a parking ticket given that a person did not put any money in a parking meter. Or a medical researcher might be interested in the probability of a patient developing heart disease given that the patient is a smoker.

If the occurrence of one event affects the probability of occurrence of another event, we say that the events are dependent; otherwise, the events are independent. In other words, for independent events, the probability of occurrence of one event is not affected by the occurrence of the other event. The dependence of events has important implications in many fields such as marketing, engineering, psychology, and medicine.

Example 3.18

Problem

Determine if the two events are dependent or independent:

  1. Rolling a 3 on one roll of a die, rolling a 4 on a second roll of a die
  2. Obtaining heads on one flip of a coin and obtaining tails on a second flip of a coin
  3. Selecting five basketball players from a professional basketball team and a player’s height is greater than 6 feet
  4. Selecting an Ace from a deck of 52 cards, returning the card back to the original stack, and then selecting a King
  5. Selecting an Ace from a deck of 52 cards, not returning the card back to the original stack, and then selecting a King
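
Items 4 and 5 can be checked numerically by comparing the probability of selecting a King on the second draw with and without the first card being returned. The sketch below assumes a standard 52-card deck containing 4 Kings; the constant and variable names are illustrative.

```python
from fractions import Fraction

DECK_SIZE = 52
KINGS = 4

# Probability that a draw is a King when nothing is known about other draws.
p_king = Fraction(KINGS, DECK_SIZE)

# Item 4: the Ace is returned, so the second draw is again from 52 cards.
p_king_after_ace_with_replacement = Fraction(KINGS, DECK_SIZE)

# Item 5: the Ace is not returned, so only 51 cards remain (all 4 Kings still present).
p_king_after_ace_without_replacement = Fraction(KINGS, DECK_SIZE - 1)

print(p_king)                                # 1/13
print(p_king_after_ace_with_replacement)     # 1/13
print(p_king_after_ace_without_replacement)  # 4/51
```

With replacement, drawing the Ace leaves the probability of a King unchanged, so the events in item 4 are independent. Without replacement, the probability changes from 4/52 to 4/51, so the events in item 5 are dependent.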

There are several ways to use conditional probabilities in data science applications.

Conditional probability can be defined as follows:

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}, \text{ where } P(B) \neq 0$$

When assessing the conditional probability P(A|B), if the two events are independent, this indicates that Event A is not affected by the occurrence of Event B, so we can write that P(A|B) = P(A) for independent events.

If we determine that P(A|B) is not equal to P(A), this indicates that the events are dependent.

$P(A|B) = P(A)$ implies independent events, where $P(B) \neq 0$.

$P(A|B) \neq P(A)$ implies dependent events.
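
The independence test can be sketched in code by computing P(A|B) from the definition and comparing it with P(A). The example below is an illustration constructed here, not one from the text: it uses one roll of a fair die, and the helper names `prob` and `conditional_prob` are hypothetical.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die, equally likely outcomes

def prob(event):
    """Theoretical probability of an event (a subset of sample_space)."""
    return Fraction(len(event & sample_space), len(sample_space))

def conditional_prob(a, b):
    """P(A|B) = P(A and B) / P(B), defined only when P(B) != 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be nonzero")
    return prob(a & b) / prob(b)

even = {2, 4, 6}
at_least_five = {5, 6}
at_least_four = {4, 5, 6}

print(prob(even))                             # 1/2
print(conditional_prob(even, at_least_five))  # 1/2 -> equals P(A): independent
print(conditional_prob(even, at_least_four))  # 2/3 -> differs from P(A): dependent
```

Knowing the roll is at least 5 does not change the probability of an even number, while knowing it is at least 4 does.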

Example 3.19

Problem

Table 3.2 shows the number of nursing degrees and non-nursing degrees at a university for a specific year, and the data is broken out by age groups. Calculate the probability that a randomly chosen graduate obtained a nursing degree, given that the graduate is in the age group of 23 and older.

Age Group       Nursing Degrees   Non-Nursing Degrees   Total
22 and under    1036              1287                  2323
23 and older    986               932                   1918
Total           2022              2219                  4241
Table 3.2 Number of Nursing and Non-Nursing Degrees at a University by Age Group
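
A short Python sketch of the calculation for Example 3.19, using the counts from Table 3.2. Because both P(nursing and 23 and older) and P(23 and older) share the same denominator (4241 graduates), the conditional probability reduces to a ratio of counts within the "23 and older" row. The variable names are illustrative.

```python
# Counts from Table 3.2.
nursing_23_and_older = 986
total_23_and_older = 1918
total_nursing = 2022
total_graduates = 4241

# P(nursing | 23 and older) = P(nursing and 23+) / P(23+).
# The common denominator (4241) cancels, leaving a ratio of counts.
p_nursing_given_23_plus = nursing_23_and_older / total_23_and_older

# Unconditional probability of a nursing degree, for comparison.
p_nursing = total_nursing / total_graduates

print(f"{p_nursing_given_23_plus:.3f}")  # 0.514
print(f"{p_nursing:.3f}")                # 0.477
```

The conditional probability (about 0.514) differs from the unconditional probability of a nursing degree (about 0.477), which suggests that degree type and age group are dependent in this data.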

Probability of At Least One

The probability of at least one occurrence of an event is often of interest in many data science applications. For example, a doctor might be interested to know the probability that at least one surgery to be performed this week will involve an infection of some type.

The phrase “at least one” implies the condition of one or more successes. From a sample space perspective, one or more successes is the complement of “no successes.” Using the complement rule discussed earlier, we can write the following probability formula:

$$P(\text{at least one success}) = 1 - P(\text{no successes})$$

As an example, we can find the probability of rolling a die 3 times and obtaining at least one four on any of the rolls. This can be calculated by first finding the probability of not observing a four on any of the rolls and then subtracting this probability from 1. The probability of not observing a four on a single roll of the die is 5/6. Since the rolls are independent, the probability of not observing a four on all 3 rolls is $(5/6)^3$. Thus, the probability of rolling a die 3 times and obtaining at least one four on any of the rolls is $1 - (5/6)^3 \approx 0.421$.
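
The die example can be verified in Python in two ways: by the complement rule, and by listing every possible sequence of three rolls and counting those that contain a four. This is a sketch assuming a fair die and independent rolls.

```python
from itertools import product

# Complement rule: P(at least one four) = 1 - P(no fours in 3 rolls).
p_no_four_single_roll = 5 / 6
p_at_least_one_four = 1 - p_no_four_single_roll ** 3
print(round(p_at_least_one_four, 3))  # 0.421

# Cross-check by listing all 6**3 = 216 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=3))
favorable = sum(1 for rolls in outcomes if 4 in rolls)
print(favorable, len(outcomes))       # 91 216
```

Both approaches agree, since 91/216 is approximately 0.421.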

Example 3.20

Problem

From past data, hospital administrators determine the probability that a knee surgery will be successful is 0.89.

  1. During a certain day, the hospital schedules four knee surgeries to be performed. Calculate the probability that all four of these surgeries will be successful.
  2. Calculate the probability that none of these knee surgeries will be successful.
  3. Calculate the probability that at least one of the knee surgeries will be successful.
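
The three parts of Example 3.20 can be sketched as follows. The calculation assumes that the outcomes of the four surgeries are independent of one another, which the problem does not state explicitly; the variable names are illustrative.

```python
p_success = 0.89
n_surgeries = 4

# Assumes outcomes of the four surgeries are independent of each other.
p_all_succeed = p_success ** n_surgeries
p_none_succeed = (1 - p_success) ** n_surgeries
p_at_least_one = 1 - p_none_succeed  # complement of "no successes"

print(f"{p_all_succeed:.4f}")   # 0.6274
print(f"{p_none_succeed:.6f}")  # 0.000146
print(f"{p_at_least_one:.6f}")  # 0.999854
```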

Bayes’ Theorem

Bayes’ Theorem is a statistical technique for revising probability estimates based on new information or evidence, which supports more accurate decision-making in uncertain situations. Bayes’ Theorem is often used to help assess probabilities associated with medical diagnoses, such as the probability that a patient has cancer given the results of a screening test. This can be important in medical analysis to help assess the impact of a false positive, which is the scenario where the patient does not have the ailment but the screening test gives a false indication that the patient does have the ailment.

Bayes’ Theorem allows the calculation of the conditional probability P(A|B). There are several forms of Bayes’ Theorem, as shown:

$$P(A|B) = \frac{P(A) \cdot P(B|A)}{P(B)}$$

$$P(A|B) = \frac{P(A) \cdot P(B|A)}{P(A) \cdot P(B|A) + P(A') \cdot P(B|A')}$$
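
The second form follows from the first by expanding the denominator P(B). Event B occurs either together with A or together with the complement A', and each joint probability can be rewritten using the definition of conditional probability. A sketch of that step:

```latex
\begin{aligned}
P(B) &= P(A \text{ and } B) + P(A' \text{ and } B) \\
     &= P(A) \cdot P(B|A) + P(A') \cdot P(B|A')
\end{aligned}
```

The expanded form is convenient when P(B) is not known directly but the conditional probabilities P(B|A) and P(B|A') are.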

Example 3.21

Problem

Assume that a certain type of cancer affects 3% of the population. Call the event that a person has cancer “Event A,” so:

$$P(A) = 0.03$$

A patient can undergo a screening test for this type of cancer. Assume the probability of a true positive from the screening test is 75%, which indicates that the probability that a person has a positive test result given that they actually have cancer is 0.75. Also assume the probability of a false positive from the screening test is 15%, which indicates that the probability that a person has a positive test result given that they do not have cancer is 0.15.

A medical researcher is interested in calculating the probability that a patient actually has cancer given that the screening test shows a positive result.

The researcher is interested in calculating P(A|B), where Event A is the event that the person actually has cancer and Event B is the event that the person shows a positive result in the screening test. Use Bayes’ Theorem to calculate this conditional probability.
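
A minimal Python sketch of this calculation, using the expanded form of Bayes’ Theorem with P(A') = 1 - P(A). The function name `bayes_posterior` and its parameter names are hypothetical and introduced only for illustration.

```python
def bayes_posterior(prior, p_pos_given_a, p_pos_given_not_a):
    """Return P(A|B) using the expanded form of Bayes' Theorem."""
    numerator = prior * p_pos_given_a
    denominator = numerator + (1 - prior) * p_pos_given_not_a
    return numerator / denominator

p_cancer = 0.03          # P(A)
p_true_positive = 0.75   # P(B|A)
p_false_positive = 0.15  # P(B|A')

p_cancer_given_positive = bayes_posterior(p_cancer, p_true_positive, p_false_positive)
print(f"{p_cancer_given_positive:.3f}")  # 0.134
```

Under these assumptions, a positive screening result corresponds to only about a 13.4% probability of actually having cancer. The value is low because the condition is rare (3%), so false positives among the 97% without cancer outnumber true positives.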

Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
Citation information

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.