Learning Outcomes
By the end of this section, you should be able to:
- 3.5.1 Describe fundamental aspects of probability distributions.
- 3.5.2 Apply discrete probability distributions including binomial and Poisson distributions.
- 3.5.3 Apply continuous probability distributions including exponential and normal distributions.
- 3.5.4 Use Python to apply various probability distributions for probability applications.
Probability distributions are used to model various scenarios to help with probability analysis and predictions, and they are used extensively to help formulate probability-based decisions. For example, if a doctor knows that the weights of newborn infants follow a normal (bell-shaped) distribution, the doctor can use this information to help identify potentially underweight newborn infants, which might indicate a medical condition warranting further investigation. Using a normal distribution, the doctor can calculate that only a small percentage of babies have weights below a certain threshold, which might prompt the doctor to further investigate the cause of the low weight. Or a medical researcher might be interested in the probability that a person will have high blood pressure or the probability that a person will have type O blood.
Overview of Probability Distributions
To begin our discussion of probability distributions, some terminology will be helpful:
- Random variable: a variable that assigns a single numerical value to each outcome of an experiment. Typically the letter $x$ is used to denote a random variable. For example, assign the numerical values 1, 2, 3, …, 13 to the cards Ace, 2, 3, …, 10, Jack, Queen, King selected from a standard 52-card deck. Notice we cannot use “Jack” as the value of the random variable since by definition a random variable must be a numerical value.
- Discrete random variable: a random variable is considered discrete if there is a finite or countable number of values that the random variable can take on. (If there are infinitely many values, the number of values is countable if it is possible to count them individually.) Typically, a discrete random variable is the result of a count of some kind. For example, if the random variable $x$ represents the number of cars in a parking lot, then the values that $x$ can take on can only be whole numbers since it would not make sense to have a fractional number of cars in the parking lot.
- Continuous random variable: a random variable is considered continuous if the value of the random variable can take on any value within an interval. Typically, a continuous random variable is the result of a measurement of some kind. For example, if the random variable $x$ represents the weight of a bag of apples, then $x$ can take on any value within an interval, including non-whole numbers of pounds of apples.
To summarize, the difference between discrete and continuous probability distributions has to do with the nature of the random variables they represent. Discrete probability distributions are associated with variables that take on a finite or countably infinite number of distinct values. Continuous probability distributions deal with random variables that can take on any value within a given range or interval. It is important to identify and distinguish between discrete and continuous random variables since different statistical methods are used to analyze each type.
Example 3.22
Problem
A coin is flipped three times. Determine a possible random variable that can be assigned to represent the number of heads observed in this experiment.
Solution
One possible random variable assignment could be to let $x$ count the number of heads observed in each possible outcome in the sample space. When flipping a coin three times, there are eight possible outcomes, and $x$ will be the numerical count corresponding to the number of heads observed for each outcome. Notice that the possible values for the random variable are 0, 1, 2, and 3, as shown in Table 3.3.
Result for Flip #1 | Result for Flip #2 | Result for Flip #3 | Value of Random Variable |
---|---|---|---|
Heads | Heads | Heads | 3 |
Heads | Heads | Tails | 2 |
Heads | Tails | Heads | 2 |
Heads | Tails | Tails | 1 |
Tails | Heads | Heads | 2 |
Tails | Heads | Tails | 1 |
Tails | Tails | Heads | 1 |
Tails | Tails | Tails | 0 |
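The table above can also be generated programmatically. The following is a minimal sketch, assuming the outcome labels "Heads" and "Tails" and the variable names shown (none of which come from the original example); it enumerates the sample space and assigns the value of the random variable to each outcome.

```python
from itertools import product

# Enumerate every outcome of three coin flips: 2 * 2 * 2 = 8 outcomes.
outcomes = list(product(["Heads", "Tails"], repeat=3))

# The random variable assigns to each outcome its number of heads.
x_values = [outcome.count("Heads") for outcome in outcomes]

# Print one row per outcome, followed by the value of the random variable.
for outcome, x in zip(outcomes, x_values):
    print(" | ".join(outcome), "|", x)
```

Each printed row corresponds to one row of the table, and the distinct values in `x_values` are 0, 1, 2, and 3.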
Example 3.23
Problem
Identify the following random variables as either discrete or continuous random variables:
- The amount of gas, in gallons, used to fill a gas tank
- Number of children per household in a certain neighborhood
- Number of text messages sent by a certain student during a particular day
- Number of hurricanes affecting Florida in a given year
- The amount of rain, in inches, in Detroit, Michigan, in a certain month
Solution
- The number of gallons of gas used to fill a gas tank can take on any value, such as 12.3489, so this represents a continuous random variable.
- The number of children per household in a certain neighborhood can only take on certain discrete values such as 0, 1, 2, 3, etc., so this represents a discrete random variable.
- The number of text messages sent by a certain student during a particular day can only take on certain discrete values such as 26, 10, 17, etc., so this represents a discrete random variable.
- The number of hurricanes affecting Florida in a given year can only take on certain values such as 0, 1, 2, 3, etc., so this represents a discrete random variable.
- The number of inches of rain in Detroit, Michigan, in a certain month can take on any value, such as 2.0563, so this represents a continuous random variable.
Discrete Probability Distributions: Binomial and Poisson
Discrete random variables are of interest in many data science applications, and there are several probability distributions that apply to discrete random variables. In this chapter, we present the binomial distribution and the Poisson distribution, two distributions commonly used to model discrete random variables for different types of events.
Binomial Distribution
The binomial distribution is used in applications where there are two possible outcomes for each trial in an experiment and the two possible outcomes can be considered as success or failure. For example, when a baseball player is at-bat, the player either gets a hit or does not get a hit. There are many applications of binomial experiments that occur in medicine, psychology, engineering, science, marketing, and other fields.
There are many statistical experiments where the results of each trial can be considered as either a success or a failure. For example, when flipping a coin, the two outcomes are heads or tails. When rolling a die, the two outcomes can be defined as an even number appearing on the face of the die or an odd number appearing on the face of the die. When conducting a marketing study, a customer can be asked if they like or dislike a certain product. Note that the word “success” here does not necessarily imply a good outcome. For example, if a survey was conducted of adults and each adult was asked if they smoke, we can consider the answer “yes” to be a success and the answer “no” to be a failure. This means that the researcher can define success and failure in any way; however, the binomial distribution is applicable only when there are exactly two outcomes in each trial of an experiment.
The requirements to identify a binomial experiment and apply the binomial distribution include:
- The experiment of interest is repeated for a fixed number of trials, and each trial is independent of other trials. For example, a market researcher might select a sample of 20 people to be surveyed where each respondent will reply with a “yes” or “no” answer. This experiment consists of 20 trials, and each person’s response to the survey question can be considered as independent of another person’s response.
- There are only two possible outcomes for each trial, which can be labeled as “success” or “failure.”
- The probability of success remains the same for each trial of the experiment. For example, from past data we know that 35% of people prefer vanilla as their favorite ice cream flavor. If a group of 15 individuals are surveyed to ask if vanilla is their favorite ice cream flavor, the probability of success for each trial will be 0.35.
- The random variable $x$ will count the number of successes in the experiment. Notice that since $x$ will count the number of successes, this implies that $x$ will be a discrete random variable. For example, if the researcher is counting the number of people in the group of 15 that respond to say vanilla is their favorite ice cream flavor, then $x$ can take on values such as 3 or 7 or 12, but $x$ could not equal 5.28 since $x$ is counting the number of people.
When working with a binomial experiment, it is useful to identify two specific parameters:
- The number of trials in the experiment. Label this as $n$.
- The probability of success for each trial (which is a constant value). Label this as $p$.
We then count the number of successes of interest as the value of the discrete random variable. Label this as $x$.
Example 3.24
Problem
A medical researcher is conducting a study related to a certain type of shoulder surgery. A sample of 20 patients who have recently undergone the surgery is selected, and the researcher wants to determine the probability that 18 of the 20 patients had a successful result from the surgery. From past data, the researcher knows that the probability of success for this type of surgery is 92%.
- Does this experiment meet the requirements for a binomial experiment?
- If so, identify the values of $n$, $p$, and $x$ in the experiment.
Solution
- This experiment does meet the requirements for a binomial experiment since the experiment will be repeated for 20 trials, and each response from a patient will be independent of other responses. Each reply from a patient will be one of two responses: the surgery was successful or the surgery was not successful. The probability of success remains the same for each trial at 92%. The random variable $x$ can be used to count the number of patients who respond that the surgery was successful.
- The number of trials is 20 since 20 patients are being surveyed, so $n = 20$.
The probability of success for each surgery is 92%, so $p = 0.92$.
The number of successes of interest is 18 since the researcher wants to determine the probability that 18 of the 20 patients had a successful result from the surgery, so $x = 18$.
When calculating the probability for $x$ successes in a binomial experiment, a binomial probability formula can be used, but in many cases technology is used instead to streamline the calculations.
The probability mass function (PMF) for the binomial distribution describes the probability of getting exactly $x$ successes in $n$ independent Bernoulli trials, each with a probability $p$ of success. The PMF is given by the formula:
$$P(x) = \binom{n}{x}\, p^{x} (1-p)^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x}$$
Where:
$P(x)$ is the probability that the random variable takes on the value of exactly $x$ successes
$n$ is the number of trials in the experiment
$p$ is the probability of success
$x$ is the number of successes in the experiment
$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$ refers to the number of ways to choose $x$ successes from $n$ trials
Note: The notation $n!$ is read as “$n$ factorial” and is a mathematical notation used to express the multiplication of $n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1$. For example, $5! = 5 \times 4 \times 3 \times 2 \times 1 = 120$.
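The binomial probability formula can be translated almost term by term into code. The sketch below uses only Python's standard `math` module; the function name `binomial_pmf` is a hypothetical helper, and the values $n = 15$, $p = 0.35$, and $x = 7$ are borrowed from the ice cream survey illustration earlier in this section.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with probability of success p (hypothetical helper)."""
    # comb(n, x) is the number of ways to choose x successes from n trials.
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Ice cream survey: 15 people, each with a 0.35 chance of preferring vanilla.
n = 15
p = 0.35
x = 7
print(comb(n, x))             # number of ways to choose 7 of the 15 people
print(binomial_pmf(x, n, p))  # probability that exactly 7 prefer vanilla
```

Because the PMF covers every possible value of the random variable, summing `binomial_pmf(x, n, p)` over all values of `x` from 0 to `n` should give 1 (up to floating-point rounding).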
Example 3.25
Problem
For the binomial experiment discussed in Example 3.24, calculate the probability that 18 out of the 20 patients will respond to indicate that the surgery was successful. Also, show a graph of the binomial distribution to show the probability distribution for all values of the random variable $x$.
Solution
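In Example 3.24, the parameters of the binomial experiment are:
$$n = 20, \quad p = 0.92, \quad x = 18$$
Substituting these values into the binomial probability formula, the probability for 18 successes can be calculated as follows:
$$P(18) = \frac{20!}{18!\,(20-18)!}\,(0.92)^{18}(1-0.92)^{20-18} = 190\,(0.92)^{18}(0.08)^{2} \approx 0.271$$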
Based on this result, the probability that 18 out of the 20 patients will respond to indicate that the surgery was successful is 0.271, or approximately 27%.
Figure 3.5 illustrates this binomial distribution, where the horizontal axis shows the values of the random variable $x$, and the vertical axis shows the binomial probability for each value of $x$. Note that values of $x$ less than 14 are not shown on the graph since these corresponding probabilities are very close to zero.
Since these computations tend to be complicated and time-consuming, most data scientists will use technology (such as Python, R, Excel, or others) to calculate binomial probabilities.
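As one illustration of using technology here, the sketch below tabulates the whole distribution for the surgery example with `binom.pmf()` from `scipy.stats`. The variable names are assumptions made for this sketch; it prints the probabilities for $x = 14$ through $x = 20$ and totals the probability of the values below 14.

```python
from scipy.stats import binom

n = 20    # number of patients (trials)
p = 0.92  # probability that a surgery is successful

# Binomial probability for every possible number of successes, 0 through 20.
probabilities = {x: binom.pmf(x, n, p) for x in range(n + 1)}

# Values of x from 14 upward carry almost all of the probability.
for x in range(14, n + 1):
    print(x, round(probabilities[x], 4))

# Combined probability of fewer than 14 successes.
below_14 = sum(prob for x, prob in probabilities.items() if x < 14)
# All probabilities together should total 1.
total = sum(probabilities.values())
print("P(x < 14):", below_14)
print("Total:", total)
```

The small value of `below_14` is consistent with leaving those values of $x$ off the graph.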
Poisson Distribution
The goal of a binomial experiment is to calculate the probability of a certain number of successes in a specific number of trials. However, there are certain scenarios where a data scientist might be interested to know the probability of a certain number of occurrences for a random variable in a specific interval, such as an interval of time.
For example, a website developer might be interested in knowing the probability that a certain number of users visit a website per minute. Or a traffic engineer might be interested in calculating the probability of a certain number of accidents per month at a busy intersection.
The Poisson distribution is applied when counting the number of occurrences in a certain interval. The random variable $x$ then counts the number of occurrences in the interval.
A common application for the Poisson distribution is to model arrivals of customers for a queue, such as when there might be 6 customers per minute arriving at a checkout lane in the grocery store and the store manager wants to ensure that customers are serviced within a certain amount of time.
The Poisson distribution is a discrete probability distribution used in these types of situations, where the interest is in the number of occurrences of an event in a certain interval, such as an interval of time or area.
The Poisson distribution is used where the following conditions are met:
- The experiment is based on counting the number of occurrences in a specific interval where the interval could represent time, area, volume, etc.
- The number of occurrences in one specific interval is independent of the number of occurrences in a different interval.
Notice that a count of the number of occurrences in a specific interval is a discrete random variable. For example, the count of the number of customers that arrive per hour to a queue for a bank teller might be 21 or 15, but the count could not be 13.32 since we are counting the number of customers and hence the random variable will be discrete.
To calculate the probability of $x$ occurrences, the Poisson probability formula can be used, as follows:
$$P(x) = \frac{\mu^{x} e^{-\mu}}{x!}$$
Where:
$\mu$ is the average or mean number of occurrences per interval
$e$ is the constant 2.71828…
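The Poisson probability formula can be sketched directly with Python's standard `math` module. The function name `poisson_pmf` is a hypothetical helper; the mean of 6 customers per minute comes from the checkout lane illustration above, and the count of 4 customers is an assumed value chosen only for this sketch.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences in an interval when the
    mean number of occurrences per interval is mu (hypothetical helper)."""
    return mu**x * exp(-mu) / factorial(x)

# Checkout lane: a mean of 6 customers arrive per minute.
mu = 6
x = 4  # assumed count of interest for this illustration
print(poisson_pmf(x, mu))  # probability of exactly 4 arrivals in one minute
```

Unlike the binomial distribution, there is no fixed number of trials, so $x$ can be any whole number 0, 1, 2, and so on.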
Example 3.26
Problem
From past data, a traffic engineer determines the mean number of vehicles entering a parking garage is 7 per 10-minute period. Calculate the probability that the number of vehicles entering the garage is 9 in a certain 10-minute period. Also, show a graph of the Poisson distribution to show the probability distribution for various values of the random variable $x$.
Solution
This example represents a Poisson distribution in that the random variable is based on the number of vehicles entering a parking garage per time interval (in this example, the time interval of interest is 10 minutes). Since the average is 7 vehicles per 10-minute interval, we label the mean $\mu$ as 7. Since the engineer wants to know the probability that 9 vehicles enter the garage in the same time period, the value of the random variable $x$ is 9.
Thus, in this example, the parameters of the Poisson distribution are:
$$\mu = 7, \quad x = 9$$
Substituting these values into the Poisson probability formula, the probability for 9 vehicles entering the garage in a 10-minute interval can be calculated as follows:
$$P(9) = \frac{7^{9} e^{-7}}{9!} \approx 0.101$$
Thus, there is about a 10% probability of 9 vehicles entering the garage in a 10-minute interval.
Figure 3.6 illustrates this Poisson distribution, where the horizontal axis shows the values of the random variable $x$ and the vertical axis shows the Poisson probability for each value of $x$.
As with calculations involving the binomial distribution, data scientists will typically use technology to solve problems involving the Poisson distribution.
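For instance, the parking garage calculation from Example 3.26 could be carried out with `poisson.pmf()` from `scipy.stats`. This is a minimal sketch; the variable names are assumptions made for the illustration.

```python
from scipy.stats import poisson

mu = 7  # mean number of vehicles per 10-minute period
x = 9   # number of vehicles of interest

# poisson.pmf(x, mu) returns the probability of exactly x occurrences.
probability = poisson.pmf(x, mu)
print(round(probability, 3))  # 0.101
```

The result agrees with the roughly 10% probability found with the formula.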
Normal Continuous Probability Distributions
Recall that a random variable is considered continuous if the value of the random variable can take on any of infinitely many values within an interval. In the earlier example, if the random variable $x$ represents the weight of a bag of apples, then $x$ can take on any value within an interval, including non-whole numbers of pounds of apples.
Many probability distributions apply to continuous random variables. For these distributions, we determine the probability that the random variable falls within a range of values by using a probability density function (PDF): the probability equals the area under the probability density curve over that range. For example, to determine the probability that a salary falls between $50,000 and $70,000, we can calculate the area under the probability density function between these two salaries.
Note that the total area under the probability density function will always equal 1. The probability that a continuous random variable takes on a specific value is 0, so we will always calculate the probability for a random variable falling within some interval of values.
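The idea that probability equals area under the density curve can be checked numerically. The sketch below assumes, purely for illustration, that salaries follow a normal distribution with a mean of $60,000 and a standard deviation of $7,500 (the salary paragraph above does not specify a distribution), and it uses `quad()` from `scipy.integrate` to approximate areas under the curve.

```python
from scipy.integrate import quad
from scipy.stats import norm

# Assumed salary distribution for this illustration only.
mean = 60000
standard_deviation = 7500

def density(salary):
    """Height of the assumed probability density curve at a given salary."""
    return norm.pdf(salary, mean, standard_deviation)

# Probability that a salary falls between $50,000 and $70,000:
# the area under the density curve between those two values.
area_50_to_70, _ = quad(density, 50000, 70000)

# Area over a very wide range (8 standard deviations on each side of the
# mean), which should be essentially the total area of 1.
total_area, _ = quad(
    density, mean - 8 * standard_deviation, mean + 8 * standard_deviation
)

print(area_50_to_70)
print(total_area)
```

Shrinking the interval toward a single salary shrinks the area toward 0, which is why probabilities for continuous random variables are computed over intervals.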
In this section, we will examine an important continuous probability distribution that relies on the probability density function, namely the normal distribution. Many variables, such as heights, weights, salaries, and blood pressure measurements, follow a normal distribution, making it especially important in statistical analysis. In addition, the normal distribution forms the basis for more advanced statistical analysis such as confidence intervals and hypothesis testing, which are discussed in Inferential Statistics and Regression Analysis.
The normal distribution is a continuous probability distribution that is symmetrical and bell-shaped. It is used when the frequency of data values decreases with data values above and below the mean. The normal distribution has applications in many fields including engineering, science, finance, medicine, marketing, and psychology.
The normal distribution has two parameters: the mean, $\mu$, and the standard deviation, $\sigma$. The mean represents the center of the distribution, and the standard deviation measures the spread, or dispersion, of the distribution. The variable $x$ represents the realization, or observed value, of the random variable that follows a normal distribution.
The typical notation used to indicate that a random variable follows a normal distribution is as follows:
$x \sim N(\mu, \sigma)$ (see Figure 3.7). For example, the notation $x \sim N(5.2, 3.7)$ indicates that the random variable $x$ follows a normal distribution with mean of 5.2 and standard deviation of 3.7.
A normal distribution with mean of 0 and standard deviation of 1 is called the standard normal distribution and can be notated as $N(0, 1)$. Any normal distribution can be standardized by converting its values to $z$-scores. Recall that a $z$-score tells you how many standard deviations a given measurement lies from the mean.
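Standardizing a value amounts to the computation $z = (x - \mu)/\sigma$. The sketch below shows this with a hypothetical helper named `z_score`; the mean of 100, standard deviation of 15, and measurement of 130 are assumed values chosen only for illustration.

```python
def z_score(x, mean, standard_deviation):
    """Number of standard deviations that x lies from the mean
    (hypothetical helper)."""
    return (x - mean) / standard_deviation

# Assumed illustration: a measurement of 130 from a distribution
# with mean 100 and standard deviation 15.
print(z_score(130, 100, 15))  # 2.0, two standard deviations above the mean
print(z_score(85, 100, 15))   # -1.0, one standard deviation below the mean
```

A positive $z$-score indicates a measurement above the mean, and a negative $z$-score indicates a measurement below the mean.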
The curve in Figure 3.7 is symmetric on either side of a vertical line drawn through the mean, $\mu$. The mean is the same as the median, which is the same as the mode, because the graph is symmetric about $\mu$. As the notation indicates, the normal distribution depends only on the mean and the standard deviation. Because the area under the curve must equal 1, a change in the standard deviation, $\sigma$, causes a change in the shape of the normal curve; the curve becomes fatter and wider or skinnier and taller depending on $\sigma$. A change in $\mu$ causes the graph to shift to the left or right. This means there are an infinite number of normal probability distributions.
To determine probabilities associated with the normal distribution, we find specific areas under the normal curve. There are several methods for finding this area under the normal curve, and we typically use some form of technology. Python, Excel, and R all provide built-in functions for calculating areas under the normal curve.
Example 3.27
Problem
Suppose that at a software company, the mean employee salary is $60,000 with a standard deviation of $7,500. Assume salaries at this company follow a normal distribution. Use Python to calculate the probability that a random employee earns more than $68,000.
Solution
A normal curve can be drawn to represent this scenario, in which the mean of $60,000 would be plotted on the horizontal axis, corresponding to the peak of the curve. Then, to find the probability that an employee earns more than $68,000, calculate the area under the normal curve to the right of the data value $68,000.
Figure 3.8 illustrates the area under the normal curve to the right of a salary of $68,000 as the shaded-in region.
To find the actual area under the curve, a Python command can be used to find the area under the normal probability density curve to the right of the data value of $68,000. See Using Python with Probability Distributions for the specific Python program and results. The resulting probability is calculated as 0.143.
Thus, there is a probability of about 14% that a random employee has a salary greater than $68,000.
The empirical rule is a method for determining approximate areas under the normal curve for measurements that fall within one, two, and three standard deviations from the mean for the normal (bell-shaped) distribution. (See Figure 3.9).
If $x$ is a continuous random variable and has a normal distribution with mean $\mu$ and standard deviation $\sigma$, then the empirical rule states that:
- About 68% of the $x$-values lie between $-1\sigma$ and $+1\sigma$ units from the mean $\mu$ (within one standard deviation of the mean).
- About 95% of the $x$-values lie between $-2\sigma$ and $+2\sigma$ units from the mean $\mu$ (within two standard deviations of the mean).
- About 99.7% of the $x$-values lie between $-3\sigma$ and $+3\sigma$ units from the mean $\mu$ (within three standard deviations of the mean). Notice that almost all the $x$-values lie within three standard deviations of the mean.
- The $z$-scores for $-1\sigma$ and $+1\sigma$ are $-1$ and $+1$, respectively.
- The $z$-scores for $-2\sigma$ and $+2\sigma$ are $-2$ and $+2$, respectively.
- The $z$-scores for $-3\sigma$ and $+3\sigma$ are $-3$ and $+3$, respectively.
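The percentages in the empirical rule can be checked against the standard normal distribution. The following sketch uses `norm.cdf()` from `scipy.stats`, which returns the area to the left of a value; called with a single argument, it uses a mean of 0 and a standard deviation of 1. The dictionary name `areas` is an assumption made for this sketch.

```python
from scipy.stats import norm

# Area under the standard normal curve between z = -k and z = +k.
areas = {}
for k in (1, 2, 3):
    areas[k] = norm.cdf(k) - norm.cdf(-k)
    print(k, round(areas[k], 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The exact areas are close to, but not exactly, 68%, 95%, and 99.7%, which is why the rule is stated with the word “about.”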
Example 3.28
Problem
An automotive designer is interested in designing automotive seats to accommodate the heights for about 95% of customers. Assume the heights of adults follow a normal distribution with mean of 68 inches and standard deviation of 3 inches. For what range of heights should the designer model the car seats to accommodate 95% of drivers?
Solution
According to the empirical rule, the area under the normal curve within two standard deviations of the mean is 95%. Thus, the designer should design the seats to accommodate heights that are two standard deviations away from the mean. The lower bound of heights would be $68 - 2(3) = 62$ inches, and the upper bound of heights would be $68 + 2(3) = 74$ inches. Thus, the car seats should be designed to accommodate driver heights between 62 and 74 inches.
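The bounds in this example can be reproduced in a few lines. The sketch below also compares the empirical-rule range with the range that captures exactly the middle 95% of heights, using `norm.ppf()` (the inverse of `norm.cdf()`) from `scipy.stats`; that comparison is an addition for illustration and is not part of the original solution.

```python
from scipy.stats import norm

mean = 68               # mean adult height in inches
standard_deviation = 3  # standard deviation in inches

# Empirical rule: two standard deviations on either side of the mean.
lower = mean - 2 * standard_deviation
upper = mean + 2 * standard_deviation
print(lower, upper)  # 62 74

# Area actually covered by that range under the normal curve.
coverage = norm.cdf(upper, mean, standard_deviation) - norm.cdf(
    lower, mean, standard_deviation
)
print(round(coverage, 4))  # 0.9545

# Heights that cut off exactly the middle 95% of the distribution.
exact_lower = norm.ppf(0.025, mean, standard_deviation)
exact_upper = norm.ppf(0.975, mean, standard_deviation)
print(round(exact_lower, 2), round(exact_upper, 2))  # 62.12 73.88
```

The two ranges differ by only about a tenth of an inch, so the empirical rule gives a reasonable approximation for this design question.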
Exploring Further
Statistical Applets to Explore Statistical Concepts
Applets are very useful tools to help visualize statistical concepts in action. Many applets can simulate statistical concepts such as probabilities for the normal distribution, use of the empirical rule, creating box plots, etc.
Visit the Utah State University applet website and experiment with various statistical tools.
Using Python with Probability Distributions
Python provides a number of functions for calculating probabilities associated with both discrete and continuous probability distributions such as the binomial distribution and the normal distribution. These functions are part of a library called scipy.stats, which belongs to the SciPy package rather than to Python's standard library.
Here are a few of the probability distribution functions available in scipy.stats:
- binom(): calculate probabilities associated with the binomial distribution
- poisson(): calculate probabilities associated with the Poisson distribution
- expon(): calculate probabilities associated with the exponential distribution
- norm(): calculate probabilities associated with the normal distribution
To import these probability distribution functions within Python, use the import command. For example, to import the binom() function, use the following command:
from scipy.stats import binom
Using Python with the Binomial Distribution
The binom() function in Python allows calculations of binomial probabilities. The probability mass function for the binomial distribution within Python is referred to as binom.pmf().
The syntax for using this function is binom.pmf(x, n, p)
Where:
n is the number of trials in the experiment
p is the probability of success
x is the number of successes in the experiment
Consider the previous Example 3.24 worked out using the Python binom.pmf() function. A medical researcher is conducting a study related to a certain type of shoulder surgery. A sample of 20 patients who have recently undergone the surgery is selected, and the researcher wants to determine the probability that 18 of the 20 patients had a successful result from the surgery. From past data, the researcher knows that the probability of success for this type of surgery is 92%. Round your answer to 3 decimal places.
In this example:
n = 20 is the number of trials in the experiment
p = 0.92 is the probability of success
x = 18 is the number of successes in the experiment
The corresponding function in Python is written as:
binom.pmf(18, 20, 0.92)
The round() function is then used to round the probability result to 3 decimal places.
Here is the input and output of this Python program:
Python Code
# import the binom function from the scipy.stats library
from scipy.stats import binom
# define parameters x, n, and p:
x = 18
n = 20
p = 0.92
# use binom.pmf() function to calculate binomial probability
# use round() function to round answer to 3 decimal places
print(round(binom.pmf(x, n, p), 3))
The resulting output will look like this:
0.271
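A related question is the probability of at least 18 successful surgeries rather than exactly 18. This question is not part of the original example, so treat the following as an illustrative sketch: it adds up the `binom.pmf()` values for 18, 19, and 20 successes and compares the total with `binom.sf()`, the survival function in `scipy.stats`, which returns the probability of more than a given number of successes.

```python
from scipy.stats import binom

n = 20    # number of trials
p = 0.92  # probability of success

# Add the probabilities of exactly 18, 19, and 20 successes.
at_least_18 = sum(binom.pmf(x, n, p) for x in range(18, n + 1))

# binom.sf(17, n, p) is the probability of more than 17 successes,
# which is the same event as at least 18 successes.
at_least_18_sf = binom.sf(17, n, p)

print(round(at_least_18, 3))
print(round(at_least_18_sf, 3))
```

Both approaches give the same probability; the survival function avoids writing out the sum when the range of values is large.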
Using Python with the Normal Distribution
The norm() function in Python allows calculations of normal probabilities. The area under the normal probability density function to the left of a specified measurement is given by the cumulative distribution function (CDF), which is referred to as norm.cdf() within Python.
The syntax for using this function is norm.cdf(x, mean, standard_deviation)
Where:
x is the measurement of interest
mean is the mean of the normal distribution
standard_deviation is the standard deviation of the normal distribution
Let’s work out the previous Example 3.27 using the Python norm.cdf() function.
Suppose that at a software company, the mean employee salary is $60,000 with a standard deviation of $7,500. Use Python to calculate the probability that a random employee earns more than $68,000.
In this example:
x = 68000 is the measurement of interest
mean = 60000 is the mean of the normal distribution
standard_deviation = 7500 is the standard deviation of the normal distribution
The corresponding function in Python is written as:
norm.cdf(68000, 60000, 7500)
The round() function is then used to round the probability result to 3 decimal places.
Notice that since this example asks to find the area to the right of a salary of $68,000, we can first find the area to the left using the norm.cdf() function and subtract this area from 1 to then calculate the desired area to the right.
Here is the input and output of the Python program:
Python Code
# import the norm function from the scipy.stats library
from scipy.stats import norm
# define parameters x, mean and standard_deviation:
x = 68000
mean = 60000
standard_deviation = 7500
# use norm.cdf() function to calculate normal probability - note this is
# the area to the left
# subtract this result from 1 to obtain area to the right of the x-value
# use round() function to round answer to 3 decimal places
print(round(1 - norm.cdf(x, mean, standard_deviation), 3))
The resulting output will look like this:
0.143
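As an alternative to subtracting from 1, `scipy.stats` also provides `norm.sf()`, the survival function, which returns the area to the right of a measurement directly. The sketch below repeats the salary calculation both ways; the variable names `right_area` and `right_area_sf` are assumptions made for this illustration.

```python
from scipy.stats import norm

x = 68000
mean = 60000
standard_deviation = 7500

# Area to the right, computed as 1 minus the area to the left.
right_area = 1 - norm.cdf(x, mean, standard_deviation)

# Area to the right, computed directly with the survival function.
right_area_sf = norm.sf(x, mean, standard_deviation)

print(round(right_area, 3))     # 0.143
print(round(right_area_sf, 3))  # 0.143
```

The two results agree; the survival function can be slightly more accurate when the area to the right is extremely small.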