Learning Outcomes
By the end of this section, you should be able to:
- 9.3.1 Create graphs to visualize the shape of various types of probability distributions.
- 9.3.2 Interpret probabilities as areas under probability distributions.
- 9.3.3 Use Python to generate various data visualizations for probability distributions.
We introduced probability distributions in Discrete and Continuous Probability Distributions. Recall that the binomial distribution and the Poisson distribution are examples of discrete probability distributions, and the normal distribution is an example of a continuous probability distribution. In addition, a discrete random variable is a random variable that can take on only a finite or countably infinite number of values. A continuous random variable is a random variable that can take on any value within an interval, so its possible values are uncountably infinite.
Very often, a data scientist or researcher is interested in graphing these probability distributions to visualize the shape of the distribution and gain insight into its behavior for various values of the random variable. For continuous probability distributions, the area under the probability density function (PDF) over an interval equals the probability that the random variable falls in that interval. Determining probabilities associated with a probability distribution allows data scientists to estimate the likelihood of achieving certain outcomes. These probabilities also form the basis for more advanced statistical analysis such as confidence interval determination and hypothesis testing.
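The link between areas and probabilities can be checked numerically. The following is a minimal sketch, assuming a standard normal distribution and the interval from 0 to 1.5 (both chosen only for illustration): it computes the probability once from the cumulative distribution function and once by numerically integrating the PDF with scipy.integrate.quad.

```python
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative assumptions: standard normal distribution, interval 0 to 1.5
mu, sigma = 0, 1
a, b = 0, 1.5

# Probability that X falls between a and b, from the CDF
prob_from_cdf = norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma)

# The same probability as the numerically integrated area under the PDF
area_under_pdf, _ = quad(norm.pdf, a, b, args=(mu, sigma))

print(f"P({a} < X < {b}) from the CDF: {prob_from_cdf:.4f}")
print(f"Area under the PDF:          {area_under_pdf:.4f}")
```

The two printed values should agree to within numerical integration error.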
Graphing the Binomial Distribution
The binomial distribution is used in applications where there are two possible outcomes for each trial in an experiment and the two possible outcomes can be considered as success or failure.
The necessary requirements to identify a binomial experiment and apply the binomial distribution include the following:
- The experiment of interest is repeated for a fixed number of trials, and each trial is independent of other trials.
- There are only two possible outcomes for each trial, which can be labeled as “success” or “failure.”
- The probability of success remains the same for each trial of the experiment.
- The random variable x counts the number of successes in the experiment. Notice that since x counts the number of successes, this implies that x is a discrete random variable.
When working with a binomial experiment, it is useful to identify two parameters:
- The number of trials in the experiment. Label this as n.
- The probability of success for each trial (which is a constant value). Label this as p.
We then count the number of successes of interest as the value of the discrete random variable. Label this as x.
Discrete and Continuous Probability Distributions introduced the binomial probability formula used to calculate binomial probabilities. However, most data scientists and researchers use technology such as Python to calculate probabilities associated with both discrete and continuous probability distributions such as the binomial distribution and the normal distribution.
The SciPy package for Python provides a number of functions for calculating these probabilities as part of its scipy.stats library.
The function binom.pmf() is the probability mass function (PMF) used to calculate probabilities associated with the binomial distribution. The syntax for using this function is
binom.pmf(x, n, p)
Where:
n is the number of trials in the experiment,
p is the probability of success,
x is the number of successes in the experiment.
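As a quick check of this syntax, the sketch below evaluates binom.pmf for a single case and compares it with the binomial probability formula, C(n, x) · p^x · (1 − p)^(n − x). The values n = 20, p = 0.1, and x = 3 are assumptions chosen only for illustration.

```python
from math import comb

from scipy.stats import binom

# Illustrative assumptions: 20 trials, success probability 0.1, exactly 3 successes
n, p, x = 20, 0.1, 3

# Probability of exactly x successes from scipy
prob_scipy = binom.pmf(x, n, p)

# The same probability from the binomial formula: C(n, x) * p^x * (1 - p)^(n - x)
prob_formula = comb(n, x) * p**x * (1 - p)**(n - x)

print(f"binom.pmf: {prob_scipy:.4f}")
print(f"formula:   {prob_formula:.4f}")
```

Note the argument order: the number of successes comes first, followed by the number of trials and the probability of success.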
Data scientists are interested in graphing discrete distributions such as the binomial distribution to help to predict the probability of achieving a certain number of successes in a given number of trials with a specified probability of success. This also allows visualization of the shape of the binomial distribution as parameters such as n and p are varied. Changes to the value of the probability of success can affect the shape of the distribution, as shown in Example 9.7.
Example 9.7
Problem
Use the Python function binom to graph the binomial distribution for n = 20 and three different values of p, namely p = 0.1, p = 0.5, and p = 0.9, and comment on the resulting shapes of the distributions.
Solution
To create a graph of the binomial distribution, we will import the binom function from the scipy.stats library. We will then use the bar function to create a bar chart of the binomial probabilities.
Here is the Python code and graph for n = 20 and p = 0.1.
Note: To generate the plots for p = 0.5 and p = 0.9, change the one line of Python code, p = 0.1, as needed.
Python Code
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom
# Define parameters for the binomial distribution where:
# n is number of trials
# p is the probability of success
n = 20
p = 0.1
# Generate values for x, which is the number of successes
# Note that x can take on the value of zero since there can be
# zero successes in a typical binomial experiment
x = np.arange(0, n+1)
# Calculate the probability mass function (PMF) for the binomial distribution
probabilities = binom.pmf(x, n, p)
# Plot the binomial distribution
plt.bar(x, probabilities)
# Add labels and title
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.title(f'Binomial Distribution ($n = {n}$, $p = {p}$)')
# Show the plot
plt.show()
The resulting output will look like this:
Here is the graph for n = 20 and p = 0.5.
Here is the graph for n = 20 and p = 0.9.
Notice the impact of p on the shape of the graph: for p = 0.1, the graph is right skewed; for p = 0.5, the graph resembles a bell-shaped distribution; and for p = 0.9, the graph is left skewed.
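The visual impression of skew can be backed up with a number. The sketch below, which assumes n = 20 and three illustrative values of p, asks scipy for the skewness of each binomial distribution through the stats method. A positive value indicates right skew, a value near zero indicates symmetry, and a negative value indicates left skew.

```python
from scipy.stats import binom

n = 20  # number of trials

# Illustrative values of p (an assumption for this sketch)
skewness = {}
for p in (0.1, 0.5, 0.9):
    # moments='s' requests only the skewness of the distribution
    skewness[p] = float(binom.stats(n, p, moments='s'))
    print(f"p = {p}: skewness = {skewness[p]:+.3f}")
```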
Graphing the Poisson Distribution
The Poisson distribution is used when the interest is in the number of occurrences of an event in a certain interval, such as an interval of time or area. The random variable then counts the number of occurrences in that interval.
Recall from Discrete and Continuous Probability Distributions that the Poisson distribution is used where the following conditions are met:
- The experiment is based on counting the number of occurrences in a specific interval where the interval could represent time, area, volume, etc.
- The number of occurrences in one specific interval is independent of the number of occurrences in a different interval.
Notice that because x counts the number of occurrences in a specific interval, x is a discrete random variable.
Similar to the binom.pmf() function within the scipy.stats library, Python also provides the poisson.pmf() function to calculate probabilities associated with the Poisson distribution.
The syntax for using this function is:
poisson.pmf(x, mu)
Where:
mu is the mean of the Poisson distribution (mu refers to the Greek letter μ).
x is the number of occurrences in the specific interval of interest.
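To see what poisson.pmf returns, the sketch below compares it with the standard Poisson probability formula, e^(−μ) · μ^x / x!. The values mu = 3 and x = 2 are assumptions chosen only for illustration.

```python
from math import exp, factorial

from scipy.stats import poisson

# Illustrative assumptions: mean of 3 occurrences, exactly 2 occurrences observed
mu, x = 3, 2

# Probability of exactly x occurrences from scipy
prob_scipy = poisson.pmf(x, mu)

# The same probability from the Poisson formula: e^(-mu) * mu^x / x!
prob_formula = exp(-mu) * mu**x / factorial(x)

print(f"poisson.pmf: {prob_scipy:.4f}")
print(f"formula:     {prob_formula:.4f}")
```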
Example 9.8
Problem
The Port of Miami is home to many cruise lines, and the port is typically heavily congested with arriving cruise ships. Delays are common since there are typically not enough ports to accept the number of arriving ships. A data scientist is interested in analyzing this situation and notices that on average, 3 ships arrive every hour at the port. Assume the arrival times are independent of one another.
Use the Python function poisson to graph the Poisson distribution for a one-hour interval and a mean of 3 ships per hour.
Based on the graph, come to a conclusion regarding the number of ships arriving per hour.
Solution
To create a graph of the Poisson distribution, we will import the poisson function from the scipy.stats library. We will then use the bar function to create a bar chart of the corresponding Poisson probabilities.
Here is the Python code and graph for a one-hour interval and mean (μ) of 3.
Python Code
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import poisson
# Define the mean for the Poisson distribution
# this is the average number of ships arriving per hour
mu = 3
# Generate x values (number of ships arriving per hour)
x = np.arange(0, 12)
# Calculate the probability mass function (PMF) for the Poisson distribution
probabilities = poisson.pmf(x, mu)
# Plot the Poisson distribution
plt.bar(x, probabilities)
# Add labels and title
plt.xlabel('Number of Occurrences')
plt.ylabel('Probability')
plt.title(r'Poisson Distribution ($\mu$ = 3)')
# Show the plot
plt.show()
The resulting output will look like this:
From the graph, the probabilities are low for x = 7 and larger, so there is not much of a chance that 7 or more ships will arrive in a one-hour time period. The administrators for the Port of Miami should plan for 6 or fewer ships arriving per hour to cover the most likely scenarios.
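The conclusion read from the graph can be quantified with cumulative probabilities. The following sketch uses the cdf and sf (survival function) methods of poisson with the same mean of 3 ships per hour. The cutoff of 6 ships comes from the discussion above.

```python
from scipy.stats import poisson

mu = 3  # average number of ships arriving per hour

# P(X <= 6): probability that 6 or fewer ships arrive in an hour
prob_6_or_fewer = poisson.cdf(6, mu)

# P(X >= 7) = P(X > 6): the survival function gives the upper tail
prob_7_or_more = poisson.sf(6, mu)

print(f"P(6 or fewer ships): {prob_6_or_fewer:.4f}")
print(f"P(7 or more ships):  {prob_7_or_more:.4f}")
```

Planning for 6 or fewer ships therefore covers the large majority of hours, while hours with 7 or more arrivals are comparatively rare.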
Graphing the Normal Distribution
The normal distribution discussed in Discrete and Continuous Probability Distributions is one of the more important distributions in data science and statistical analysis. This continuous distribution is especially important in statistical analysis in that many measurements in nature, science, engineering, medicine, and business follow a normal distribution. The normal distribution also forms the basis for more advanced statistical analysis such as confidence intervals and hypothesis testing discussed in Hypothesis Testing.
The normal distribution is bell-shaped and has two parameters: the mean, μ, and the standard deviation, σ. The mean represents the center of the distribution, and the standard deviation measures the spread or dispersion of the distribution. The variable x represents the realization or observed value of the random variable X that follows a normal distribution.
The graph of the normal curve is symmetric about the mean μ. The shape of the graph of the normal distribution depends only on the mean and the standard deviation. Because the area under the curve must equal 1, a change in the standard deviation, σ, causes a change in the shape of the normal curve; the curve becomes fatter and wider or skinnier and taller depending on the standard deviation σ. A change in the mean μ causes the graph to shift to the left or right. To determine probabilities associated with the normal distribution, we find specific areas under the graph of the normal curve within certain intervals of the random variable.
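The trade-off between height and width can be seen numerically. The sketch below assumes a mean of 0 and three illustrative standard deviations. For each one it reports the height of the curve at the mean and the numerically integrated area under the curve, which should stay at 1.

```python
from scipy.integrate import quad
from scipy.stats import norm

mu = 0  # illustrative mean (an assumption for this sketch)

peaks = []
areas = []
for sigma in (0.5, 1, 2):  # illustrative standard deviations
    # Height of the normal curve at its center
    peak = norm.pdf(mu, mu, sigma)
    # Area under the curve, integrated over ten standard deviations on each side
    area, _ = quad(norm.pdf, mu - 10 * sigma, mu + 10 * sigma, args=(mu, sigma))
    peaks.append(peak)
    areas.append(area)
    print(f"sigma = {sigma}: peak height = {peak:.4f}, area = {area:.4f}")
```

As the standard deviation grows, the peak gets lower while the total area is unchanged, which is why the curve must spread out.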
To generate a graph of the normal distribution, we can plot many points that fall on the normal curve according to the formal definition of the probability density function (PDF) of the normal distribution, as follows:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
Of course, this is a job left to Python, and the Python code to generate a graph of the normal distribution is shown in Example 9.9.
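To connect the formula to the library call, the sketch below codes the standard normal density formula directly with numpy and compares it with norm.pdf from scipy.stats. The helper name normal_pdf and the values of mu, sigma, and x are assumptions made only for this illustration.

```python
import numpy as np
from scipy.stats import norm


def normal_pdf(x, mu, sigma):
    """Normal density: exp(-0.5 * ((x - mu) / sigma)^2) / (sigma * sqrt(2 * pi))."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


# Illustrative assumptions: standard normal, seven evenly spaced x-values
mu, sigma = 0, 1
x = np.linspace(-3, 3, 7)

manual = normal_pdf(x, mu, sigma)  # values from the hand-coded formula
library = norm.pdf(x, mu, sigma)   # values from scipy

print("Formula and norm.pdf agree:", np.allclose(manual, library))
```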
Example 9.9
Problem
A medical researcher is investigating blood pressure–lowering medications and wants to create a graph of systolic blood pressures. Assume that blood pressures follow a normal distribution with mean of 120 and standard deviation of 20. Use Python to create a graph of this normal distribution.
Solution
To create a graph of the normal distribution, we will plot a large number of data points (say 1,000 points), and the values of x will range from μ − 3σ to μ + 3σ. Recall from the discussion in Discrete and Continuous Probability Distributions regarding the empirical rule that about 99.7% of the x-values lie between −3σ and +3σ units from the mean (i.e., within three standard deviations of the mean). In Python, we can use the np.linspace function to create evenly spaced points between a lower bound and upper bound. Then we can use the plot function to plot the points corresponding to the normal density function. Note: We need the numpy library for np.linspace; the norm.pdf function from scipy.stats evaluates the normal density function at each point.
Python Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Define mean mu and standard deviation sigma for the normal distribution
mu = 120
sigma = 20
# Generate x-values at which to evaluate the normal density function
# Use np.linspace to generate 1,000 evenly spaced points
# These points extend from mu - 3*sigma to mu + 3*sigma
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 1000)
# Calculate values of the normal density function using the norm.pdf function
y = norm.pdf(x, mu, sigma)
# Plot the normal distribution
plt.plot(x, y)
# Add labels and title
plt.xlabel('Blood Pressure')
plt.ylabel('Probability Density')
plt.title('Normal Distribution for Blood Pressure')
# Show the plot
plt.show()
The resulting output will look like this:
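As a numerical companion to the graph, the following sketch computes two areas under this blood pressure curve with norm.cdf. The first is the share of values within three standard deviations of the mean, which ties back to the empirical rule. The second uses the interval from 100 to 140, an assumption chosen only for illustration.

```python
from scipy.stats import norm

# Parameters from the example: mean 120, standard deviation 20
mu, sigma = 120, 20

# Area between mu - 3*sigma and mu + 3*sigma (empirical rule: about 99.7%)
within_3_sigma = norm.cdf(mu + 3 * sigma, mu, sigma) - norm.cdf(mu - 3 * sigma, mu, sigma)

# Area between 100 and 140 (illustrative interval, an assumption)
prob_100_to_140 = norm.cdf(140, mu, sigma) - norm.cdf(100, mu, sigma)

print(f"P(60 < X < 180):  {within_3_sigma:.4f}")
print(f"P(100 < X < 140): {prob_100_to_140:.4f}")
```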