Principles of Data Science

9.3 Graphing Probability Distributions

Learning Outcomes

By the end of this section, you should be able to:

  • 9.3.1 Create graphs to visualize the shape of various types of probability distributions.
  • 9.3.2 Interpret probabilities as areas under probability distributions.
  • 9.3.3 Use Python to generate various data visualizations for probability distributions.

We introduced probability distributions in Discrete and Continuous Probability Distributions. Recall that the binomial distribution and the Poisson distribution are examples of discrete probability distributions, and the normal distribution is an example of a continuous probability distribution. In addition, a discrete random variable is a random variable that can take on only a finite or countably infinite number of values. A continuous random variable is a random variable that can take on any value within an interval, so its possible values are uncountably infinite.

Very often, a data scientist or researcher is interested in graphing these probability distributions to visualize the shape of the distribution and gain insight into the behavior of the distribution for various values of the random variable. For continuous probability distributions, such as the normal distribution, the area under the probability density function (PDF) over an interval is equal to the probability that the random variable falls in that interval. Determining probabilities associated with a probability distribution allows data scientists to measure and predict probabilities and to estimate the likelihood of achieving certain outcomes. These probabilities also form the basis for more advanced statistical analysis such as confidence interval determination and hypothesis testing.

Graphing the Binomial Distribution

The binomial distribution is used in applications where there are two possible outcomes for each trial in an experiment and the two possible outcomes can be considered as success or failure.

The necessary requirements to identify a binomial experiment and apply the binomial distribution include the following:

  • The experiment of interest is repeated for a fixed number of trials, and each trial is independent of other trials.
  • There are only two possible outcomes for each trial, which can be labeled as “success” or “failure.”
  • The probability of success remains the same for each trial of the experiment.
  • The random variable x counts the number of successes in the experiment. Notice that since x counts the number of successes, this implies that x is a discrete random variable.

When working with a binomial experiment, it is useful to identify two specific parameters:

  1. The number of trials in the experiment. Label this as n.
  2. The probability of success for each trial (which is a constant value). Label this as p.

We then count the number of successes of interest as the value of the discrete random variable. Label this as x.

Discrete and Continuous Probability Distributions introduced the binomial probability formula used to calculate binomial probabilities. However, most data scientists and researchers use technology such as Python to calculate probabilities associated with both discrete and continuous probability distributions such as the binomial distribution and the normal distribution.

The scipy.stats module, part of the SciPy library for Python, provides a number of functions for calculating these probabilities.

The function binom.pmf() is the probability mass function used to calculate probabilities associated with the binomial distribution. The syntax for using this function is

binom.pmf(x, n, p)

Where:

n is the number of trials in the experiment,

p is the probability of success,

x is the number of successes in the experiment.
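As a minimal sketch of the syntax above, the following computes a single binomial probability. The values n = 20, p = 0.5, and x = 10 are illustrative assumptions chosen for this sketch, not values from the text.

```python
from scipy.stats import binom

# Illustrative values (assumptions for this sketch)
n = 20    # number of trials
p = 0.5   # probability of success on each trial
x = 10    # number of successes of interest

# Probability of exactly x successes in n trials
prob_exact = binom.pmf(x, n, p)
print(f"P(X = {x}) = {prob_exact:.4f}")  # about 0.1762
```

Note the argument order: the number of successes comes first, followed by the number of trials and the probability of success.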

Data scientists are interested in graphing discrete distributions such as the binomial distribution to help predict the probability of achieving a certain number of successes in a given number of trials with a specified probability of success. Graphing also makes it possible to visualize how the shape of the binomial distribution changes as the parameters n and p are varied. Changes to the value of the probability of success can affect the shape of the distribution, as shown in Example 9.7.

Example 9.7

Problem

Use the Python function binom to graph the binomial distribution for n = 20 and three different values of p, namely p = 0.10, 0.50, and 0.85, and comment on the resulting shapes of the distributions.
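One possible way to approach this problem is sketched below. It is not necessarily the book's own solution. It assumes matplotlib and NumPy are available alongside scipy.stats, and the output file name binomial_shapes.png is a hypothetical choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

n = 20                           # number of trials, from the problem
p_values = [0.10, 0.50, 0.85]    # probabilities of success, from the problem
x = np.arange(0, n + 1)          # every possible number of successes: 0..n

# One bar chart per value of p, sharing the same vertical scale
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, p in zip(axes, p_values):
    ax.bar(x, binom.pmf(x, n, p))
    ax.set_title(f"Binomial: n = {n}, p = {p:.2f}")
    ax.set_xlabel("Number of successes, x")
axes[0].set_ylabel("Probability")
fig.tight_layout()

# Hypothetical file name; in an interactive session, plt.show() works as well
fig.savefig("binomial_shapes.png")
```

With these parameters, the graph for p = 0.10 is skewed right (most of the probability sits at small values of x), the graph for p = 0.50 is symmetric about x = 10, and the graph for p = 0.85 is skewed left.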

Graphing the Poisson Distribution

The Poisson distribution is used when the interest is in the number of occurrences of an event within a given interval, such as an interval of time or area. The random variable counts the number of occurrences in that interval.

Recall from Discrete and Continuous Probability Distributions that the Poisson distribution is used where the following conditions are met:

  • The experiment is based on counting the number of occurrences in a specific interval where the interval could represent time, area, volume, etc.
  • The number of occurrences in one specific interval is independent of the number of occurrences in a different interval.

Notice that when we count the number of occurrences in a specific interval, the count x is a discrete random variable.

Similar to binom.pmf(), the scipy.stats library also provides the poisson.pmf() function to calculate probabilities associated with the Poisson distribution.

The syntax for using this function is:

poisson.pmf(x, mu)

Where:

mu is the mean of the Poisson distribution (mu refers to the Greek letter μ),

x is the number of occurrences in the specific interval of interest.
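As a minimal sketch of this syntax, the following computes one Poisson probability. The values mu = 3 and x = 2 are illustrative assumptions for this sketch.

```python
from scipy.stats import poisson

# Illustrative values (assumptions for this sketch)
mu = 3   # mean number of occurrences per interval
x = 2    # number of occurrences of interest

# Probability of exactly x occurrences in one interval
prob_two = poisson.pmf(x, mu)
print(f"P(X = {x}) = {prob_two:.4f}")  # about 0.2240
```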

Example 9.8

Problem

The Port of Miami is home to many cruise lines, and the port is typically heavily congested with arriving cruise ships. Delays are common since there are typically not enough ports to accept the number of arriving ships. A data scientist is interested in analyzing this situation and notices that on average, 3 ships arrive every hour at the port. Assume the arrival times are independent of one another.

Use the Python function poisson to graph the Poisson distribution for n = 12 hours and a mean μ of 3 ships per hour.

Based on the graph, come to a conclusion regarding the number of ships arriving per hour.
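A possible approach is sketched below. It is not necessarily the book's own solution. The sketch makes one interpretive assumption: the value 12 is used as the upper end of the horizontal axis, so the graph shows the probability of 0 through 12 ships arriving in one hour. The file name poisson_ships.png is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

mu = 3                    # mean arrivals per hour, from the problem
x = np.arange(0, 13)      # 0..12 arrivals in an hour (assumed axis range)
pmf = poisson.pmf(x, mu)  # probability of each arrival count

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x, pmf)
ax.set_title("Poisson distribution: mean of 3 ships per hour")
ax.set_xlabel("Number of ships arriving in one hour, x")
ax.set_ylabel("Probability")
fig.tight_layout()
fig.savefig("poisson_ships.png")  # hypothetical file name

# Arrival counts with the highest probability (ties kept within a tolerance)
most_likely = x[pmf >= pmf.max() - 1e-12]
# Probability of more than 8 arrivals in an hour (survival function)
prob_more_than_8 = poisson.sf(8, mu)
print("Most likely counts:", most_likely)
print(f"P(X > 8) = {prob_more_than_8:.4f}")
```

Under these assumptions, the tallest bars are at 2 and 3 ships per hour (the two probabilities are equal when μ = 3), and the probability of more than 8 ships arriving in a single hour is well under 1%.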

Graphing the Normal Distribution

The normal distribution discussed in Discrete and Continuous Probability Distributions is one of the more important distributions in data science and statistical analysis. This continuous distribution is especially important in statistical analysis in that many measurements in nature, science, engineering, medicine, and business follow a normal distribution. The normal distribution also forms the basis for more advanced statistical analysis such as confidence intervals and hypothesis testing discussed in Hypothesis Testing.

The normal distribution is bell-shaped and has two parameters: the mean, μ, and the standard deviation, σ. The mean represents the center of the distribution, and the standard deviation measures the spread or dispersion of the distribution. The variable x represents the realization or observed value of the random variable X that follows a normal distribution.

The graph of the normal curve is symmetric about the mean μ. The shape of the graph of the normal distribution depends only on the mean and the standard deviation. Because the area under the curve must equal 1, a change in the standard deviation, σ, causes a change in the shape of the normal curve; the curve becomes fatter and wider or skinnier and taller depending on the value of σ. A change in the mean μ causes the graph to shift to the left or right. To determine probabilities associated with the normal distribution, we find specific areas under the graph of the normal curve within certain intervals of the random variable.
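The idea of probability as area under the normal curve can be sketched with the cumulative distribution function norm.cdf from scipy.stats, which returns the area to the left of a given value. The parameters below (mean 0, standard deviation 1) and the interval of one standard deviation on either side of the mean are illustrative assumptions.

```python
from scipy.stats import norm

# Illustrative parameters (assumptions for this sketch)
mean = 0.0
std_dev = 1.0

# Interval of interest: within one standard deviation of the mean
lower = mean - std_dev
upper = mean + std_dev

# Area under the curve between lower and upper:
# (area to the left of upper) minus (area to the left of lower)
area = norm.cdf(upper, loc=mean, scale=std_dev) - norm.cdf(lower, loc=mean, scale=std_dev)
print(f"P({lower} < X < {upper}) = {area:.4f}")  # about 0.6827
```

The same subtraction gives the area, and therefore the probability, for any interval and any choice of mean and standard deviation.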

To generate a graph of the normal distribution, we can plot many points that fall on the normal curve according to the formal definition of the probability density function (PDF) of the normal distribution, as follows:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
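To illustrate the PDF formula, the sketch below evaluates it directly with NumPy and compares the result with norm.pdf from scipy.stats. The helper name normal_pdf and the sample points are hypothetical choices made for this sketch.

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """Evaluate the normal PDF from its formula (hypothetical helper)."""
    coefficient = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * np.exp(exponent)

# A few sample points on a standard normal curve (illustrative values)
x_vals = np.linspace(-3.0, 3.0, 7)
manual = normal_pdf(x_vals, mu=0.0, sigma=1.0)
library = norm.pdf(x_vals, loc=0.0, scale=1.0)

# The two computations should agree to within floating-point error
print(np.allclose(manual, library))
```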

Of course, this is a job left to Python, and the Python code to generate a graph of the normal distribution is shown in Example 9.9.

Example 9.9

Problem

A medical researcher is investigating blood pressure–lowering medications and wants to create a graph of systolic blood pressures. Assume that blood pressures follow a normal distribution with a mean of 120 and a standard deviation of 20. Use Python to create a graph of this normal distribution.
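A possible approach is sketched below. It is not necessarily the book's own solution. The plotting range of four standard deviations on either side of the mean and the file name normal_blood_pressure.png are assumptions made for this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mean_bp = 120   # mean systolic blood pressure, from the problem
sd_bp = 20      # standard deviation, from the problem

# Plot over mean +/- 4 standard deviations (assumed range: 40 to 200)
bp = np.linspace(mean_bp - 4 * sd_bp, mean_bp + 4 * sd_bp, 401)
density = norm.pdf(bp, loc=mean_bp, scale=sd_bp)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(bp, density)
ax.set_title("Normal distribution: mean = 120, standard deviation = 20")
ax.set_xlabel("Systolic blood pressure, x")
ax.set_ylabel("Probability density")
fig.tight_layout()
fig.savefig("normal_blood_pressure.png")  # hypothetical file name
```

The resulting curve is bell-shaped, symmetric about 120, and reaches its highest point at the mean.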

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.