Principles of Data Science

3.2 Measures of Variation

Learning Outcomes

By the end of this section, you should be able to:

  • 3.2.1 Define and calculate the range, the variance, and the standard deviation for a dataset.
  • 3.2.2 Use Python to calculate measures of variation for a dataset.

Providing some measure of the spread, or variation, in a dataset is crucial to a comprehensive summary of the dataset. Two datasets may have the same mean but can exhibit very different spread, and so a measure of dispersion for a dataset is very important. While measures of central tendency (like mean, median, and mode) describe the center or average value of a distribution, measures of dispersion give insights into how much individual data points deviate from this central value.

The following two datasets are the exam scores for a group of three students in a biology course and in a statistics course.

Dataset A: Exam scores for students in a biology course: 40, 70, 100
Dataset B: Exam scores for students in a statistics course: 69, 70, 71

Notice that the mean score for both Dataset A and Dataset B is 70.

However, the datasets are significantly different from one another:

Dataset A has larger variability: one student scored 30 points below the mean and another student scored 30 points above it.
Dataset B has smaller variability: the exam scores are much more tightly clustered around the mean of 70.

This example illustrates that publishing the mean of a dataset is often inadequate to fully communicate the characteristics of the dataset. Instead, data scientists will typically include a measure of variation as well.
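
As a quick check of the comparison above, the following minimal sketch uses Python's standard `statistics` module to confirm that both datasets share a mean of 70 while their deviations from the mean differ sharply. The variable names are illustrative choices, not part of the original text.

```python
from statistics import mean

# Exam scores from the text
dataset_a = [40, 70, 100]  # biology course
dataset_b = [69, 70, 71]   # statistics course

mean_a = mean(dataset_a)
mean_b = mean(dataset_b)

# Deviation of each score from its own dataset's mean
deviations_a = [x - mean_a for x in dataset_a]
deviations_b = [x - mean_b for x in dataset_b]

print("Means:", mean_a, mean_b)            # both are 70
print("Deviations for A:", deviations_a)   # -30, 0, 30
print("Deviations for B:", deviations_b)   # -1, 0, 1
```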

The three primary measures of variability are range, variance, and standard deviation, and these are described next.

Range

Range is a measure of dispersion for a dataset that is calculated by subtracting the minimum from the maximum of the dataset:

$$\text{Range} = \text{Max} - \text{Min}$$

Range is a straightforward calculation but makes use of only two of the data values in a dataset. The range can also be affected by outliers.

Example 3.6

Problem

Calculate the range for Dataset A and Dataset B:

Dataset A: Exam scores for students in a biology course: 40, 70, 100
Dataset B: Exam scores for students in a statistics course: 69, 70, 71
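
One way to work through this problem is sketched below in Python. The helper name `data_range` is a hypothetical choice for illustration; it simply applies the definition of range (maximum minus minimum) to each dataset.

```python
# Exam scores from the problem statement
dataset_a = [40, 70, 100]  # biology course
dataset_b = [69, 70, 71]   # statistics course


def data_range(values):
    """Return the range of a dataset: maximum minus minimum."""
    return max(values) - min(values)


range_a = data_range(dataset_a)  # 100 - 40 = 60
range_b = data_range(dataset_b)  # 71 - 69 = 2

print("Range of Dataset A:", range_a)
print("Range of Dataset B:", range_b)
```

The range of Dataset A (60) is much larger than the range of Dataset B (2), which matches the observation that the biology scores are more spread out.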

One drawback to the use of the range is that it doesn’t take into account every data value. The range only uses two data values from the dataset: the minimum (min) and the maximum (max). Also, the range is influenced by outliers, since an outlier might appear as the minimum or maximum data value and thus skew the result. For these reasons, we typically use other measures of variation, such as variance or standard deviation.

Variance

The variance provides a measure of the spread of data values by using the squared deviations from the mean. The more the individual data values differ from the mean, the larger the variance.

A financial advisor might use variance to determine the volatility of an investment and therefore help guide financial decisions. For example, a more cautious investor might opt for investments with low volatility.

The formula used to calculate variance depends on whether the data is collected from a sample or a population. The notation $s^2$ is used to represent the sample variance, and the notation $\sigma^2$ is used to represent the population variance.

Formula for the sample variance:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Formula for the population variance:

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

In these formulas:
$x$ represents the individual data values
$\bar{x}$ represents the sample mean
$n$ represents the sample size
$\mu$ represents the population mean
$N$ represents the population size
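
To make the two formulas concrete, here is a minimal Python sketch that applies them to the exam-score datasets from earlier in this section and compares the results with the standard library's `statistics.variance` (divides by n - 1) and `statistics.pvariance` (divides by N). The helper `sum_squared_deviations` is a hypothetical name introduced for illustration.

```python
from statistics import pvariance, variance

dataset_a = [40, 70, 100]  # biology course
dataset_b = [69, 70, 71]   # statistics course


def sum_squared_deviations(values):
    """Sum of squared deviations of each value from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


# Manual calculation for Dataset A: the deviations are -30, 0, 30
ss_a = sum_squared_deviations(dataset_a)          # 900 + 0 + 900 = 1800
sample_var_a = ss_a / (len(dataset_a) - 1)        # 1800 / 2 = 900
population_var_a = ss_a / len(dataset_a)          # 1800 / 3 = 600

# The library functions should agree with the manual results
print(sample_var_a, variance(dataset_a))
print(population_var_a, pvariance(dataset_a))

# Dataset B is tightly clustered, so its variance is far smaller
sample_var_b = variance(dataset_b)                # 2 / 2 = 1
population_var_b = pvariance(dataset_b)           # 2 / 3, about 0.667
print(sample_var_b, population_var_b)
```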

Alternate Formula for Variance

An alternate formula for the variance is available. It is sometimes used for more efficient computations:

$$\sigma^2 = \frac{\sum x^2}{N} - \mu^2$$
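
The alternate formula follows from expanding the square in the population variance formula and using the fact that the data values sum to N times the population mean. The short derivation below is a sketch of that step, followed by a numerical check that treats Dataset A (40, 70, 100) as a population purely for illustration.

```latex
\begin{aligned}
\sigma^2 &= \frac{\sum (x - \mu)^2}{N}
          = \frac{\sum x^2 - 2\mu \sum x + N\mu^2}{N} \\
         &= \frac{\sum x^2}{N} - 2\mu \cdot \mu + \mu^2
          \qquad \text{since } \sum x = N\mu \\
         &= \frac{\sum x^2}{N} - \mu^2 \\[6pt]
\text{Check: } \sigma^2 &= \frac{40^2 + 70^2 + 100^2}{3} - 70^2
          = \frac{16500}{3} - 4900 = 5500 - 4900 = 600
\end{aligned}
```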

In the formulas for sample variance and population variance, notice the denominator for the sample variance is $n - 1$, whereas the denominator for the population variance is $N$. Dividing by $n - 1$ in the sample variance provides the best estimate of the population variance, in the sense that if repeated samples of size $n$ are taken and the sample variance is computed each time, then the average of those sample variances will tend to the population variance as the number of repeated samples increases.
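
The effect of the denominator can be seen without any random simulation in a very small case. The sketch below treats the three scores 40, 70, 100 as an entire population (an assumption made only for illustration) and lists every ordered sample of size 2 drawn with replacement. Averaging the sample variances computed with n - 1 recovers the population variance exactly, while dividing by n underestimates it.

```python
from itertools import product
from statistics import pvariance, variance

# Assumption for illustration: these three scores are the whole population
population = [40, 70, 100]
population_var = pvariance(population)  # 600

# All 9 ordered samples of size 2, drawn with replacement
samples = list(product(population, repeat=2))

# Average of the sample variances using the n - 1 denominator
avg_with_n_minus_1 = sum(variance(s) for s in samples) / len(samples)

# Average of the variances using the n denominator instead
avg_with_n = sum(pvariance(s) for s in samples) / len(samples)

print("Population variance:", population_var)
print("Average using n - 1:", avg_with_n_minus_1)  # matches the population variance
print("Average using n:", avg_with_n)              # too small
```

In this small case the n - 1 version averages to 600 and the n version averages to 300, so only the former is centered on the population variance.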

It is important to note that in many data science applications, population data is unavailable, and so we typically calculate the sample variance. For example, if a researcher wanted to estimate the percentage of smokers for all adults in the United States, it would be impractical to collect data from every adult in the United States.

Notice that the sample variance is based on a sum of squared deviations, so its units of measurement are the squares of the units of the original data. Since these square units differ from the units of the original data, the variance can be hard to interpret. By contrast, standard deviation is measured in the same units as the original dataset, and thus the standard deviation is more commonly used to measure the spread of a dataset.

Standard Deviation

The standard deviation of a dataset provides a numerical measure of the overall amount of variation in a dataset in the same units as the data; it can be used to determine whether a particular data value is close to or far from the mean, relative to the typical distance from the mean.

The standard deviation is always positive or zero. It is small when the data values are all concentrated close to the mean, exhibiting little variation, or spread. It is larger when the data values are spread out more from the mean, exhibiting more variation. A smaller standard deviation implies less variability in a dataset, and a larger standard deviation implies more variability in a dataset.

Suppose that we are studying the variability of two companies (A and B) with respect to employee salaries. The average salary for both companies is $60,000. For Company A, the standard deviation of salaries is $8,000, whereas the standard deviation for salaries for Company B is $19,000. Because Company B has a higher standard deviation, we know that there is more variation in the employee salaries for Company B as compared to Company A.

There are two different formulas for calculating standard deviation. Which formula to use depends on whether the data represents a sample or a population. The notation $s$ is used to represent the sample standard deviation, and the notation $\sigma$ is used to represent the population standard deviation. In the formulas shown, $\bar{x}$ is the sample mean, $\mu$ is the population mean, $n$ is the sample size, and $N$ is the population size.

Formula for the sample standard deviation:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Formula for the population standard deviation:

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
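
As an illustration of the two formulas, the following sketch computes both versions for the exam-score datasets from the start of this section, using `statistics.stdev` (sample, divides by n - 1) and `statistics.pstdev` (population, divides by N) from Python's standard library. Whether the scores should be treated as a sample or a population depends on context; both are shown here for comparison.

```python
from statistics import pstdev, stdev

dataset_a = [40, 70, 100]  # biology course
dataset_b = [69, 70, 71]   # statistics course

# Sample standard deviation: square root of the sample variance
s_a = stdev(dataset_a)        # sqrt(1800 / 2) = 30
s_b = stdev(dataset_b)        # sqrt(2 / 2) = 1

# Population standard deviation: square root of the population variance
sigma_a = pstdev(dataset_a)   # sqrt(1800 / 3), about 24.49
sigma_b = pstdev(dataset_b)   # sqrt(2 / 3), about 0.82

print(f"Dataset A: s = {s_a:.2f}, sigma = {sigma_a:.2f}")
print(f"Dataset B: s = {s_b:.2f}, sigma = {sigma_b:.2f}")
```

Both measures are expressed in exam points, the same units as the data, and both are far larger for Dataset A than for Dataset B.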

Notice that the sample standard deviation is the square root of the sample variance. This means that once the sample variance has been calculated, the sample standard deviation follows by taking its square root, as in Example 3.7.

Example 3.7

Problem

A biologist calculates that the sample variance for the amount of plant growth for a sample of plants is 8.7 cm². Calculate the sample standard deviation.
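
One way to work this out is to take the square root of the given sample variance. The rounding to two decimal places below is a presentation choice.

```latex
s = \sqrt{s^2} = \sqrt{8.7\ \text{cm}^2} \approx 2.95\ \text{cm}
```

Note that the result is in centimeters, the same units as the original growth measurements.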

Example 3.8

Problem

Assume the sample variance ($s^2$) for a dataset is calculated as 42.2. Based on this, calculate the sample standard deviation.
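
A minimal Python sketch of this calculation is shown below. It takes the square root of the given sample variance with `math.sqrt`, then squares the result to confirm that it returns the original variance. The variable names are illustrative.

```python
from math import sqrt

sample_variance = 42.2

# The sample standard deviation is the square root of the sample variance
sample_std = sqrt(sample_variance)
print(f"Sample standard deviation: {sample_std:.2f}")  # about 6.50

# Squaring the standard deviation recovers the variance
recovered_variance = sample_std ** 2
print(f"Recovered variance: {recovered_variance:.1f}")
```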

Notice that the sample variance is the square of the sample standard deviation, so if the sample standard deviation is known, the sample variance can easily be calculated.

Use of Technology for Calculating Measures of Variability

Because calculating variance and standard deviation by hand is tedious, technology is typically used to calculate these measures of variability. For example, refer to the examples shown in Using Python for Measures of Variation.

Coefficient of Variation

A data scientist might be interested in comparing the variation of datasets that have different units of measurement or different means, and in these scenarios the coefficient of variation (CV) can be used. The coefficient of variation measures the variation of a dataset by expressing the standard deviation as a percentage of the mean. Note: the coefficient of variation is typically expressed in a percentage format.

$$\text{CV} = \frac{\sigma}{\mu} \times 100\%$$

$$\text{Sample CV} = \frac{s}{\bar{x}} \times 100\%$$

Example 3.9

Problem

Compare the relative variability for Company A versus Company B using the coefficient of variation, based on the following sample data:

Company A: Sample Mean = $68,000, Sample Standard Deviation = $9,200

Company B: Sample Mean = $71,000, Sample Standard Deviation = $6,400
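
One way to carry out this comparison is sketched below in Python. The function name `coefficient_of_variation` is a hypothetical helper that applies the sample CV formula (standard deviation divided by mean, times 100) to the figures given in the problem.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Return the standard deviation as a percentage of the mean."""
    return std_dev / mean_value * 100


# Sample statistics given in the problem
cv_a = coefficient_of_variation(9_200, 68_000)   # about 13.5%
cv_b = coefficient_of_variation(6_400, 71_000)   # about 9.0%

print(f"Company A CV: {cv_a:.1f}%")
print(f"Company B CV: {cv_b:.1f}%")

if cv_a > cv_b:
    print("Company A has more relative variability in salaries.")
else:
    print("Company B has more relative variability in salaries.")
```

Under these figures, Company A's salaries vary more relative to their mean (about 13.5%) than Company B's (about 9.0%).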

Using Python for Measures of Variation

DataFrame.describe() also computes the standard deviation of each numeric column of a dataset. The std row of its output lists the standard deviation of each column (see Figure 3.3).

A data table summarizing statistics about 966 items in the “movie profit” dataset, with columns for “unnamed: 0,” “rating,” “duration,” “US gross” and “worldwide gross.”  The standard deviation row is highlighted. The standard deviation is about 0.89 for ratings and about 21.6 for durations.  The standard deviation for US gross earnings is about $110.6 million and for worldwide gross earnings about $294.76 million.
Figure 3.3 The Output of DataFrame.describe() with the Movie Profit Dataset
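
The movie profit dataset itself is not reproduced here, so the following minimal sketch builds a small stand-in DataFrame from the exam scores used earlier in this section (the column names are assumptions chosen for illustration). It shows how the std row can be pulled out of the `describe()` output, along with a few related pandas methods. By default, pandas uses the sample formulas (denominator n - 1) for `std()` and `var()`; passing `ddof=0` gives the population versions.

```python
import pandas as pd

# Stand-in data: exam scores from earlier in this section
scores = pd.DataFrame({
    "biology": [40, 70, 100],
    "statistics": [69, 70, 71],
})

# describe() reports count, mean, std, min, quartiles, and max per numeric column
summary = scores.describe()
std_row = summary.loc["std"]           # sample standard deviation of each column
print(std_row)

# Related measures of variation computed directly
sample_var = scores.var()              # sample variance (ddof=1 by default)
population_std = scores.std(ddof=0)    # population standard deviation
score_range = scores.max() - scores.min()

print(sample_var)
print(population_std)
print(score_range)
```
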
Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
Citation information

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.