Learning Outcomes
By the end of this section, you should be able to:
- 3.2.1 Define and calculate the range, the variance, and the standard deviation for a dataset.
- 3.2.2 Use Python to calculate measures of variation for a dataset.
Providing some measure of the spread, or variation, in a dataset is crucial to a comprehensive summary of the dataset. Two datasets may have the same mean but can exhibit very different spread, and so a measure of dispersion for a dataset is very important. While measures of central tendency (like mean, median, and mode) describe the center or average value of a distribution, measures of dispersion give insights into how much individual data points deviate from this central value.
The following two datasets are the exam scores for a group of three students in a biology course and in a statistics course.
Dataset A: Exam scores for students in a biology course: 40, 70, 100
Dataset B: Exam scores for students in a statistics course: 69, 70, 71
Notice that the mean score for both Dataset A and Dataset B is 70.
However, the two datasets are quite different from one another:
Dataset A has larger variability: one student scored 30 points below the mean and another student scored 30 points above the mean.
Dataset B has smaller variability: the exam scores are much more tightly clustered around the mean of 70.
This example illustrates that publishing the mean of a dataset is often inadequate to fully communicate the characteristics of the dataset. Instead, data scientists will typically include a measure of variation as well.
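As a quick check of this comparison, the following minimal sketch uses Python's standard `statistics` module to confirm that both datasets share the same mean while their deviations from the mean differ. The variable names are chosen here for illustration only.

```python
from statistics import mean

dataset_a = [40, 70, 100]  # biology exam scores (Dataset A)
dataset_b = [69, 70, 71]   # statistics exam scores (Dataset B)

mean_a = mean(dataset_a)
mean_b = mean(dataset_b)

# Deviation of each score from the mean of its own dataset
deviations_a = [x - mean_a for x in dataset_a]
deviations_b = [x - mean_b for x in dataset_b]

print("Means:", mean_a, mean_b)
print("Deviations for A:", deviations_a)
print("Deviations for B:", deviations_b)
```

The means match, but the deviations for Dataset A are thirty times larger than those for Dataset B, which is exactly the information the mean alone hides.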
The three primary measures of variability are range, variance, and standard deviation, and these are described next.
Range
Range is a measure of dispersion for a dataset that is calculated by subtracting the minimum from the maximum of the dataset:
$$\text{Range} = \text{maximum} - \text{minimum}$$
Range is a straightforward calculation but makes use of only two of the data values in a dataset. The range can also be affected by outliers.
Example 3.6
Problem
Calculate the range for Dataset A and Dataset B:
Dataset A: Exam scores for students in a biology course: 40, 70, 100
Dataset B: Exam scores for students in a statistics course: 69, 70, 71
Solution
For Dataset A, the maximum data value is 100 and the minimum data value is 40.
The range is then calculated as:
$$\text{Range} = 100 - 40 = 60$$
For Dataset B, the maximum data value is 71 and the minimum data value is 69.
The range is then calculated as:
$$\text{Range} = 71 - 69 = 2$$
The range clearly indicates that there is much less spread in Dataset B as compared to Dataset A.
One drawback to the use of the range is that it doesn’t take into account every data value. The range uses only two data values from the dataset: the minimum (min) and the maximum (max). The range is also influenced by outliers, since an outlier will often appear as the minimum or maximum data value and thus skew the result. For these reasons, we typically use other measures of variation, such as variance or standard deviation.
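The range calculation and its sensitivity to outliers can be sketched in a few lines of Python. The helper `data_range` and the extra score of 20 are hypothetical, introduced here only to illustrate the point.

```python
def data_range(values):
    """Return the range of a dataset: maximum minus minimum."""
    return max(values) - min(values)

dataset_a = [40, 70, 100]  # biology exam scores
dataset_b = [69, 70, 71]   # statistics exam scores

range_a = data_range(dataset_a)
range_b = data_range(dataset_b)

# Hypothetical outlier: one additional, unusually low score in Dataset B
dataset_b_with_outlier = dataset_b + [20]
range_b_with_outlier = data_range(dataset_b_with_outlier)

print("Range of A:", range_a)
print("Range of B:", range_b)
print("Range of B with outlier:", range_b_with_outlier)
```

A single unusual value changes the range of Dataset B from 2 to 51, even though the other three scores are unchanged.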
Variance
The variance provides a measure of the spread of data values by using the squared deviations from the mean. The more the individual data values differ from the mean, the larger the variance.
A financial advisor might use variance to determine the volatility of an investment and therefore help guide financial decisions. For example, a more cautious investor might opt for investments with low volatility.
The formula used to calculate variance depends on whether the data is collected from a sample or a population. The notation $s^2$ is used to represent the sample variance, and the notation $\sigma^2$ is used to represent the population variance.
Formula for the sample variance:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Formula for the population variance:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
In these formulas:
$x$ represents the individual data values
$\bar{x}$ represents the sample mean
$n$ represents the sample size
$\mu$ represents the population mean
$N$ represents the population size
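To make the two definitions concrete, here is a minimal sketch that computes the sample variance (sum of squared deviations divided by n - 1) and the population variance (divided by N) directly, and compares the results with `variance` and `pvariance` from Python's standard `statistics` module. The function names `sample_variance` and `population_variance` are illustrative, not part of any library.

```python
from statistics import mean, pvariance, variance

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    size = len(values)
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / size

dataset_a = [40, 70, 100]  # biology exam scores

s2 = sample_variance(dataset_a)
sigma2 = population_variance(dataset_a)

print("Sample variance:", s2, "| statistics.variance:", variance(dataset_a))
print("Population variance:", sigma2, "| statistics.pvariance:", pvariance(dataset_a))
```

For Dataset A the squared deviations sum to 1,800, so treating the scores as a sample gives 1,800 / 2 = 900, while treating them as a population gives 1,800 / 3 = 600.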
Alternate Formula for Variance
An alternate formula for the variance is available. It is sometimes used for more efficient computations:
$$s^2 = \frac{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}{n - 1}$$
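The sketch below assumes the common computational form of the sample variance, in which the sum of the squared values minus (sum of the values)² / n is divided by n - 1, and checks that it agrees with `statistics.variance`. The function name `sample_variance_shortcut` is hypothetical.

```python
from statistics import variance

def sample_variance_shortcut(values):
    """Computational form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x_squared = sum(x * x for x in values)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)

dataset_a = [40, 70, 100]  # biology exam scores
dataset_b = [69, 70, 71]   # statistics exam scores

shortcut_a = sample_variance_shortcut(dataset_a)
shortcut_b = sample_variance_shortcut(dataset_b)

print("Dataset A:", shortcut_a, "vs", variance(dataset_a))
print("Dataset B:", shortcut_b, "vs", variance(dataset_b))
```

This form needs only one pass over the data, but with floating-point numbers it can lose precision when the values are large relative to their spread, so the deviation-based definition is usually the safer choice in code.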
In the formulas for sample variance and population variance, notice that the denominator for the sample variance is $n - 1$, whereas the denominator for the population variance is $N$. Using $n - 1$ in the denominator of the sample variance provides the best estimate of the population variance, in the sense that if repeated samples of size $n$ are taken and the sample variance is computed each time, then the average of those sample variances will tend to the population variance as the number of repeated samples increases.
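The effect of the n - 1 divisor can be illustrated without random simulation. The sketch below treats the three biology scores as a complete (tiny) population, enumerates every possible ordered sample of size 2 drawn with replacement, and averages the variance of each sample under both divisors. The setup, including the choice of sample size 2 and the helper `variance_with_divisor`, is an assumption made for illustration.

```python
from itertools import product
from statistics import mean, pvariance

population = [40, 70, 100]  # treated here as an entire population
n = 2                       # sample size (illustrative choice)

def variance_with_divisor(sample, divisor):
    """Sum of squared deviations from the sample mean, over a chosen divisor."""
    x_bar = mean(sample)
    return sum((x - x_bar) ** 2 for x in sample) / divisor

# Every ordered sample of size n drawn with replacement (3^2 = 9 samples)
samples = list(product(population, repeat=n))

avg_with_n_minus_1 = mean([variance_with_divisor(s, n - 1) for s in samples])
avg_with_n = mean([variance_with_divisor(s, n) for s in samples])

print("Population variance:", pvariance(population))
print("Average sample variance, divisor n - 1:", avg_with_n_minus_1)
print("Average sample variance, divisor n:", avg_with_n)
```

Averaged over all possible samples, the n - 1 version recovers the population variance of 600, while dividing by n gives 300 and so underestimates it.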
It is important to note that in many data science applications, population data is unavailable, and so we typically calculate the sample variance. For example, if a researcher wanted to estimate the percentage of smokers for all adults in the United States, it would be impractical to collect data from every adult in the United States.
Notice that the sample variance is based on a sum of squared deviations, so its units of measurement are the squares of the units of the original data. Because these squared units differ from the units of the original data, the variance can be difficult to interpret. By contrast, standard deviation is measured in the same units as the original dataset, and thus the standard deviation is more commonly used to measure the spread of a dataset.
Standard Deviation
The standard deviation of a dataset provides a numerical measure of the overall amount of variation in a dataset in the same units as the data; it can be used to determine whether a particular data value is close to or far from the mean, relative to the typical distance from the mean.
The standard deviation is always positive or zero. It is small when the data values are all concentrated close to the mean, exhibiting little variation, or spread. It is larger when the data values are spread out more from the mean, exhibiting more variation. A smaller standard deviation implies less variability in a dataset, and a larger standard deviation implies more variability in a dataset.
Suppose that we are studying the variability of two companies (A and B) with respect to employee salaries. The average salary for both companies is $60,000. For Company A, the standard deviation of salaries is $8,000, whereas the standard deviation for salaries for Company B is $19,000. Because Company B has a higher standard deviation, we know that there is more variation in the employee salaries for Company B as compared to Company A.
There are two different formulas for calculating standard deviation. Which formula to use depends on whether the data represents a sample or a population. The notation $s$ is used to represent the sample standard deviation, and the notation $\sigma$ is used to represent the population standard deviation. In the formulas shown, $\bar{x}$ is the sample mean, $\mu$ is the population mean, $n$ is the sample size, and $N$ is the population size.
Formula for the sample standard deviation:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Formula for the population standard deviation:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
Notice that the sample standard deviation is the square root of the sample variance. This means that once the sample variance has been calculated, the sample standard deviation can be found simply by taking its square root, as in Example 3.7.
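As a rough illustration, the standard `statistics` module provides `stdev` (sample, divisor n - 1) and `pstdev` (population, divisor N). The sketch below applies both to the two exam-score datasets and confirms that the sample standard deviation equals the square root of the sample variance.

```python
import math
from statistics import pstdev, stdev, variance

dataset_a = [40, 70, 100]  # biology exam scores
dataset_b = [69, 70, 71]   # statistics exam scores

s_a = stdev(dataset_a)       # sample standard deviation
s_b = stdev(dataset_b)
sigma_a = pstdev(dataset_a)  # population standard deviation
sigma_b = pstdev(dataset_b)

# The sample standard deviation is the square root of the sample variance
s_a_from_variance = math.sqrt(variance(dataset_a))

print("Sample standard deviations:", s_a, s_b)
print("Population standard deviations:", sigma_a, sigma_b)
print("Square root of sample variance of A:", s_a_from_variance)
```

Which function is appropriate depends on whether the scores are regarded as a sample from a larger group or as the entire population of interest.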
Example 3.7
Problem
A biologist calculates that the sample variance for the amount of plant growth for a sample of plants is 8.7 cm². Calculate the sample standard deviation.
Solution
The sample standard deviation ($s$) is calculated as the square root of the variance.
$$s = \sqrt{s^2} = \sqrt{8.7} \approx 2.95 \text{ cm}$$
Example 3.8
Problem
Assume the sample variance ($s^2$) for a dataset is calculated as 42.2. Based on this, calculate the sample standard deviation.
Solution
The sample standard deviation ($s$) is calculated as the square root of the variance.
$$s = \sqrt{s^2} = \sqrt{42.2} \approx 6.5$$
This result indicates that the sample standard deviation is about 6.5, expressed in the same units as the original data.
Notice that the sample variance is the square of the sample standard deviation, so if the sample standard deviation is known, the sample variance can easily be calculated.
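The conversions in Examples 3.7 and 3.8 can be checked with `math.sqrt`; squaring the result recovers the variance. This is a minimal sketch, and the variable names are chosen here for illustration.

```python
import math

variance_plant_growth = 8.7  # sample variance from Example 3.7, in cm^2
variance_example_3_8 = 42.2  # sample variance from Example 3.8

sd_plant_growth = math.sqrt(variance_plant_growth)
sd_example_3_8 = math.sqrt(variance_example_3_8)

print("Example 3.7 standard deviation:", round(sd_plant_growth, 2))
print("Example 3.8 standard deviation:", round(sd_example_3_8, 1))

# Going the other way: squaring the standard deviation gives back the variance
recovered_variance = sd_example_3_8 ** 2
print("Recovered variance:", round(recovered_variance, 1))
```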
Use of Technology for Calculating Measures of Variability
Due to the complexity of calculating variance and standard deviation, technology is typically utilized to calculate these measures of variability. For example, refer to the examples shown in Coefficient of Variation on using Python for measures of variation.
Coefficient of Variation
A data scientist might be interested in comparing the variation of datasets that have different units of measurement or different means, and in these scenarios the coefficient of variation (CV) can be used. The coefficient of variation measures the variation of a dataset by expressing the standard deviation as a percentage of the mean. For sample data:
$$CV = \frac{s}{\bar{x}} \times 100\%$$
Note: The coefficient of variation is typically expressed in a percentage format.
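A minimal sketch of the calculation, assuming sample data and using the two exam-score datasets from the start of this section rather than the company data in the example that follows. The function name `coefficient_of_variation` is hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

dataset_a = [40, 70, 100]  # biology exam scores
dataset_b = [69, 70, 71]   # statistics exam scores

cv_a = coefficient_of_variation(dataset_a)
cv_b = coefficient_of_variation(dataset_b)

print(f"CV for Dataset A: {cv_a:.1f}%")
print(f"CV for Dataset B: {cv_b:.1f}%")
```

Because the result is a percentage of the mean, it has no units, which is what allows datasets measured on different scales to be compared. The calculation is only meaningful when the mean is not zero.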
Example 3.9
Problem
Compare the relative variability for Company A versus Company B using the coefficient of variation, based on the following sample data:
Company A:
Company B:
Solution
Calculate the coefficient of variation for each company:
Company A exhibits more variability relative to the mean as compared to Company B.
Using Python for Measures of Variation
The pandas method DataFrame.describe() also computes the standard deviation of each numeric column of a dataset. The std row of its output lists the sample standard deviation of each column (see Figure 3.3).
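The following sketch shows one way this might look in practice, assuming pandas is installed. It places the two exam-score datasets from this section in a small DataFrame; the column names are made up for illustration. Note that pandas uses the sample formulas (divisor n - 1) by default for `std` and `var`, and passing `ddof=0` switches to the population formulas.

```python
import pandas as pd

# Exam scores from this section, one column per course (column names are illustrative)
scores = pd.DataFrame({
    "biology": [40, 70, 100],
    "statistics": [69, 70, 71],
})

summary = scores.describe()          # count, mean, std, min, quartiles, max
print(summary)

sample_std = summary.loc["std"]      # sample standard deviation per column
sample_var = scores.var()            # sample variance (ddof=1 by default)
population_std = scores.std(ddof=0)  # population standard deviation
value_range = scores.max() - scores.min()

print("Sample variance:\n", sample_var)
print("Population standard deviation:\n", population_std)
print("Range:\n", value_range)
```

The `std` row of the summary should match the sample standard deviations computed by hand for each dataset.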