Learning Outcomes
By the end of this section, you should be able to:
- 3.3.1 Define and calculate percentiles, quartiles, and z-scores for a dataset.
- 3.3.2 Use Python to calculate measures of position for a dataset.
Common measures of position include percentiles, quartiles, and z-scores, all of which indicate the relative location of a particular data point within a dataset.
Percentiles
If a student scores 47 on a biology exam, it is difficult to know if the student did well or poorly compared to the population of all other students taking the exam. Percentiles provide a way to assess and compare the distribution of values and the position of a specific data point in relation to the entire dataset by indicating the percentage of data points that fall below it. Specifically, a percentile is a value on a scale of one hundred that indicates the percentage of a distribution that is equal to or below it. Let’s say the student learns they scored in the 90th percentile on the biology exam. This percentile indicates that the student has an exam score higher than 90% of all other students taking the test. This is the same as saying that the student’s score places the student in the top 10% of all students taking the biology test. Thus, this student scoring in the 90th percentile did very well on the exam, even if the actual score was 47.
To calculate percentiles, the data must be ordered from smallest to largest and then the ordered data divided into hundredths. If you score in the 80th percentile on an aptitude test, that does not necessarily mean that you scored 80% on the test. It means that 80% of the test scores are the same as or less than your score and the remaining 20% of the scores are the same as or greater than your score.
Percentiles are useful for comparing many types of values. For example, a stock market mutual fund might report that the performance for the fund over the past year was in the 80th percentile of all mutual funds in the peer group. This indicates that the fund performed better than 80% of all other funds in the peer group. This also indicates that 20% of the funds performed better than this particular fund.
To calculate the percentile for a specific data value in a dataset, first order the dataset from smallest to largest and count the number of data values in the dataset. Locate the measurement of interest and count how many data values fall below the measurement. Then the percentile for the measurement is calculated as follows:
$$\text{Percentile} = \frac{\text{number of data values below the measurement}}{\text{total number of data values}} \times 100$$
Example 3.10
Problem
The following ordered dataset represents the scores of 15 employees on an aptitude test:
51, 63, 65, 68, 71, 75, 75, 77, 79, 82, 88, 89, 89, 92, 95
Determine the percentile for the employee who scored 88 on the aptitude test.
Solution
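There are 15 data values in total, and there are 10 data values below 88. Dividing the number of values below the measurement by the total number of values and multiplying by 100 gives:
$$\text{Percentile} = \frac{10}{15} \times 100 \approx 67$$
The employee who scored 88 on the aptitude test is at approximately the 67th percentile.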
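The calculation in Example 3.10 can be sketched in a few lines of Python. This is a minimal illustration that assumes the "count the values below the measurement" definition used in this section; the function name `percentile_rank` is a hypothetical helper, not a library routine.

```python
def percentile_rank(data, value):
    """Percentage of values in `data` that fall strictly below `value`."""
    below = sum(1 for x in data if x < value)
    return below / len(data) * 100


# Aptitude test scores from Example 3.10
scores = [51, 63, 65, 68, 71, 75, 75, 77, 79, 82, 88, 89, 89, 92, 95]

rank = percentile_rank(scores, 88)
print(round(rank))  # 67
```

Other definitions of a percentile (for example, counting values equal to the measurement) would give a slightly different result for the same data.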
Quartiles
While percentiles separate data into 100 equal parts, quartiles separate data into quarters, or four equal parts. To find the quartiles, first find the median, or second quartile, $Q_2$. The first quartile, $Q_1$, is the middle value, or median, of the lower half of the data, and the third quartile, $Q_3$, is the middle value of the upper half of the data.
Note the following correspondence between quartiles and percentiles:
- The first quartile corresponds to the 25th percentile.
- The second quartile (which is the median) corresponds to the 50th percentile.
- The third quartile corresponds to the 75th percentile.
Example 3.11
Problem
Consider the following ordered dataset, which represents the time in seconds for an athlete to complete a 40-yard run:
5.4, 6.0, 6.3, 6.8, 7.1, 7.2, 7.4, 7.5, 7.9, 8.2, 8.7
Solution
The median, or second quartile, is the middle value in this dataset, which is 7.2. Notice that 50% of the data values are below the median, and 50% of the data values are above the median. The lower half of the data values are 5.4, 6.0, 6.3, 6.8, 7.1. Note that these are the data values below the median. The upper half of the data values are 7.4, 7.5, 7.9, 8.2, 8.7, which are the data values above the median.
To find the first quartile, $Q_1$, locate the middle value of the lower half of the data (5.4, 6.0, 6.3, 6.8, 7.1). The middle value of the lower half of the dataset is 6.3. Notice that one-fourth, or 25%, of the data values are below this first quartile, and 75% of the data values are above this first quartile.
To find the third quartile, $Q_3$, locate the middle value of the upper half of the data (7.4, 7.5, 7.9, 8.2, 8.7). The middle value of the upper half of the dataset is 7.9. Notice that one-fourth, or 25%, of the data values are above this third quartile, and 75% of the data values are below this third quartile.
Thus, the quartiles $Q_1$, $Q_2$, $Q_3$ for this dataset are 6.3, 7.2, 7.9, respectively.
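The median-of-halves procedure from Example 3.11 can be sketched as follows. This is one possible implementation under the assumption, matching the worked example, that the median itself is left out of both halves when the number of values is odd; the function name `quartiles` is hypothetical.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    When the number of values is odd, the median is excluded from
    both the lower and the upper half.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half + 1:] if n % 2 else values[half:]
    return median(lower), median(values), median(upper)


# 40-yard run times in seconds from Example 3.11
times = [5.4, 6.0, 6.3, 6.8, 7.1, 7.2, 7.4, 7.5, 7.9, 8.2, 8.7]

q1, q2, q3 = quartiles(times)
print(q1, q2, q3)  # 6.3 7.2 7.9
```

Statistical libraries often use interpolation rules instead of this method, so their quartiles for the same data may differ slightly.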
The interquartile range (IQR) is a number that indicates the spread of the middle half, or the middle 50%, of the data. It is the difference between the third quartile, $Q_3$, and the first quartile, $Q_1$:
$$\text{IQR} = Q_3 - Q_1$$
Note that the IQR provides a measure of variability that excludes outliers.
In Example 3.11, the IQR can be calculated as:
$$\text{IQR} = Q_3 - Q_1 = 7.9 - 6.3 = 1.6$$
Quartiles and the IQR can be used to flag possible outliers in a dataset. For example, if most employees at a company earn about $50,000 and the CEO of the company earns $2.5 million, then we consider the CEO’s salary to be an outlier data value because this salary is significantly different from all the other salaries in the dataset. An outlier data value can also be a value much lower than the other data values in a dataset, so if one employee only makes $15,000, then this employee’s low salary might also be considered an outlier.
To detect outliers, you can use the quartiles and the IQR to calculate a lower and an upper bound for outliers. Then any data values below the lower bound or above the upper bound will be flagged as outliers. These data values should be further investigated to determine the nature of the outlier condition and whether the data values are valid or not.
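To calculate the lower and upper bounds for outliers, use the following formulas:
$$\text{Lower bound} = Q_1 - 1.5 \times \text{IQR}$$
$$\text{Upper bound} = Q_3 + 1.5 \times \text{IQR}$$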
These formulas typically use 1.5 as a cutoff value to identify outliers in a dataset.
Example 3.12
Problem
Calculate the IQR for the following 13 home prices and determine whether any of the home price values are potential outliers. Data values are in US dollars.
389950, 230500, 158000, 479000, 639000, 114950, 5500000, 387000, 659000, 529000, 575000, 488800, 1095000
Solution
Order the data from smallest to largest.
114950, 158000, 230500, 387000, 389950, 479000, 488800, 529000, 575000, 639000, 659000, 1095000, 5500000
First, determine the median of the dataset. There are 13 data values, so the median is the middle data value, which is 488,800.
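Next, calculate $Q_1$ and $Q_3$.
For the first quartile, look at the data values below the median. The two middle data values in this lower half of the data are 230,500 and 387,000. To determine the first quartile, find the mean of these two data values.
$$Q_1 = \frac{230{,}500 + 387{,}000}{2} = 308{,}750$$
For the third quartile, look at the data values above the median. The two middle data values in this upper half of the data are 639,000 and 659,000. To determine the third quartile, find the mean of these two data values.
$$Q_3 = \frac{639{,}000 + 659{,}000}{2} = 649{,}000$$
Now, calculate the interquartile range (IQR):
$$\text{IQR} = Q_3 - Q_1 = 649{,}000 - 308{,}750 = 340{,}250$$
Calculate the value of 1.5 times the interquartile range (IQR):
$$1.5 \times \text{IQR} = 1.5 \times 340{,}250 = 510{,}375$$
Calculate the lower and upper bound for outliers:
$$\text{Lower bound} = Q_1 - 1.5 \times \text{IQR} = 308{,}750 - 510{,}375 = -201{,}625$$
$$\text{Upper bound} = Q_3 + 1.5 \times \text{IQR} = 649{,}000 + 510{,}375 = 1{,}159{,}375$$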
The lower bound for outliers is −201,625. Of course, no home price is less than −201,625, so no outliers are present for the lower end of the dataset.
The upper bound for outliers is 1,159,375. The data value of 5,500,000 is greater than the upper bound of 1,159,375. Therefore, the home price of $5,500,000 is a potential outlier. This is important because the presence of outliers could potentially indicate data errors or some other anomalies in the dataset that should be investigated. For example, there may have been a data entry error and a home price of $550,000 was erroneously entered as $5,500,000.
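The steps of Example 3.12 can be reproduced with a short Python sketch. It assumes the same median-of-halves quartiles used in this section and the usual cutoff of 1.5; the function name `iqr_outlier_bounds` and its parameter `k` are hypothetical names chosen for illustration.

```python
from statistics import median


def iqr_outlier_bounds(data, k=1.5):
    """Return (lower_bound, upper_bound) based on Q1, Q3, and the IQR."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]
    # Skip the median itself when the number of values is odd
    upper_half = values[half + 1:] if n % 2 else values[half:]
    q1, q3 = median(lower_half), median(upper_half)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Home prices in US dollars from Example 3.12
prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

lower, upper = iqr_outlier_bounds(prices)
outliers = [p for p in prices if p < lower or p > upper]

print(lower, upper)  # -201625.0 1159375.0
print(outliers)      # [5500000]
```

Flagged values are only candidates for review; whether they are errors or genuine extreme values still has to be checked against the source of the data.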
z-Scores
The z-score is a measure of the position of an entry in a dataset that makes use of the mean and standard deviation of the data. It represents the number of standard deviations by which a data value differs from the mean. For example, suppose that in a certain neighborhood, the mean selling price of a home is $350,000 and the standard deviation is $40,000. A particular home sells for $270,000. Based on the selling price of this home, we can calculate the relative standing of this home compared to other home sales in the same neighborhood.
The corresponding z-score of a measurement considers the given measurement in relation to the mean and standard deviation for the entire population. The formula for a z-score calculation is as follows:
$$z = \frac{x - \mu}{\sigma}$$
Where:
$x$ is the measurement
$\mu$ is the mean
$\sigma$ is the standard deviation
Notice that when a measurement is below the mean, the corresponding z-score will be a negative value. If the measurement is exactly equal to the mean, the corresponding z-score will be zero. If the measurement is above the mean, the corresponding z-score will be a positive value.
z-scores can also be used to identify outliers. Since z-scores measure the number of standard deviations from the mean for a data value, a z-score of 3 would indicate a data value that is 3 standard deviations above the mean. This would represent a data value that is significantly displaced from the mean, and typically, a z-score less than −3 or a z-score greater than +3 can be used to flag outliers.
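As an illustration of the ±3 rule, the sketch below computes a z-score for every value in the home price data from Example 3.12 and flags those beyond the threshold. It assumes the population standard deviation (`statistics.pstdev`) is appropriate; the function names `z_scores` and `flag_outliers` are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(data):
    """Return the z-score of each value, using the population standard deviation."""
    mu = mean(data)
    sigma = pstdev(data)
    return [(x - mu) / sigma for x in data]


def flag_outliers(data, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]


# Home prices in US dollars from Example 3.12
prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

flagged = flag_outliers(prices)
print(flagged)  # [5500000]
```

Because an extreme value also inflates the mean and the standard deviation, this rule can miss outliers in small datasets, which is one reason the IQR-based bounds are often used alongside it.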
Example 3.13
Problem
For the home sale described at the start of this subsection, the value is the home price of $270,000, the mean is $350,000, and the standard deviation is $40,000. Calculate the z-score.
Solution
The z-score can be calculated as follows:
$$z = \frac{x - \mu}{\sigma} = \frac{270{,}000 - 350{,}000}{40{,}000} = -2$$
This z-score of −2 indicates that the selling price for this home is 2 standard deviations below the mean, which represents a data value that is significantly below the mean.
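The same arithmetic can be checked in Python. The function name `z_score` and its parameter names are hypothetical and chosen only to mirror the terms of the formula.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations by which x differs from the mean."""
    return (x - mean) / std_dev


# Values from Example 3.13: price, mean price, standard deviation
z = z_score(270_000, 350_000, 40_000)
print(z)  # -2.0
```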
Using Python to Calculate Measures of Position for a Dataset
DataFrame.describe() also computes several measures of position for each column of a dataset. See min, 25%, 50%, 75%, and max in Figure 3.4.
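A minimal sketch of this call, assuming the pandas library is installed, is shown below using the run times from Example 3.11. The column name `time_seconds` is a hypothetical label chosen for illustration.

```python
import pandas as pd

# 40-yard run times in seconds from Example 3.11
df = pd.DataFrame({
    "time_seconds": [5.4, 6.0, 6.3, 6.8, 7.1, 7.2, 7.4, 7.5, 7.9, 8.2, 8.7]
})

# describe() reports count, mean, std, min, 25%, 50%, 75%, and max
summary = df.describe()
print(summary)

# Individual measures of position can be read from the summary table
q1 = summary.loc["25%", "time_seconds"]
q2 = summary.loc["50%", "time_seconds"]
q3 = summary.loc["75%", "time_seconds"]
```

By default, pandas interpolates linearly between neighboring values, so the 25% and 75% rows may not match quartiles found with the median-of-halves method. For this dataset, it reports about 6.55 and 7.70 rather than 6.3 and 7.9, while the median of 7.2 is the same.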