Principles of Data Science

3.3 Measures of Position

Learning Outcomes

By the end of this section, you should be able to:

  • 3.3.1 Define and calculate percentiles, quartiles, and z-scores for a dataset.
  • 3.3.2 Use Python to calculate measures of position for a dataset.

Common measures of position include percentiles and quartiles as well as z-scores, all of which are used to indicate the relative location of a particular data point.

Percentiles

If a student scores 47 on a biology exam, it is difficult to know if the student did well or poorly compared to the population of all other students taking the exam. Percentiles provide a way to assess and compare the distribution of values and the position of a specific data point in relation to the entire dataset by indicating the percentage of data points that fall below it. Specifically, a percentile is a value on a scale of one hundred that indicates the percentage of a distribution that is equal to or below it. Let’s say the student learns they scored in the 90th percentile on the biology exam. This percentile indicates that the student has an exam score higher than 90% of all other students taking the test. This is the same as saying that the student’s score places the student in the top 10% of all students taking the biology test. Thus, this student scoring in the 90th percentile did very well on the exam, even if the actual score was 47.

To calculate percentiles, the data must be ordered from smallest to largest and then the ordered data divided into hundredths. If you score in the 80th percentile on an aptitude test, that does not necessarily mean that you scored 80% on the test. It means that 80% of the test scores are the same as or less than your score and the remaining 20% of the scores are the same as or greater than your score.

Percentiles are useful for comparing many types of values. For example, a stock market mutual fund might report that the performance for the fund over the past year was in the 80th percentile of all mutual funds in the peer group. This indicates that the fund performed better than 80% of all other funds in the peer group. This also indicates that 20% of the funds performed better than this particular fund.

To calculate percentiles for a specific data value in a dataset, first order the dataset from smallest to largest and count the number of data values in the dataset. Locate the measurement of interest and count how many data values fall below the measurement. Then the percentile for the measurement is calculated as follows:

\[
\text{Percentile} = \frac{\text{number of data values below the measurement}}{\text{total number of data values}} \times 100\% = \frac{n}{N} \times 100\%
\]

Example 3.10

Problem

The following ordered dataset represents the scores of 15 employees on an aptitude test:

51, 63, 65, 68, 71, 75, 75, 77, 79, 82, 88, 89, 89, 92, 95

Determine the percentile for the employee who scored 88 on the aptitude test.
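
The percentile formula can be applied to these scores directly. The following is a minimal sketch in plain Python, not the textbook's own solution; the function name `percentile_of` is hypothetical, and it counts only values strictly below the measurement, as the formula states.

```python
def percentile_of(data, measurement):
    """Percentile = (number of values strictly below the measurement) / (total values) * 100."""
    n_below = sum(1 for value in data if value < measurement)
    return n_below / len(data) * 100


scores = [51, 63, 65, 68, 71, 75, 75, 77, 79, 82, 88, 89, 89, 92, 95]

# 10 of the 15 scores fall below 88, so the result is 10 / 15 * 100.
result = percentile_of(scores, 88)
print(f"{result:.1f}")  # prints 66.7
```

Under this definition, a score of 88 sits at roughly the 67th percentile of the 15 scores.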

Quartiles

While percentiles separate data into 100 equal parts, quartiles separate data into quarters, or four equal parts. To find the quartiles, first find the median, or second quartile. The first quartile, $Q_1$, is the middle value, or median, of the lower half of the data, and the third quartile, $Q_3$, is the middle value of the upper half of the data.

Note the following correspondence between quartiles and percentiles:

  • The first quartile corresponds to the 25th percentile.
  • The second quartile (which is the median) corresponds to the 50th percentile.
  • The third quartile corresponds to the 75th percentile.

Example 3.11

Problem

Consider the following ordered dataset, which represents the time in seconds for an athlete to complete a 40-yard run:

5.4, 6.0, 6.3, 6.8, 7.1, 7.2, 7.4, 7.5, 7.9, 8.2, 8.7

The interquartile range (IQR) is a number that indicates the spread of the middle half, or the middle 50%, of the data. It is the difference between the third quartile, $Q_3$, and the first quartile, $Q_1$.

\[
\text{IQR} = Q_3 - Q_1
\]

Note that the IQR is a measure of variability that is resistant to outliers, because it depends only on the middle 50% of the data.

In Example 3.11, the IQR can be calculated as:

\[
\text{IQR} = Q_3 - Q_1 = 7.9 - 6.3 = 1.6
\]
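
The quartile and IQR values for Example 3.11 can be checked with a short sketch. It assumes the median-of-halves method described earlier, with the median itself left out of both halves when the number of values is odd; the function name `quartiles` is hypothetical. Library routines that interpolate between data points (such as the defaults in NumPy or pandas) may return slightly different quartiles for the same data.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values before the median position
    upper = ordered[half + n % 2:]  # values after it; skips the median when n is odd
    return median(lower), median(ordered), median(upper)


times = [5.4, 6.0, 6.3, 6.8, 7.1, 7.2, 7.4, 7.5, 7.9, 8.2, 8.7]
q1, q2, q3 = quartiles(times)
iqr = q3 - q1

# Round the IQR for display to hide floating-point noise in the subtraction.
print(q1, q2, q3, round(iqr, 1))  # prints 6.3 7.2 7.9 1.6
```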

Quartiles and the IQR can be used to flag possible outliers in a dataset. For example, if most employees at a company earn about $50,000 and the CEO of the company earns $2.5 million, then we consider the CEO’s salary to be an outlier data value because this salary is significantly different from all the other salaries in the dataset. An outlier data value can also be a value much lower than the other data values in a dataset, so if one employee only makes $15,000, then this employee’s low salary might also be considered an outlier.

To detect outliers, you can use the quartiles and the IQR to calculate a lower and an upper bound for outliers. Then any data values below the lower bound or above the upper bound will be flagged as outliers. These data values should be further investigated to determine the nature of the outlier condition and whether the data values are valid or not.

To calculate the lower and upper bounds for outliers, use the following formulas:

\[
\begin{aligned}
\text{Lower Bound for Outliers} &= Q_1 - (1.5 \cdot \text{IQR}) \\
\text{Upper Bound for Outliers} &= Q_3 + (1.5 \cdot \text{IQR})
\end{aligned}
\]

These formulas typically use 1.5 as a cutoff value to identify outliers in a dataset.

Example 3.12

Problem

Calculate the IQR for the following 13 home prices and determine whether any of the prices are potential outliers. Data values are in US dollars.

389950, 230500, 158000, 479000, 639000, 114950, 5500000, 387000, 659000, 529000, 575000, 488800, 1095000
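
One way to work through this problem is sketched below. It is an illustration rather than the textbook's own solution, and it assumes the median-of-halves method for the quartiles (the median is excluded from both halves because the number of values is odd) together with the 1.5 cutoff from the bound formulas. All variable names are chosen for illustration.

```python
from statistics import median

prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

ordered = sorted(prices)
n = len(ordered)
half = n // 2

# Quartiles as medians of the lower and upper halves of the ordered data.
q1 = median(ordered[:half])
q3 = median(ordered[half + n % 2:])
iqr = q3 - q1

# Bounds for flagging potential outliers, using the usual 1.5 cutoff.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [p for p in ordered if p < lower_bound or p > upper_bound]

print(q1, q3, iqr)               # prints 308750.0 649000.0 340250.0
print(lower_bound, upper_bound)  # prints -201625.0 1159375.0
print(outliers)                  # prints [5500000]
```

Under these assumptions, only the $5,500,000 home is flagged. The lower bound is negative, so no price in this dataset can fall below it.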

z-scores

The z-score is a measure of the position of an entry in a dataset that makes use of the mean and standard deviation of the data. It represents the number of standard deviations by which a data value differs from the mean. For example, suppose that in a certain neighborhood, the mean selling price of a home is $350,000 and the standard deviation is $40,000. A particular home sells for $270,000. Based on the selling price of this home, we can calculate the relative standing of this home compared to other home sales in the same neighborhood.

The corresponding z-score of a measurement considers the given measurement in relation to the mean and standard deviation for the entire population. The formula for a z-score calculation is as follows:

\[
z = \frac{x - \mu}{\sigma}
\]

Where:
$x$ is the measurement
$\mu$ is the mean
$\sigma$ is the standard deviation

Notice that when a measurement is below the mean, the corresponding z-score will be a negative value. If the measurement is exactly equal to the mean, the corresponding z-score will be zero. If the measurement is above the mean, the corresponding z-score will be a positive value.

z-scores can also be used to identify outliers. Since z-scores measure the number of standard deviations from the mean for a data value, a z-score of 3 would indicate a data value that is 3 standard deviations above the mean. This would represent a data value that is significantly displaced from the mean, and typically, a z-score less than −3 or a z-score greater than +3 can be used to flag outliers.
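
The ±3 rule can be sketched for a whole dataset as follows. The data values are made up for illustration, the function name `flag_outliers_by_z` is hypothetical, and the sketch assumes the data are treated as an entire population, so it uses the population standard deviation (`statistics.pstdev`).

```python
from statistics import mean, pstdev


def flag_outliers_by_z(data, cutoff=3.0):
    """Return the values whose z-score is below -cutoff or above +cutoff."""
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation (an assumption of this sketch)
    return [x for x in data if abs((x - mu) / sigma) > cutoff]


# Hypothetical data: nineteen identical values and one extreme value.
values = [10] * 19 + [100]
flagged = flag_outliers_by_z(values)
print(flagged)  # prints [100]
```

Here the mean is 14.5 and the standard deviation is about 19.6, so the value 100 has a z-score of about 4.4 and is flagged, while each value of 10 has a z-score of about −0.2.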

Example 3.13

Problem

For the home sale described earlier in this section, the $x$ value is the home price of $270,000, the mean $\mu$ is $350,000, and the standard deviation $\sigma$ is $40,000. Calculate the z-score.
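
The z-score formula translates into a one-line function. This is a minimal sketch with a hypothetical function name, `z_score`, applied to the values given in the problem.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations by which x differs from the mean mu."""
    return (x - mu) / sigma


z = z_score(270_000, 350_000, 40_000)
print(z)  # prints -2.0
```

A result of −2.0 means the selling price is 2 standard deviations below the neighborhood mean, which falls inside the ±3 range typically used to flag outliers.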

Using Python to Calculate Measures of Position for a Dataset

DataFrame.describe() also computes several measures of position for each numeric column of a dataset. See the min, 25%, 50%, 75%, and max rows in Figure 3.4.

A data table summarizing statistics about 966 items in the “movie profit” dataset. The minimum, maximum, and quartile rows are highlighted. The minimum rating is 3.3, the minimum duration is 69, the minimum US gross earnings are $.01 million, and the minimum worldwide gross earnings are $176.6 million. The maximum rating is about 9.2, the maximum duration is 238, the maximum US gross earnings are about $936.7 million, and the maximum worldwide gross earnings are about $2.847 billion. For the 25th percentile, the rating is about 6.2, the duration is about 101.3, the US gross earnings are about $90.8 million, and the worldwide gross earnings are about $223.3 million. For the 50th percentile (median), the rating is about 6.8, the duration is about 116, the US gross earnings are about $129.2 million, and the worldwide gross earnings are about $309.3 million. For the 75th percentile, the rating is about 7.4, the duration is about 130, the US gross earnings are about $187.1 million, and the worldwide gross earnings are about $472.6 million.
Figure 3.4 The Output of DataFrame.describe() with the Movie Profit Dataset
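
The call itself can be sketched on a small, made-up DataFrame. The column names and values below are hypothetical and are not taken from the movie profit dataset; the sketch assumes pandas is installed.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "rating": [6.0, 6.5, 7.0, 7.5, 8.0],
    "duration": [90, 100, 110, 120, 130],
})

summary = df.describe()

# Keep only the rows that are measures of position.
print(summary.loc[["min", "25%", "50%", "75%", "max"]])
```

By default, pandas interpolates linearly between data points when computing the 25%, 50%, and 75% rows, so its quartiles can differ slightly from those found by taking the medians of the lower and upper halves of the data.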
Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
Citation information

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.