Principles of Data Science

3.1 Measures of Center


Learning Outcomes

By the end of this section, you should be able to:

  • 3.1.1 Define and calculate mean, trimmed mean, median, and mode for a dataset.
  • 3.1.2 Determine the effect of outliers on the mean and the median.
  • 3.1.3 Use Python to calculate measures of center for a dataset.

Measures of center are statistical measurements that provide a central, or typical, representation of a dataset. These measures can help indicate where the bulk of the data is concentrated and are often called the data’s central tendency. The most widely used measures of the center of a dataset are the mean (average), the median, and the mode.

Mean and Trimmed Mean

The mean, or average (sometimes referred to as the arithmetic mean), is the most commonly used measure of the center of a dataset. The mean can be skewed by the presence of outliers, which are data values that differ significantly from the remainder of the dataset. In these instances, the trimmed mean is often used to provide a more representative measure of the center of the dataset, as we will discuss later in this section.

Mean

To calculate the mean, add the values of all the items in a dataset and divide by the number of items. For example, if the scores on your last three exams were 87, 92, and 73, then the mean score would be \(\frac{87+92+73}{3}=84\). If you had a large number of data values, you would proceed in the same way. For example, to calculate the mean value of 50 exam scores, add the 50 scores together and divide by 50. If the 50 scores add up to 4,050, for example, the mean score is \(\frac{4050}{50}\), or 81.

In data science applications, you will encounter two types of datasets: sample data and population data. Population data represents all the outcomes or measurements that are of interest. Sample data represents outcomes or measurements collected from a subset, or part, of the population of interest. In many applications, collecting data from an entire population is not practical or feasible, so we often rely on sample data.

The notation \(\bar{x}\) is used to indicate the sample mean, where the mean is calculated based on data taken from a sample. The notation \(\sum x\) is used to denote the sum of the data values, and \(n\) is used to indicate the number of data values in the sample, also known as the sample size.

The sample mean can be calculated using the following formula:

\[\bar{x}=\frac{\sum x}{n}\]

The notation \(\mu\) is used to indicate the population mean, where the mean is calculated based on data taken from the entire population, and \(N\) is used to indicate the number of data values in the population, also known as the population size. The population mean can be calculated using the following formula:

\[\mu=\frac{\sum x}{N}\]

The mean can also be calculated from a frequency distribution. For every unique data value in the dataset, the frequency distribution gives the number of times, or frequency, that this unique value appears in the dataset. In this situation, the mean can be calculated by multiplying each distinct value by its frequency, summing these products, and then dividing this sum by the total number of data values. Here is the corresponding formula for the sample mean using the frequency distribution:

\[\bar{x}=\frac{\sum (x \cdot f)}{n}\]

When all the values in the dataset are unique, this reduces to the previous formula given for the sample mean.

Example 3.1

Problem

During a clinical trial, a sample is taken of 10 patients and pulse rates are measured in beats per minute:

68, 92, 76, 51, 65, 83, 94, 72, 88, 59

Calculate the mean pulse rate for this sample.
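The calculation asked for in Example 3.1 can be sketched in Python as follows. This is a minimal illustration, not the textbook's own solution: the variable names are chosen here for readability, and statistics.mean comes from Python's standard library.

```python
import statistics

# Pulse rates (beats per minute) from Example 3.1
pulse_rates = [68, 92, 76, 51, 65, 83, 94, 72, 88, 59]

# Sample mean: sum of the data values divided by the sample size n
n = len(pulse_rates)
mean_by_formula = sum(pulse_rates) / n

# The standard-library function should agree with the formula
mean_by_library = statistics.mean(pulse_rates)

print(f"n = {n}, sum = {sum(pulse_rates)}")                   # n = 10, sum = 748
print(f"Mean pulse rate: {mean_by_formula} beats per minute")  # 74.8
```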

Example 3.2

Problem

A college professor records the ages of 25 students in a data science class as shown:

Student Age    Number of Students (Frequency)
19             3
20             4
21             8
22             6
23             2
27             1
31             1
Total          25

Calculate the mean age for this sample of students.
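One way to apply the frequency-distribution formula to Example 3.2 is sketched below. Storing the table as a dictionary that maps each age to its frequency is an assumption made for this illustration; any structure that pairs values with counts would work.

```python
# Frequency distribution from Example 3.2: age -> number of students
age_frequencies = {19: 3, 20: 4, 21: 8, 22: 6, 23: 2, 27: 1, 31: 1}

# n is the total number of data values (the sum of the frequencies)
n = sum(age_frequencies.values())

# Multiply each distinct value by its frequency and add the products
weighted_sum = sum(age * freq for age, freq in age_frequencies.items())

mean_age = weighted_sum / n

print(f"n = {n}, sum of x*f = {weighted_sum}")  # n = 25, sum of x*f = 541
print(f"Mean age: {mean_age}")                  # 21.64
```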

Trimmed Mean

A trimmed mean helps mitigate the effects of outliers, which are data values that are significantly different from most of the other data values in the dataset. In the dataset given in Example 3.1, a pulse rate of 35 or 120 would be considered outlier data values since these pulse rates are significantly different as compared to the rest of the values in Example 3.1. We will see that there is a formal method for determining outliers, and in fact there are several methods to identify outliers in a dataset.

Outliers tend to pull the mean disproportionately toward themselves, which can make the mean a misleading measure of center.

To calculate the trimmed mean for a dataset, first sort the data in ascending order (from smallest to largest). Then decide on the percentage of data values to delete from each end of the dataset. The choice might reflect the extent of outliers in the dataset; trimmed mean percentages of 10% and 20% are common. Next, delete that percentage of data values from both the lower end and the upper end of the dataset. Finally, find the mean of the remaining data values.

As an example, to calculate a 10% trimmed mean, first sort the data values from smallest to largest. Then delete the lower 10% of the data values and delete the upper 10% of the data values. Then calculate the mean for the resulting dataset. Any outliers would tend to be deleted as part of the trimmed mean calculation, and thus the trimmed mean would then be a more representative measure of the center of the data for datasets containing outliers.

Example 3.3

Problem

A real estate agent collects data on a sample of recently sold homes in a certain neighborhood, and the data are shown in the following dataset:

397900, 452600, 507400, 488300, 623400, 573200, 1689300, 403890, 612300, 599000, 2345800, 499000, 525000, 675000, 385000

  1. Calculate the mean of the dataset.
  2. Calculate a 20% trimmed mean for the dataset.
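The trimmed-mean procedure described above can be sketched for Example 3.3 as follows. The helper trimmed_mean is a hypothetical function written for this illustration, not a library call. It assumes that the number of values to drop from each end is rounded down when the percentage does not give a whole number; other conventions exist.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end."""
    if not 0 <= percent < 50:
        raise ValueError("percent must be at least 0 and less than 50")
    ordered = sorted(values)
    # Number of values to delete from each end (assumption: round down)
    k = int(len(ordered) * percent / 100)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Home sale prices from Example 3.3
home_prices = [397900, 452600, 507400, 488300, 623400, 573200, 1689300,
               403890, 612300, 599000, 2345800, 499000, 525000, 675000, 385000]

home_mean = sum(home_prices) / len(home_prices)
home_trimmed = trimmed_mean(home_prices, 20)  # drops 3 values from each end

print(f"Mean:             {home_mean:,.2f}")     # 718,472.67
print(f"20% trimmed mean: {home_trimmed:,.2f}")  # 542,244.44
```

With 15 values, a 20% trim removes the three lowest and three highest prices, including the two sales above $1.6 million, so the trimmed mean sits much closer to the bulk of the data than the ordinary mean does.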

Median

The median provides another measure of central tendency for a dataset. The median is generally a better measure of the central tendency when there are outliers (extreme values) in the dataset. Since the median focuses on the middle value of the ordered dataset, the median is preferred when outliers are present because the median is not affected by the numerical values of the outliers.

To determine the median of a dataset, first order the data from smallest to largest, and then find the middle value in the ordered dataset. For example, to find the median of 50 exam scores, find the value that splits the ordered data into two equal parts: 25 scores will lie below the median, and 25 scores will lie above it.

If there is an odd number of data values in the dataset, then there will be one data value that represents the middle value, and this is the median. If there is an even number of data values in the dataset, then to find the median, add the two middle values together and divide by 2 (this is essentially finding the mean of the two middle values in the dataset).

Example 3.4

Problem

The same dataset of pulse rates from Example 3.1 is:

68, 92, 76, 51, 65, 83, 94, 72, 88, 59

Calculate the median pulse rate for this sample.
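A short Python sketch of Example 3.4 is shown below, using statistics.median from the standard library. The variable names are illustrative only.

```python
import statistics

# Pulse rates (beats per minute) from Example 3.1
pulse_rates = [68, 92, 76, 51, 65, 83, 94, 72, 88, 59]

# Step 1: order the data from smallest to largest
ordered = sorted(pulse_rates)
print(ordered)  # [51, 59, 65, 68, 72, 76, 83, 88, 92, 94]

# Step 2: with an even number of values, the median is the mean of the
# two middle values (72 and 76 here); statistics.median handles this case
median_pulse = statistics.median(pulse_rates)
print(f"Median pulse rate: {median_pulse} beats per minute")  # 74.0
```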

You can also quickly find the sample median of a dataset as follows.

Let \(n\) represent the number of data values in the sample.

  • If \(n\) is odd, then the median is the data value in position \(\frac{n+1}{2}\).
  • If \(n\) is even, the median is the mean of the observations in position \(\frac{n}{2}\) and position \(\frac{n}{2}+1\).

For example, let’s say a dataset has 25 data values. Since \(n\) is odd, to identify the position of the median, calculate \(\frac{n+1}{2}\), which is \(\frac{25+1}{2}\), or 13. This indicates that the median is located in the 13th data position.

As another example, let’s say a dataset has 100 data values. Since \(n\) is even, to identify the position of the median, calculate \(\frac{n}{2}\), which is \(\frac{100}{2}\), or 50, and also calculate \(\frac{n}{2}+1\), which is \(50+1\), or 51. This indicates that the median is calculated as the mean of the 50th and 51st data values.
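The position rule above can be written out directly. The function median_by_position is a hypothetical helper for this illustration. Note that the positions in the text count from 1, while Python lists are indexed from 0, so each position is reduced by 1 when indexing.

```python
def median_by_position(values):
    """Median found with the (n + 1)/2 and n/2, n/2 + 1 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2        # 1-based position of the middle value
        return ordered[position - 1]   # convert to a 0-based index
    lower = ordered[n // 2 - 1]        # value in 1-based position n/2
    upper = ordered[n // 2]            # value in 1-based position n/2 + 1
    return (lower + upper) / 2


# 25 values: the median is the 13th value of the ordered data
print(median_by_position(range(1, 26)))   # 13

# 100 values: the median is the mean of the 50th and 51st values
print(median_by_position(range(1, 101)))  # 50.5
```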

Mode

Another measure of center is the mode. The mode is the data value that occurs with the greatest frequency. If there are no repeating data values in a dataset, then there is no mode. If two data values occur with the same greatest frequency, then there are two modes, and we say the data is bimodal. For example, assume that the weekly closing price for a technology stock, in dollars, is recorded for 20 consecutive weeks as follows:

50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72, 76, 78, 81, 83, 84, 84, 84, 90, 93

To find the mode, determine the most frequent value. The price 72 occurs five times, more often than any other value, so the mode of this dataset is 72.
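The definition of the mode given above, including the "no mode" and bimodal cases, can be sketched with collections.Counter from the standard library. The function modes is a hypothetical helper written for this illustration. It follows this section's convention that a dataset with no repeated values has no mode, which differs from the standard library's statistics.mode, a function that still returns a value in that case in recent Python versions.

```python
from collections import Counter


def modes(values):
    """Return all modes; an empty list means no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # no repeating values, so no mode by the text's definition
    return [value for value, count in counts.items() if count == highest]


# Weekly closing prices (dollars) for 20 consecutive weeks
closing_prices = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72,
                  72, 76, 78, 81, 83, 84, 84, 84, 90, 93]

print(modes(closing_prices))   # [72]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  (bimodal)
print(modes([1, 2, 3]))        # []      (no mode)
```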

The mode can also be applied to non-numeric (qualitative) data, whereas the mean and the median can only be applied to numeric (quantitative) data. For example, a restaurant manager might want to determine the mode for responses to customer surveys on the quality of the service of a restaurant, as shown in Table 3.1.

Customer Service Rating    Number of Respondents
Excellent                  267
Very Good                  410
Good                       392
Fair                       107
Poor                       18
Table 3.1 Customer Survey Results for Customer Service Rating

Based on the survey responses, the mode is the Customer Service Rating of “Very Good,” since this is the data value with the greatest frequency.
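Because the survey data already arrive as counts per category, the mode is simply the category with the largest count. A minimal sketch, assuming the table is stored as a dictionary (the name survey_counts is illustrative):

```python
# Table 3.1: customer service rating -> number of respondents
survey_counts = {
    "Excellent": 267,
    "Very Good": 410,
    "Good": 392,
    "Fair": 107,
    "Poor": 18,
}

# The mode is the rating with the greatest frequency
modal_rating = max(survey_counts, key=survey_counts.get)
total_respondents = sum(survey_counts.values())

print(f"Mode: {modal_rating} "
      f"({survey_counts[modal_rating]} of {total_respondents} respondents)")
# Mode: Very Good (410 of 1194 respondents)
```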

Influence of Outliers on Measures of Center

As mentioned earlier, when outliers are present in a dataset, the mean may not represent the center of the dataset, and the median will provide a better measure of center. The reason is that the median depends only on the middle value (or the two middle values) of the ordered dataset. Thus, the magnitude of any outliers at the lower or upper end of the dataset will not affect the median. Note: A formal method for identifying outliers is presented in Measures of Position. The following example illustrates the point that the median is a better measure of central tendency when potential outliers are present.

Example 3.5

Problem

Suppose that in a small company of 40 employees, one person earns a salary of $3 million per year, and the other 39 individuals each earn $40,000. Which is the better measure of center: the mean or the median?
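The comparison in Example 3.5 can be checked numerically. The sketch below builds the salary list described in the problem and computes both measures with the standard-library statistics module; the variable names are illustrative.

```python
import statistics

# One salary of $3 million and 39 salaries of $40,000
salaries = [3_000_000] + [40_000] * 39

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

print(f"Mean salary:   ${mean_salary:,.0f}")    # $114,000
print(f"Median salary: ${median_salary:,.0f}")  # $40,000
```

The mean of $114,000 is nearly three times what 39 of the 40 employees actually earn, while the median of $40,000 matches the typical salary, so the median is the better measure of center here.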

Using Python for Measures of Center

We learned in What Are Data and Data Science? how the DataFrame.describe() method is used to summarize data. Recall that describe() is a method of DataFrame objects, so it should be called on a DataFrame variable (e.g., given a DataFrame d, use d.describe()).

Figure 3.2 shows the output of DataFrame.describe() on the “Movie Profit” dataset we used in What Are Data and Data Science?, movie_profit.csv. The mean and 50% rows show the average and median of each column. For example, the average worldwide gross earnings are $410.14 million, and the median earnings are $309.35 million. Note that the average and median of some columns are less meaningful than those of others. The first column (Unnamed: 0) was simply used as an identifier for each item in the dataset, so the mean and median of this column are not useful. DataFrame.describe() still computes these values because the column is numeric; it does not consider whether a column is meaningful to summarize.

A data table summarizing statistics about 966 items in the “movie profit” dataset, with columns for “unnamed: 0,” “rating,” “duration,” “US gross,” and “worldwide gross.” The mean and 50th percentile rows are highlighted. The mean rating is about 6.8, the mean duration is about 117.5, the mean US gross earnings are about $156.2 million, and the mean worldwide gross earnings are about $410.1 million. The 50th percentile (median) rating is about 6.8, the median duration is about 116, the median US gross earnings are about $129.2 million, and the median worldwide gross earnings are about $309.3 million.
Figure 3.2 The Output of DataFrame.describe() with the Movie Profit Dataset
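Since movie_profit.csv is not reproduced in this section, the sketch below illustrates the same idea on the pulse rates from Example 3.1 instead. It assumes the pandas library is installed, and the column name pulse_rate is made up for this illustration. The mean and 50% rows of the describe() output hold the mean and median.

```python
import pandas as pd

# Pulse rates from Example 3.1, placed in a one-column DataFrame
pulse = pd.DataFrame(
    {"pulse_rate": [68, 92, 76, 51, 65, 83, 94, 72, 88, 59]}
)

# describe() returns a DataFrame indexed by statistic name
summary = pulse.describe()
print(summary)

# Pull out the two measures of center by row label
mean_from_describe = summary.loc["mean", "pulse_rate"]   # 74.8
median_from_describe = summary.loc["50%", "pulse_rate"]  # 74.0

# Individual methods give the same values without the full summary
print(pulse["pulse_rate"].mean(), pulse["pulse_rate"].median())
```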

Exploring Further

Working with Python

See the Python website for more details on using, installing, and working with Python. See this additional documentation for more specific information on the statistics module.

Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution-NonCommercial-ShareAlike License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/principles-data-science/pages/1-introduction
Citation information

© Dec 19, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.