Dr. Shaun V. Ault; Dr. Soohyun Nam Liao; Larry Musolino

Learning Outcomes

By the end of this section, you should be able to:

3.1.1 Define and calculate mean, trimmed mean, median, and mode for a dataset.
3.1.2 Determine the effect of outliers on the mean and the median.
3.1.3 Use Python to calculate measures of center for a dataset.

Measures of center are statistical measurements that provide a central, or typical, representation of a dataset. These measures can help indicate where the bulk of the data is concentrated and are often called the data’s central tendency. The most widely used measures of the center of a dataset are the mean (average), the median, and the mode.

Mean and Trimmed Mean

The mean, or average (sometimes called the arithmetic mean), is the most commonly used measure of the center of a dataset. The mean can be skewed by the presence of outliers, which are data values that differ significantly from the remainder of the dataset. In these instances, the trimmed mean is often used to provide a more representative measure of the center of the dataset, as discussed later in this section.

Mean

To calculate the mean, add the values of all the items in a dataset and divide by the number of items. For example, if the scores on your last three exams were 87, 92, and 73, then the mean score would be $\frac{87 + 92 + 73}{3} = 84$ . If you had a large number of data values, you would proceed in the same way. For example, to calculate the mean value of 50 exam scores, add the 50 scores together and divide by 50. If the 50 scores add up to 4,050, for example, the mean score is $\frac{4050}{50}$ , or 81.

In data science applications, you will encounter two types of datasets: sample data and population data. Population data represents all the outcomes or measurements that are of interest. Sample data represents outcomes or measurements collected from a subset, or part, of the population of interest. In many applications, collecting data from an entire population is not practical or feasible, so we often rely on sample data.

The notation $\bar{x}$ is used to indicate the sample mean, where the mean is calculated based on data taken from a sample. The notation $\sum x$ is used to denote the sum of the data values, and $n$ is used to indicate the number of data values in the sample, also known as the sample size.

The sample mean can be calculated using the following formula:

$\bar{x} = \frac{\sum x}{n}$

The notation $\mu$ is used to indicate the population mean, where the mean is calculated based on data taken from the entire population, and $N$ is used to indicate the number of data values in the population, also known as the population size. The population mean can be calculated using the following formula:

$\mu = \frac{\sum x}{N}$

The mean can also be calculated from a frequency distribution. For every unique data value in the dataset, the frequency distribution gives the number of times, or frequency, that this value appears in the dataset. In this situation, the mean can be calculated by multiplying each distinct value by its frequency, summing these products, and then dividing this sum by the total number of data values. Here is the corresponding formula for the sample mean, where $f$ is the frequency of the data value $x$:

$\bar{x} = \frac{\sum x \cdot f}{n}$

When all the values in the dataset are unique, this reduces to the previous formula given for the sample mean.

Example 3.1

Problem

During a clinical trial, a sample is taken of 10 patients and pulse rates are measured in beats per minute:

68, 92, 76, 51, 65, 83, 94, 72, 88, 59

Calculate the mean pulse rate for this sample.

Solution

Add the 10 data values, and the sum is 748. Divide this sum by the number of data values, which is 10. The result is:

$\bar{x} = \frac{\sum x}{n} = \frac{748}{10} = 74.8$
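The calculation in Example 3.1 can be checked with a few lines of Python. This is a minimal sketch that uses only the standard library; the variable names are chosen here for illustration.

```python
from statistics import mean

# Pulse rates (beats per minute) from Example 3.1
pulse_rates = [68, 92, 76, 51, 65, 83, 94, 72, 88, 59]

# Sample mean by the formula: sum of the data values divided by the sample size n
manual_mean = sum(pulse_rates) / len(pulse_rates)

# The same calculation using the standard library's statistics module
library_mean = mean(pulse_rates)

print(manual_mean)   # 74.8
print(library_mean)  # 74.8
```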

Example 3.2

Problem

A college professor records the ages of 25 students in a data science class as shown:

Student Age	Number of Students (Frequency)
19	3
20	4
21	8
22	6
23	2
27	1
31	1
Total	25

Calculate the mean age for this sample of students.

Solution

Substitute the values from the table in the following formula:

$\bar{x} = \frac{\sum x \cdot f}{n} = \frac{19 \cdot 3 + 20 \cdot 4 + 21 \cdot 8 + 22 \cdot 6 + 23 \cdot 2 + 27 \cdot 1 + 31 \cdot 1}{25} = \frac{541}{25} = 21.64$
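The frequency-distribution formula used in Example 3.2 can be sketched in Python as follows. Storing the table as a dictionary that maps each age to its frequency is an assumption made here for illustration, not a required representation.

```python
# Frequency distribution from Example 3.2: student age -> number of students
age_frequencies = {19: 3, 20: 4, 21: 8, 22: 6, 23: 2, 27: 1, 31: 1}

# n is the total number of data values (the sum of the frequencies)
n = sum(age_frequencies.values())

# Multiply each distinct value by its frequency and add up the products
weighted_sum = sum(age * freq for age, freq in age_frequencies.items())

mean_age = weighted_sum / n
print(n, weighted_sum, mean_age)  # 25 541 21.64
```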

Trimmed Mean

A trimmed mean helps mitigate the effects of outliers, which are data values that are significantly different from most of the other data values in the dataset. In the dataset given in Example 3.1, a pulse rate of 35 or 120 would be considered outlier data values since these pulse rates are significantly different as compared to the rest of the values in Example 3.1. We will see that there is a formal method for determining outliers, and in fact there are several methods to identify outliers in a dataset.

The presence of outlier data values tends to skew the mean disproportionately and can produce a misleading result.

To calculate the trimmed mean for a dataset, first sort the data in ascending order (from smallest to largest). Then decide on a certain percentage of data values to be deleted from the lower and upper ends of the dataset. This might represent the extent of outliers in the dataset; trimmed mean percentages of 10% and 20% are common. Then delete the specified percentage of data values from both the lower end and upper end of the dataset. Then find the mean for the remaining undeleted data values.

As an example, to calculate a 10% trimmed mean, first sort the data values from smallest to largest. Then delete the lower 10% of the data values and delete the upper 10% of the data values. Then calculate the mean for the resulting dataset. Any outliers would tend to be deleted as part of the trimmed mean calculation, and thus the trimmed mean would then be a more representative measure of the center of the data for datasets containing outliers.

Example 3.3

Problem

A real estate agent collects data on a sample of recently sold homes in a certain neighborhood, and the data are shown in the following dataset:

397900, 452600, 507400, 488300, 623400, 573200, 1689300, 403890, 612300, 599000, 2345800, 499000, 525000, 675000, 385000

(a) Calculate the mean of the dataset.
(b) Calculate a 20% trimmed mean for the dataset.

Solution

For the mean, add the 15 data values, and the sum is 10,777,090. Divide this sum by the number of data values, which is 15. The result is:

$\bar{x} = \frac{\sum x}{n} = \frac{10,777,090}{15} \approx 718,472.67$

For the trimmed mean, first order the data from smallest to largest. The sorted dataset is:

385000, 397900, 403890, 452600, 488300, 499000, 507400, 525000, 573200, 599000, 612300, 623400, 675000, 1689300, 2345800

Twenty percent of 15 data values is 3, and this indicates that 3 data values are to be deleted from each of the lower end and upper end of the dataset. The resulting 9 undeleted data values are:

452600, 488300, 499000, 507400, 525000, 573200, 599000, 612300, 623400

Then find the mean for the remaining data values. The sum of these 9 data values is 4,880,200. Divide this sum by the number of data values (9). The result is:

$\bar{x} = \frac{\sum x}{n} = \frac{4,880,200}{9} \approx 542,244.44$

Notice that the mean calculated in Part (a) is considerably larger than the trimmed mean calculated in Part (b). The reason is the presence of two large outlier home prices (1,689,300 and 2,345,800). Once these outliers are removed by the trimmed mean calculation, the resulting trimmed mean is more representative of the typical home price in this neighborhood than the mean.
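The trimmed mean procedure from Example 3.3 can be sketched in Python as shown below. The function name trimmed_mean and its parameters are hypothetical, chosen for this illustration. The sketch also assumes that the percentage is a whole number and that the count of values removed from each end is rounded down when it is not a whole number, a detail the procedure above does not specify.

```python
def trimmed_mean(values, percent):
    """Mean after removing `percent` percent of the values from each end.

    Assumptions for this sketch: `percent` is a whole number from 0 to 49,
    and the number of values removed from each end is rounded down.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= percent < 50:
        raise ValueError("percent must be at least 0 and less than 50")
    ordered = sorted(values)                # sort from smallest to largest
    k = len(ordered) * percent // 100       # values to drop from each end
    kept = ordered[k:len(ordered) - k]      # remaining (undeleted) values
    return sum(kept) / len(kept)


# Home prices from Example 3.3
prices = [397900, 452600, 507400, 488300, 623400, 573200, 1689300, 403890,
          612300, 599000, 2345800, 499000, 525000, 675000, 385000]

plain_mean = sum(prices) / len(prices)
trimmed = trimmed_mean(prices, 20)

print(f"{plain_mean:,.2f}")  # 718,472.67
print(f"{trimmed:,.2f}")     # 542,244.44
```

If the SciPy library is available, scipy.stats.trim_mean offers a comparable calculation; it takes the proportion to cut as a decimal (such as 0.2) rather than a percentage.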

Median

The median provides another measure of central tendency for a dataset. The median is generally a better measure of the central tendency when there are outliers (extreme values) in the dataset. Since the median focuses on the middle value of the ordered dataset, the median is preferred when outliers are present because the median is not affected by the numerical values of the outliers.

To determine the median of a dataset, first order the data from smallest to largest, and then find the middle value in the ordered dataset. For example, to find the median value of 50 exam scores, find the score that splits the data into two equal parts. The exam scores for 25 students will be below the median, and 25 students will have exam scores above the median.

If there is an odd number of data values in the dataset, then there will be one data value that represents the middle value, and this is the median. If there is an even number of data values in the dataset, then to find the median, add the two middle values together and divide by 2 (this is essentially finding the mean of the two middle values in the dataset).

Example 3.4

Problem

The same dataset of pulse rates from Example 3.1 is:

68, 92, 76, 51, 65, 83, 94, 72, 88, 59

Calculate the median pulse rate for this sample.

Solution

First, order the 10 data values from smallest to largest. The sorted dataset is:

51, 59, 65, 68, 72, 76, 83, 88, 92, 94

Since there is an even number of data values, add the two middle values together and divide by 2.
The two middle values are 72 and 76.

$\text{Median} = \frac{72 + 76}{2} = \frac{148}{2} = 74$

You can also quickly find the sample median of a dataset as follows.

Let $n$ represent the number of data values in the sample.

If $n$ is odd, then the median is the data value in position $\frac{n + 1}{2}$ .
If $n$ is even, the median is the mean of the observations in position $\frac{n}{2}$ and position $\frac{n}{2} + 1$ .

For example, let’s say a dataset has 25 data values. Since $n$ is odd, to identify the position of the median, calculate $\frac{n + 1}{2}$ , which is $\frac{25 + 1}{2}$ , or 13. This indicates that the median is located in the 13th data position.

As another example, let’s say a dataset has 100 data values. Since $n$ is even, to identify the position of the median, calculate $\frac{n}{2}$ , which is $\frac{100}{2}$ , which is 50, and also calculate $\frac{n}{2} + 1$ , which is $50 + 1$ , which is 51. This indicates that the median is calculated as the mean of the 50th and 51st data values.
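The position rule for the median translates directly into Python. The following is a minimal sketch; the function name median_by_position is hypothetical. Note that the positions in the rule count from 1, while Python list indexes count from 0, so each position is reduced by 1 in the code.

```python
from statistics import median


def median_by_position(values):
    """Median found with the position rule for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the value in position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # Even n: the mean of the values in positions n / 2 and n / 2 + 1
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


# Pulse rates from Example 3.4 (n = 10, so positions 5 and 6 are averaged)
pulse_rates = [68, 92, 76, 51, 65, 83, 94, 72, 88, 59]

print(median_by_position(pulse_rates))  # 74.0
print(median(pulse_rates))              # 74.0, standard library for comparison
```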

Mode

Another measure of center is the mode. The mode is the data value that occurs with the greatest frequency. If there are no repeating data values in a dataset, then there is no mode. If two data values occur with the same greatest frequency, then there are two modes, and we say the data is bimodal. For example, assume that the weekly closing stock price for a technology stock, in dollars, is recorded for 20 consecutive weeks as follows:

50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72, 76, 78, 81, 83, 84, 84, 84, 90, 93

To find the mode, determine the most frequent value, which is 72; it occurs five times. Thus, the mode of this dataset is 72.
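The definition of the mode, including the no-mode and bimodal cases, can be sketched in Python by counting how often each value occurs. The function name modes is hypothetical. It returns a list so that it can report zero, one, or several modes, following the convention in this section that a dataset with no repeated values has no mode.

```python
from collections import Counter


def modes(values):
    """Return a list of the modes; an empty list means there is no mode."""
    counts = Counter(values)          # value -> number of occurrences
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []                     # no repeating values, so no mode
    return [value for value, count in counts.items() if count == highest]


# Weekly closing stock prices from the example above
prices = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72, 76, 78, 81, 83,
          84, 84, 84, 90, 93]

print(modes(prices))           # [72]
print(modes([1, 1, 2, 2, 3]))  # [1, 2], a bimodal dataset
print(modes([1, 2, 3]))        # [], no mode
```

The standard library's statistics.mode and statistics.multimode functions are alternatives, but they handle a dataset with no repeated values differently from the convention used here, which is why this sketch counts the values itself.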

The mode can also be applied to non-numeric (qualitative) data, whereas the mean and the median can only be applied for numeric (quantitative) data. For example, a restaurant manager might want to determine the mode for responses to customer surveys on the quality of the service of a restaurant, as shown in Table 3.1.

Customer Service Rating	Number of Respondents
Excellent	267
Very Good	410
Good	392
Fair	107
Poor	18

Table 3.1 Customer Survey Results for Customer Service Rating

Based on the survey responses, the mode is the Customer Service Rating of “Very Good,” since this is the data value with the greatest frequency.
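For qualitative data that is already summarized in a frequency table, such as Table 3.1, the mode is simply the category with the largest count. A minimal sketch, assuming the table is stored as a dictionary (the variable names are chosen for illustration):

```python
# Table 3.1: customer service rating -> number of respondents
rating_counts = {
    "Excellent": 267,
    "Very Good": 410,
    "Good": 392,
    "Fair": 107,
    "Poor": 18,
}

# The mode is the category (key) whose frequency (value) is largest
mode_rating = max(rating_counts, key=rating_counts.get)
print(mode_rating)  # Very Good
```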

Influence of Outliers on Measures of Center

As mentioned earlier, when outliers are present in a dataset, the mean may not represent the center of the dataset, and the median will provide a better measure of center. The reason is that the median focuses on the middle value of the ordered dataset. Thus, any outliers at the lower end or the upper end of the dataset will not affect the median. Note: A formal method for identifying outliers is presented in Measures of Position. The following example illustrates the point that the median is a better measure of central tendency when potential outliers are present.

Example 3.5

Problem

Suppose that in a small company of 40 employees, one person earns a salary of $3 million per year, and the other 39 individuals each earn $40,000. Which is the better measure of center: the mean or the median?

Solution

The mean, in dollars, would be arrived at mathematically as follows:

$\bar{x} = \frac{3,000,000 + 39(40,000)}{40} = 114,000$

However, the median would be $40,000 since this is the middle data value in the ordered dataset. There are 39 people who earn $40,000 and one person who earns $3,000,000.

Notice that the mean is not representative of the typical value in the dataset since $114,000 is not reflective of the average salary for most employees (who are earning $40,000). The median is a much better measure of the “average” than the mean in this case because 39 of the values are $40,000 and one is $3,000,000. The data value of $3,000,000 is an outlier. The median result of $40,000 gives us a better sense of the center of the dataset.
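The comparison in Example 3.5 can be reproduced in Python with the standard library's statistics module. This sketch builds the salary list from the description in the problem; the variable names are chosen for illustration.

```python
from statistics import mean, median

# One salary of $3,000,000 and 39 salaries of $40,000 (Example 3.5)
salaries = [3_000_000] + [40_000] * 39

mean_salary = mean(salaries)
median_salary = median(salaries)

print(f"{mean_salary:,.0f}")    # 114,000
print(f"{median_salary:,.0f}")  # 40,000
```

The single outlier pulls the mean far above what 39 of the 40 employees earn, while the median stays at the typical salary.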

Using Python for Measures of Center

We learned in What Are Data and Data Science? how the DataFrame.describe() method is used to summarize data. Recall that describe() is a method of DataFrame objects, so it must be called on a DataFrame variable (e.g., given a DataFrame d, use d.describe()).

Figure 3.2 shows the output of DataFrame.describe() on the “Movie Profit” dataset we used in What Are Data and Data Science?, movie_profit.csv. The mean and 50% rows show the average and median of each column. For example, the average worldwide gross earnings are $410.14 million, and the median earnings are $309.35 million. Note that the average and median of some columns are less meaningful than those of others. The first column, Unnamed: 0, simply serves as an identifier for each item in the dataset, so its average and median are not useful. DataFrame.describe() still computes these values because it summarizes every numeric column, whether or not the result is meaningful.

A data table summarizing statistics about 966 items in the “movie profit” dataset, with columns for “unnamed: 0,” “rating,” “duration,” “US gross,” and “worldwide gross.” The mean and 50th percentile rows are highlighted. The mean rating is about 6.8, the mean duration is about 117.5, the mean US gross earnings are about $156.2 million, and the mean worldwide gross earnings are about $410.1 million. The 50th percentile (median) rating is about 6.8, the median duration is about 116, the median US gross earnings are about $129.2 million, and the median worldwide gross earnings are about $309.3 million.

Figure 3.2 The Output of DataFrame.describe() with the Movie Profit Dataset
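The following sketch shows how the mean and 50% rows can be read from the output of describe(), and how the same values can be computed for a single column. It assumes the pandas library is installed. The DataFrame below is a small hypothetical one built for illustration; its column names and values are not taken from movie_profit.csv.

```python
import pandas as pd

# Hypothetical data for illustration only (not from movie_profit.csv)
d = pd.DataFrame({
    "Rating": [6.5, 7.0, 7.5, 8.0],
    "Duration": [100, 110, 120, 150],
})

# describe() returns a DataFrame; select only the mean and median (50%) rows
summary = d.describe()
print(summary.loc[["mean", "50%"]])

# The same measures of center for one column
print(d["Duration"].mean())    # 120.0
print(d["Duration"].median())  # 115.0
```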

Exploring Further

Working with Python

See the Python website for more details on installing, using, and working with Python. See the Python documentation for more specific information on the statistics module.
