Contemporary Mathematics

# 8.2 Visualizing Data

Figure 8.4 Data visualizations can help people quickly understand important features of a dataset. (credit: "Group of diverse people having a business meeting" by Rawpixel Ltd/Flickr, CC BY 2.0)

## Learning Objectives

After completing this section, you should be able to:

1. Create charts and graphs to appropriately represent data.
2. Interpret visual representations of data.
3. Determine misleading components in data displayed visually.

Summarizing raw data is the first step we must take when we want to communicate the results of a study or experiment to a broad audience. However, even organized data can be difficult to read; for example, if a frequency table is large, it can be tough to compare the first row to the last row. As the old saying goes: a picture is worth a thousand words (or, in this case, summary statistics)! Just as our techniques for organizing data depended on the type of data we were looking at, the methods we’ll use for creating visualizations will vary. Let’s start by considering categorical data.

## Visualizing Categorical Data

If the data we’re visualizing is categorical, then we want a quick way to represent graphically the relative numbers of units that fall in each category. When we created the frequency distributions in the last section, all we did was count the number of units in each category and record that number (this was the frequency of that category). Frequencies are nice when we’re organizing and summarizing data; they’re easy to compute, and they’re always whole numbers. But they can be difficult to understand for an outsider who’s being introduced to your data.

Let’s consider a quick example. Suppose you surveyed some people and asked for their favorite color. You communicated your results using a frequency distribution. Jerry is interested in data on favorite colors, so he reads your frequency distribution. The first row shows that twelve people indicated green was their favorite color. However, Jerry has no way of knowing if that’s a lot of people without knowing how many people total took your survey. Twelve is a pretty significant number if only twenty-five people took the survey, but it’s next to nothing if you recorded a thousand responses. For that reason, we will often summarize categorical data not with frequencies, but with proportions. The proportion of data that fall into a particular category is computed by dividing the frequency for that category by the total number of units in the data.

$\text{Proportion of a category} = \frac{\text{Category frequency}}{\text{Total number of data units}}$

Proportions can be expressed as fractions, decimals, or percentages.

## Example 8.5

### Finding Proportions

Recall Example 8.2, in which a teacher recorded the responses on the first question of a multiple choice quiz, with five possible responses (A, B, C, D, and E). The raw data was as follows:

 A A C A B B A E A C A A A C E A B A A C A B E E A A C C

We computed a frequency distribution that looked like this:

| Response to First Question | Frequency |
| --- | --- |
| A | 14 |
| B | 4 |
| C | 6 |
| D | 0 |
| E | 4 |

$\text{Proportion of a category} = \frac{\text{Category frequency}}{\text{Total number of data units}}$

Now, let's compute the proportions for each category.

## Checkpoint

If you need to round off the results of the computations to get your percentages or decimals, then the sum might not be exactly equal to 1 or 100% in the end due to that rounding error.
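As a rough illustration of the proportion formula and of the rounding issue described in this Checkpoint, here is a minimal Python sketch that uses the quiz responses from Example 8.5. The function name `proportions` is a hypothetical helper introduced only for this sketch, and rounding to whole percentages is an assumption chosen to make the rounding effect visible.

```python
from collections import Counter

# Raw responses to the first quiz question (Example 8.5).
responses = "A A C A B B A E A C A A A C E A B A A C A B E E A A C C".split()

def proportions(data, categories):
    """Return {category: category frequency / total number of data units}."""
    counts = Counter(data)   # categories that never occur count as 0
    total = len(data)
    return {c: counts[c] / total for c in categories}

props = proportions(responses, "ABCDE")
for category, p in props.items():
    print(f"{category}: {p:.3f} as a decimal, {p:.1%} as a percentage")

# Rounding each percentage to a whole number loses a little each time,
# so the rounded values need not add up to exactly 100.
whole_percents = [round(100 * p) for p in props.values()]
print(whole_percents, "sum:", sum(whole_percents))
```

With these data, the unrounded proportions add up to 1, but the whole-number percentages add up to 99 rather than 100.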

1. In Your Turn 8.2, students in a statistics class were asked to provide their majors. Those results are again listed below:

   Undecided, Biology, Biology, Sociology, Political Science, Sociology, Undecided, Undecided, Undecided, Biology, Biology, Education, Biology, Biology, Political Science, Political Science

   You created a frequency distribution:

   | Major | Frequency |
   | --- | --- |
   | Biology | 6 |
   | Education | 1 |
   | Political Science | 3 |
   | Sociology | 2 |
   | Undecided | 4 |

   Now, find the proportions associated with each category. Express your answers as percentages.

Now that we can compute proportions, let’s turn to visualizations. There are two primary visualizations that we’ll use for categorical data: bar charts and pie charts. Both of these data representations work on the same principle: If proportions are represented as areas, then it’s easy to compare two proportions by assessing the corresponding areas. Let’s look at bar charts first.

### Bar Charts

A bar chart is a visualization of categorical data that consists of a series of rectangles arranged side-by-side (but not touching). Each rectangle corresponds to one of the categories. All of the rectangles have the same width. The height of each rectangle corresponds to either the number of units in the corresponding category or the proportion of the total units that fall into the category.

## Example 8.6

### Building a Bar Chart

In Example 8.5, we computed the following proportions:

| Response to First Question | Frequency | Proportion |
| --- | --- | --- |
| A | 14 | 50% |
| B | 4 | 14.3% |
| C | 6 | 21.4% |
| D | 0 | 0% |
| E | 4 | 14.3% |

Draw a bar chart to visualize this frequency distribution.
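One way to get a feel for the construction is a text-only sketch in Python. It is not a substitute for a drawn chart: the bars run sideways, so bar length plays the role of height, and the helper name `text_bar_chart` is an assumption made for this illustration. Every bar uses the same symbol, which keeps the widths equal, and the empty category D still gets its own row.

```python
# Frequencies from Example 8.5; "D" is kept even though nobody chose it.
frequencies = {"A": 14, "B": 4, "C": 6, "D": 0, "E": 4}

def text_bar_chart(freqs, symbol="#"):
    """Return one line per category: label, a bar whose length is the frequency, and the percentage."""
    total = sum(freqs.values())
    lines = []
    for category, count in freqs.items():
        share = count / total
        lines.append(f"{category} | {symbol * count} {share:.1%}")
    return lines

chart = text_bar_chart(frequencies)
print("\n".join(chart))
```

Because each bar's length is proportional to its frequency, comparing lengths by eye is the same as comparing proportions.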

1. The students in a statistics class were asked to provide their majors. The computed proportions for each of the categories are as follows:

   | Major | Frequency | Proportion |
   | --- | --- | --- |
   | Biology | 6 | 37.5% |
   | Education | 1 | 6.3% |
   | Political Science | 3 | 18.8% |
   | Sociology | 2 | 12.5% |
   | Undecided | 4 | 25% |

   Create a bar graph to visualize these data. Use percentages to label the vertical axis.

In practice, most graphs are now made with computers. You can use Google Sheets, which is available for free from any web browser.

Now that we’ve explored how bar graphs are made, let’s get some practice reading bar graphs.

## Example 8.7

The bar graph shown gives data on 2020 model year cars available in the United States. Analyze the graph to answer the following questions.

Figure 8.14 (data source: consumerreports.org/cars)
1. What proportion of available cars were sports cars?
2. What proportion of available cars were sedans?
3. Which categories of cars each made up less than 5% of the models available?

The bar graph shows the region of every institution of higher learning in the United States (except for the service academies, like West Point).

Analyze the bar chart to answer the following questions.
1. Which region contains the largest number of institutions of higher learning?
2. What proportion of all institutions of higher learning can be found in the Southwest?
3. Which regions each have under 5% of the total number of institutions of higher learning?

## WORK IT OUT

### Candy Color: Frequency and Distribution

M&Ms, Skittles, and Reese’s Pieces are all candies that have pieces that are uniformly shaped, but which have different colors. Do the colors in each bag appear with the same frequency? Get a bag of one of these candies and make a bar chart to visualize the color distribution.

### Pie Charts

A pie chart consists of a circle divided into wedges, with each wedge corresponding to a category. The proportion of the area of the entire circle that each wedge represents corresponds to the proportion of the data in that category. Pie charts are difficult to make without technology because they require careful measurements of angles and precise circles, both of which are tasks better left to computers.
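The angle measurements mentioned above follow from the fact that a wedge's share of the circle's area equals its central angle's share of 360 degrees. The short Python sketch below works out those angles for the quiz frequencies from Example 8.5; the function name `wedge_angles` is a hypothetical helper, not part of any charting tool.

```python
def wedge_angles(freqs):
    """Return {category: central angle in degrees} so that each angle is proportion * 360."""
    total = sum(freqs.values())
    return {category: 360 * count / total for category, count in freqs.items()}

# Frequencies from Example 8.5.
angles = wedge_angles({"A": 14, "B": 4, "C": 6, "D": 0, "E": 4})
for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```

Response A, chosen by half of the students, gets a 180-degree wedge (half of the circle), and response D, which nobody chose, gets no wedge at all.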

Pie charts are sometimes embellished with features like labels in the slices (which might be the categories, the frequencies in each category, or the proportions in each category) or a legend that explains which colors correspond to which categories. When making your own pie chart, you can decide which of those to include. The only rule is that there has to be some way to connect the slices to the categories (either through labels or a legend).

## Example 8.8

### Making Pie Charts

Use the data that follows to generate a pie chart.

| Type | Percent | Type | Percent |
| --- | --- | --- | --- |
| SUV | 43.6% | Minivan | 5.5% |
| Sedan | 33.6% | Hatchback | 3.6% |
| Sports | 10.0% | Wagon | 3.6% |

Table 8.1 (data source: www.consumerreports.org/cars)

1. In Your Turn 8.6, you created a bar chart using data on reported majors from students in a class. Here are those proportions again (sorted from largest to smallest):

   | Major | Proportion |
   | --- | --- |
   | Biology | 37.5% |
   | Undecided | 25.0% |
   | Political Science | 18.8% |
   | Sociology | 12.5% |
   | Education | 6.3% |

   Create a pie graph using those data.

## People in Mathematics

### Florence Nightingale

Florence Nightingale (1820–1910) is best remembered today for her contributions in the medical field; after witnessing the horrors of field hospitals that tended to the wounded during the Crimean War, she championed reforms that encouraged sanitary conditions in hospitals. For those efforts, she is today considered the founder of modern nursing.

Figure 8.16 Florence Nightingale's significant contribution to the field of statistical graphics cannot be overstated. (credit: "Florence Nightingale" by Library of Congress Prints and Photographs Division/http://hdl.loc.gov/loc.pnp/pp.print, public domain)

Nightingale is also remembered for her contributions in statistics, especially in the ways we visualize data. She developed a version of the pie chart that is today known as a polar area diagram, which she used to visualize the causes of death among the soldiers in the war, highlighting the number of preventable deaths the British Army suffered in that conflict.

In 1859, the Royal Statistical Society honored her for her contributions to the discipline by electing her to join the organization. She was the first woman to be so honored. She was later named an honorary member of the American Statistical Association. Nightingale's status as a revered pioneer in both nursing and statistics is a complex one, because some of her writings and opinions demonstrate a colonialist mindset and disregard for those who lost their lives and lands at the hands of the British. Her core statistical writings indicated that she felt superior to the Indigenous people she was treating. Members of both fields continue to debate her near-iconic role.

## Visualizing Quantitative Data

There are several good ways to visualize quantitative data. In this section, we’ll talk about two types: stem-and-leaf plots and histograms.

### Stem-and-Leaf Plots

Stem-and-leaf plots are visualization tools that fall somewhere between a list of all the raw data and a graph. A stem-and-leaf plot consists of a list of stems on the left and the corresponding leaves on the right, separated by a line. The stems are the numbers that make up the data only up to the next-to-last digit, and the leaves are the final digits. There is one leaf for every data value (which means that leaves may be repeated), and the leaves should be evenly spaced across all stems. These plots are really nothing more than a fancy way of listing out all the raw data; as a result, they shouldn’t be used to visualize large datasets.

This concept can be difficult to understand without referencing an example, so let’s first look at how to read a stem-and-leaf plot.

## Example 8.9

A collector of trading cards records the sale prices (in dollars) of a particular card on an online auction site, and puts the results in a stem-and-leaf plot:

 0 5 8 9 1 0 0 0 3 4 4 5 5 5 5 6 9 9 2 0 0 0 0 5 5 9 9 3 0 0 0 5 5 4 0 0 5 5 6 0
Table 8.2

1. How many prices are represented?
2. What prices represent the five most expensive cards? The five least expensive?
3. What is the full set of data?

The stem-and-leaf plot below shows data collected from a sample of employed people who were asked how far (in miles) they commute each day:
 0 4 6 7 1 0 0 0 2 2 2 4 5 8 8 2 0 5 5 5 3 0 0 5 5 6 4 5 0 6 0
1.
How many data points are represented?
2.
What are the three longest and shortest commutes?
3.
What is the full list of data?

Stem-and-leaf plots are useful in that they give us a sense of the shape of the data. Are the data evenly spread out over the stems, or are some stems “heavier” with leaves? Are the heavy stems on the low side, the high side, or somewhere in the middle? These are questions about the distribution of the data, or how the data are spread out over the range of possible values.

Some words we use to describe distributions are uniform (data are equally distributed across the range), symmetric (data are bunched up in the middle, then taper off in the same way above and below the middle), left-skewed (data are bunched up at the high end or larger values, and taper off toward the low end or smaller values), and right-skewed (data are bunched up at the low end or smaller values, and taper off toward the high end or larger values). See the figures below.

Looking back at the stem-and-leaf plot in the previous example, we can see that the data are bunched up at the low end and taper off toward the high end; that set of data is right-skewed. Knowing the distribution of a set of data gives us useful information about the property that the data are measuring.

Now that we have a better idea of how to read a stem-and-leaf plot, we’re ready to create our own.

## Example 8.10

### Constructing a Stem-and-Leaf Plot

An entomologist studying crickets recorded the number of times different crickets (of differing species, genders, etc.) chirped in a one-minute span. The raw data are as follows:

 89 97 82 102 84 99 93 103 120 91 115 105 89 109 107 89 104 82 106 92 101 109 116 103 100 91 85 104 104 106

Construct a stem-and-leaf plot to visualize these results.
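The splitting of each value into a stem and a leaf can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are non-negative whole numbers, the leaf is the final digit, and `stem_and_leaf` is a hypothetical helper name. Stems with no data are still listed, so the shape of the distribution is not distorted.

```python
from collections import defaultdict

# Chirps per minute from Example 8.10.
chirps = [89, 97, 82, 102, 84, 99, 93, 103, 120, 91, 115, 105, 89, 109, 107,
          89, 104, 82, 106, 92, 101, 109, 116, 103, 100, 91, 85, 104, 104, 106]

def stem_and_leaf(data):
    """Map each stem (all digits but the last) to its sorted list of leaves (final digits)."""
    plot = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)   # e.g. 103 -> stem 10, leaf 3
        plot[stem].append(leaf)
    # List every stem from the smallest to the largest, even one with no leaves.
    return {stem: plot.get(stem, []) for stem in range(min(plot), max(plot) + 1)}

plot = stem_and_leaf(chirps)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each of the 30 data values contributes exactly one leaf, and repeated values such as 89 appear as repeated leaves on the same stem.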

1. This table gives the records of the Major League Baseball teams at the end of the 2019 season:

   | Team | Wins | Losses | Team | Wins | Losses |
   | --- | --- | --- | --- | --- | --- |
   | HOU | 107 | 55 | PHI | 81 | 81 |
   | LAD | 106 | 56 | TEX | 78 | 84 |
   | NYY | 103 | 59 | SFG | 77 | 85 |
   | MIN | 101 | 61 | CIN | 75 | 87 |
   | ATL | 97 | 65 | CHW | 72 | 89 |
   | OAK | 97 | 65 | LAA | 72 | 90 |
   | TBR | 96 | 66 | COL | 71 | 91 |
   | CLE | 93 | 69 | SDP | 70 | 92 |
   | WSN | 93 | 69 | PIT | 69 | 93 |
   | STL | 91 | 71 | SEA | 68 | 94 |
   | MIL | 89 | 73 | TOR | 67 | 95 |
   | NYM | 86 | 76 | KCR | 59 | 103 |
   | ARI | 85 | 77 | MIA | 57 | 105 |
   | BOS | 84 | 78 | BAL | 54 | 108 |
   | CHC | 84 | 78 | DET | 47 | 114 |

   Table 8.4 (source: http://www.mlb.com)

   Create a stem-and-leaf plot for the number of wins.

As we mentioned above, stem-and-leaf plots aren’t always going to be useful. For example, if all the data in your dataset are between 20 and 29, then you’ll just have one stem, which isn’t terribly useful. (Although there are methods like stem splitting for addressing that particular problem, we won’t go into those at this time.) On the other end of the spectrum, the data may be so spread out that every stem has only one leaf. (This problem can sometimes be addressed by rounding off the data values to the tens, hundreds, or some other place value, then using that place for the leaves.) Finally, if you have dozens or hundreds (or more) of data values, then a stem-and-leaf plot becomes too unwieldy to be useful. Fortunately, we have other tools we can use.

### Histograms

Histograms are visualizations that can be used for any set of quantitative data, no matter how big or spread out. They differ from a categorical bar chart in that the horizontal axis is labeled with numbers (not categories), and the bars are drawn so that they touch each other. Each bar covers a range of values called a bin, and the heights of the bars reflect the frequencies in each bin. Unlike with stem-and-leaf plots, we cannot recreate the original dataset from a histogram. However, histograms are easy to make with technology and are great for identifying the distribution of our data. Let’s first create one histogram without technology to help us better understand how histograms work.

## Example 8.11

### Constructing a Histogram

In Example 8.10, we built a stem-and-leaf plot for the number of chirps made by crickets in one minute. Here are the raw data that we used then:

 89 97 82 102 84 99 115 105 89 109 107 89 101 109 116 103 100 91 93 103 120 91 85 104 104 82 106 92 104 106

Construct a histogram to visualize these results.
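The counting step behind a histogram can be sketched in Python as follows. The choices here are assumptions made for illustration: bins of width 10 starting at 80 (to mirror the stems of a stem-and-leaf plot), whole-number data, and a hypothetical helper named `bin_counts`.

```python
# Chirps per minute from Example 8.11.
chirps = [89, 97, 82, 102, 84, 99, 115, 105, 89, 109, 107, 89, 101, 109, 116,
          103, 100, 91, 93, 103, 120, 91, 85, 104, 104, 82, 106, 92, 104, 106]

def bin_counts(data, start, width):
    """Count the values in consecutive bins start..start+width-1, start+width..., and so on.

    Assumes whole-number data with no value smaller than start.
    """
    number_of_bins = (max(data) - start) // width + 1
    counts = [0] * number_of_bins
    for value in data:
        counts[(value - start) // width] += 1
    return counts

counts = bin_counts(chirps, start=80, width=10)
for i, count in enumerate(counts):
    low = 80 + 10 * i
    print(f"{low:>3}-{low + 9:<3} {'#' * count}")
```

For these data the bin frequencies are 7, 6, 14, 2, and 1, which are the heights of the five touching bars in the histogram.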

1. In Your Turn 8.10, you made a stem-and-leaf plot of the number of wins for each MLB team in 2019, using this set of data:

   | Team | Wins | Losses |
   | --- | --- | --- |
   | HOU | 107 | 55 |
   | NYY | 103 | 59 |
   | MIN | 101 | 61 |
   | ATL | 97 | 65 |
   | OAK | 97 | 65 |
   | TBR | 96 | 66 |
   | CLE | 93 | 69 |
   | WSN | 93 | 69 |
   | STL | 91 | 71 |
   | MIL | 89 | 73 |
   | NYM | 86 | 76 |
   | ARI | 85 | 77 |
   | BOS | 84 | 78 |
   | CHC | 84 | 78 |
   | PHI | 81 | 81 |
   | TEX | 78 | 84 |
   | SFG | 77 | 85 |
   | CIN | 75 | 87 |
   | CHW | 72 | 89 |
   | LAA | 72 | 90 |
   | COL | 71 | 91 |
   | SDP | 70 | 92 |
   | PIT | 69 | 93 |
   | SEA | 68 | 94 |
   | TOR | 67 | 95 |
   | KCR | 59 | 103 |
   | MIA | 57 | 105 |
   | BAL | 54 | 108 |
   | DET | 47 | 114 |

   Table 8.5

   Create a histogram for the number of wins. Use bins of width 10, starting with a bin for 40-49 (so that your histogram reflects the stem-and-leaf plot you made earlier).

Now that we’ve seen the connection between stem-and-leaf plots and histograms, we are ready to look at how we can use Google Sheets to build histograms.

Let’s use Google Sheets to create a histogram for a large dataset.

## Example 8.12

### Creating a Histogram in Google Sheets

The data in “AvgSAT” contains the average SAT score for students attending every institution of higher learning in the US for which data is available. Create a histogram in Google Sheets of the average SAT scores. Use bins of width 50. Are the data uniformly distributed, symmetric, left-skewed, or right-skewed?

1. The file “InState” contains in-state tuition costs (in dollars) for every institution of higher learning in the United States for which data is available (data from data.ed.gov). Create a histogram in Google Sheets of in-state tuition costs. Choose a bin size that you think works well. Are the data uniformly distributed, symmetric, left-skewed, or right-skewed?

### Bar Charts for Labeled Data

Sometimes we have quantitative data where each value is labeled according to the source of the data. For example, in the Your Turn above, you looked at in-state tuition data. Every value you used to create that histogram was associated with a school; the schools are the labels. In Your Turn 8.11, you made a histogram of the wins of every Major League Baseball team in 2019. Each of those win totals had a label: the team. If we’re interested in visualizing differences among the different teams, or schools, or whatever the labels are, we create a different version of the bar graph known as a bar chart for labeled data.

These graphs are made in Google Sheets in exactly the same way as regular bar graphs. The only change is that the vertical axis will be labeled with the units for your quantitative data instead of just “Frequency.”

## Example 8.13

### Building a Bar Chart for Labeled Data

The following table shows the gross domestic product (GDP) for the United States for the years 2010 to 2019:

| Year | GDP (in $ trillions) | Year | GDP (in $ trillions) |
| --- | --- | --- | --- |
| 2010 | 14.992 | 2015 | 18.225 |
| 2011 | 15.543 | 2016 | 18.715 |
| 2012 | 16.197 | 2017 | 19.519 |
| 2013 | 16.785 | 2018 | 20.580 |
| 2014 | 17.527 | 2019 | 21.433 |

Table 8.6 (source: https://data.worldbank.org)

Construct a bar chart that represents these data.

1. The following table shows the world record times (as of February 2020) of the various 100m women’s swimming events in international competition:

   | Event | Time | Name | Nationality |
   | --- | --- | --- | --- |
   | Freestyle | 51.71 | Sarah Sjöström | Sweden |
   | Backstroke | 57.57 | Regan Smith | United States |
   | Breaststroke | 64.10 | Lilly King | United States |
   | Butterfly | 55.48 | Sarah Sjöström | Sweden |

   Make a visualization of these times using the events as the labels.

Graphical representations of data can be manipulated in ways that intentionally mislead the reader. There are two primary ways this can be done: by manipulating the scales on the axes and by manipulating or misrepresenting areas of bars. Let’s look at some examples of these.

## Example 8.14

The table below shows the teams, and their payrolls, in the English Premier League, the top soccer organization in the United Kingdom.

| Team | Salary (£1,000,000s) | Team | Salary (£1,000,000s) |
| --- | --- | --- | --- |
| Manchester United F.C. | 175.7 | Newcastle United F.C. | 56.9 |
| Manchester City F.C. | 136.5 | Aston Villa F.C. | 52.3 |
| Chelsea F.C. | 132.8 | Fulham F.C. | 52.1 |
| Arsenal F.C. | 130.7 | Southampton F.C. | 49.6 |
| Tottenham Hotspur F.C. | 129.2 | Wolverhampton Wanderers F.C. | 49.5 |
| Liverpool F.C. | 118.6 | Brighton & Hove Albion | 43.7 |
| Crystal Palace | 85.0 | Burnley F.C. | 35.5 |
| Everton F.C. | 82.5 | West Bromwich Albion F.C. | 23.8 |
| Leicester City | 73.7 | Leeds United F.C. | 22.5 |
| West Ham United F.C. | 69.2 | Sheffield United F.C. | 19.7 |

Table 8.8 (source: www.spotrac.com)

How might someone present this data in a misleading way?
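One of the manipulations mentioned earlier, changing the scale on the vertical axis, can be quantified with a small Python sketch. The starting value of 100 for the truncated axis is a hypothetical choice made only for illustration, as is the helper name `drawn_height`; the two payrolls come from Table 8.8.

```python
def drawn_height(value, axis_start=0.0):
    """Height of a bar as it appears on the page when the vertical axis begins at axis_start."""
    return value - axis_start

# Payrolls in millions of pounds, from Table 8.8.
manchester_united, manchester_city = 175.7, 136.5

# Honest chart: the axis starts at 0, so bar heights are proportional to the data.
honest_ratio = drawn_height(manchester_united) / drawn_height(manchester_city)

# Misleading chart (hypothetical): the axis starts at 100, which exaggerates the gap.
truncated_ratio = drawn_height(manchester_united, 100) / drawn_height(manchester_city, 100)

print(f"Axis from 0:   first bar is {honest_ratio:.2f} times as tall")
print(f"Axis from 100: first bar is {truncated_ratio:.2f} times as tall")
```

The payrolls differ by a factor of about 1.29, but with the axis cut off at 100 the first bar looks roughly 2.07 times as tall as the second, even though no data value has changed.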

## Checkpoint

Always check the horizontal axis on histograms! The widths of all the bars should be equal.

1. Take a look again at the win totals for teams in Major League Baseball in 2019:

   | Team | Wins | Team | Wins |
   | --- | --- | --- | --- |
   | HOU | 107 | PHI | 81 |
   | NYY | 103 | SFG | 77 |
   | MIN | 101 | CIN | 75 |
   | ATL | 97 | CHW | 72 |
   | OAK | 97 | LAA | 72 |
   | TBR | 96 | COL | 71 |
   | CLE | 93 | SDP | 70 |
   | WSN | 93 | PIT | 69 |
   | STL | 91 | SEA | 68 |
   | MIL | 89 | TOR | 67 |
   | NYM | 86 | KCR | 59 |
   | ARI | 85 | MIA | 57 |
   | BOS | 84 | BAL | 54 |
   | CHC | 84 | DET | 47 |

   Table 8.9 (source: https://www.espn.com/mlb/standings/_/season/2019/view)

   Make one good and one misleading chart showing the number of wins by the top ten teams. Then, looking at all the teams, make one good and one misleading histogram for the win totals.

## Who Knew?

### Napoleon's Failed Invasion

One of the most famous data visualizations ever created is the cartographic depiction by Charles Joseph Minard of Napoleon’s disastrous attempted invasion of Russia.

Figure 8.28 Minard’s Napoleon Map (credit: Carte de Charles Minard/Wikimedia, public domain)

Minard’s chart is remarkable in that it shows not just how the size of Napoleon’s army shrank drastically over time, but also the location on the map, the direction the army was traveling at the time, and the temperature during the retreat.

The medical office at a zoo tracks the animals it treats each week. The table shows the classifications for a particular week:

Mammal, Mammal, Reptile, Bird, Mammal, Amphibian, Mammal, Mammal, Mammal, Reptile, Mammal, Bird, Mammal, Bird, Reptile, Reptile, Amphibian, Mammal, Bird, Mammal, Amphibian, Mammal, Mammal, Bird

7. Create a bar graph of the data without technology.
8. Create a pie chart of the data using technology.

Employees at a college help desk track the number of people who request assistance each week. The table gives a sample of the results:

142 153 158 156 141 143 139 158 156 146 137 153 136 127 157 148 132 139 155 167 143 168 133 157 138 156 164 130 148 136

9. Make a stem-and-leaf plot of the data.
10. Create a histogram of the data. Use bins of width 5.

The following are data on the admission rates of the different branch campuses in the University of California system, along with the out-of-state tuition and fee cost:

| Campus | Admission rate | Out-of-state tuition and fees |
| --- | --- | --- |
| Berkeley | 0.1484 | 43,176 |
| Davis | 0.4107 | 43,394 |
| Irvine | 0.2876 | 42,692 |
| Los Angeles | 0.1404 | 42,218 |
| Merced | 0.6617 | 42,530 |
| Riverside | 0.5057 | 42,819 |
| San Diego | 0.3006 | 43,159 |
| Santa Barbara | 0.322 | 43,383 |
| Santa Cruz | 0.4737 | 42,952 |

(source: https://data.ed.gov)

11. Create a bar graph that illustrates the differences in admission rates among the different campuses.
12. Create two bar graphs for the out-of-state tuition. One should give an unbiased perception of the differences among them, and the other should overemphasize those differences.

## Section 8.2 Exercises

The table below shows the answers to the question, “Which social media platform, if any, do you use most frequently?”

1. Make a bar chart to visualize these responses.
2. Make a pie chart to visualize these responses.

A sample of students at a large university were asked whether they were full-time students living on campus (Full-Time Residential, FTR), full-time students who commuted (FTC), or part-time students (PT). The raw data are in the table below:

FTR FTR FTC PT FTR PT FTR FTC FTR FTC FTC FTR FTR PT FTC FTC FTC PT FTC FTC PT FTR FTC PT FTC PT FTR PT FTC FTC FTR PT

3. Make a bar chart to visualize these responses.
4. Make a pie chart to visualize these responses.
Students in a statistics class were asked how many countries (besides their home countries) they had visited; the table below gives the raw responses:

0 2 1 1 3 2 0 2 0 1 0 2 0 1 0 1 1 0 1 0 0 0 0 2 1 1 0 1 1 0

5. Create a bar graph visualizing these data (treating the responses as categorical).
6. Create a pie chart visualizing these data.
The purchasing department for a chain of bookstores wants to make sure they’re buying the right types of books to put on the shelves, so they take a sample of 20 books that customers bought in the last five days and record the genres:

7. Create a bar graph to visualize these data.
8. Create a pie chart to visualize these data.
An elementary school class is administered a standardized test for which scores range from 0 to 100, as shown below:

60 54 71 80 63 72 70 88 88 67 74 79 50 99 64 98 55 64 86 92 72 65 88 80 65

9. Make a stem-and-leaf plot to visualize these results.
10. Make a histogram to visualize these results. Use bins of width 10.
The following table gives the final results for the 2021 National Women’s Soccer League season. The columns are standings points (PTS; teams earn three points for a win and one point for a tie), wins (W), losses (L), ties (T), goals scored by that team (GF), and goals scored against that team (GA).

| Team | PTS | W | L | T | GF | GA |
| --- | --- | --- | --- | --- | --- | --- |
| Portland Thorns FC | 44 | 13 | 6 | 5 | 33 | 17 |
| OL Reign | 42 | 13 | 8 | 3 | 37 | 24 |
| Washington Spirit | 39 | 11 | 7 | 6 | 29 | 26 |
| Chicago Red Stars | 38 | 11 | 8 | 5 | 28 | 28 |
| NJ/NY Gotham FC | 35 | 8 | 5 | 11 | 29 | 21 |
| North Carolina Courage | 33 | 9 | 9 | 6 | 28 | 23 |
| Houston Dash | 32 | 9 | 10 | 5 | 31 | 31 |
| Orlando Pride | 28 | 7 | 10 | 7 | 27 | 32 |
| Racing Louisville FC | 22 | 5 | 12 | 7 | 21 | 40 |
| Kansas City Current | 16 | 3 | 14 | 7 | 15 | 36 |

(source: http://www.nwslsoccer.com)

11. Make a stem-and-leaf plot for PTS.
12. Make a histogram for PTS, using bins of width 5.
13. Make a histogram for GF, using bins of width 5.
14. Make a histogram for GA, using bins of width 5.
For the following exercises, use the "CUNY" dataset to identify the right visualization to address each question, then create those visualizations. The dataset gives the location (borough) of each college in the City University of New York (CUNY) system, the highest degree offered, and the proportions of total degrees awarded in a partial list of disciplines.

15. What is the highest degree offered in colleges across the CUNY system?
16. What is the distribution of the proportion of degrees awarded in Information Science across the CUNY system?
17. In which boroughs are the CUNY colleges located?
18. What are the proportions of degrees awarded across the listed humanities fields (Foreign Language, English, Humanities, Philosophy & Religion, History) at City College?
19. What proportions of degrees are awarded in Social Service at the different institutions located in Manhattan?
For the following exercises, use the data found in the "Receivers" dataset on the top 25 receivers (by number of receptions; data collected from pro-football-reference.com) in the NFL during the 2020 season.

20. Make a stem-and-leaf plot for the longest receptions (“Long”).
21. Make a stem-and-leaf plot for receptions.
22. Make a histogram for yards.
23. Make a histogram for yards per reception (“Yds/Rec”).
24. Make a histogram for the longest receptions (“Long”).
25. Make a histogram for receptions.
26. Make a histogram for age.
27. Describe the distribution of age as left-skewed, symmetric, or right-skewed.
28. Describe the distribution of receptions as left-skewed, symmetric, or right-skewed.
29. Describe the distribution of yards as left-skewed, symmetric, or right-skewed.
30. Describe the distribution of touchdowns (“TD”) as left-skewed, symmetric, or right-skewed.
31. Describe the distribution of longest receptions as left-skewed, symmetric, or right-skewed.