Donna Kirk

8.2 Visualizing Data

A group of people are at a table, with their hands shown holding pens. They're all pointing to data on a piece of paper. — Figure 8.4 Data visualizations can help people quickly understand important features of a dataset. (credit: "Group of diverse people having a business meeting" by Rawpixel Ltd/Flickr, CC BY 2.0)

Learning Objectives

After completing this section, you should be able to:

Create charts and graphs to appropriately represent data.
Interpret visual representations of data.
Determine misleading components in data displayed visually.

Summarizing raw data is the first step we must take when we want to communicate the results of a study or experiment to a broad audience. However, even organized data can be difficult to read; for example, if a frequency table is large, it can be tough to compare the first row to the last row. As the old saying goes: a picture is worth a thousand words (or, in this case, summary statistics)! Just as our techniques for organizing data depended on the type of data we were looking at, the methods we’ll use for creating visualizations will vary. Let’s start by considering categorical data.

Visualizing Categorical Data

If the data we’re visualizing is categorical, then we want a quick way to represent graphically the relative numbers of units that fall in each category. When we created the frequency distributions in the last section, all we did was count the number of units in each category and record that number (this was the frequency of that category). Frequencies are nice when we’re organizing and summarizing data; they’re easy to compute, and they’re always whole numbers. But they can be difficult to understand for an outsider who’s being introduced to your data.

Let’s consider a quick example. Suppose you surveyed some people and asked for their favorite color. You communicated your results using a frequency distribution. Jerry is interested in data on favorite colors, so he reads your frequency distribution. The first row shows that twelve people indicated green was their favorite color. However, Jerry has no way of knowing if that’s a lot of people without knowing how many people total took your survey. Twelve is a pretty significant number if only twenty-five people took the survey, but it’s next to nothing if you recorded a thousand responses. For that reason, we will often summarize categorical data not with frequencies, but with proportions. The proportion of data that fall into a particular category is computed by dividing the frequency for that category by the total number of units in the data.

$\text{Proportion of a category} = \frac{\text{Category frequency}}{\text{Total number of data units}}$

Proportions can be expressed as fractions, decimals, or percentages.
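As a quick worked illustration of the formula, take the twelve "green" responses from the favorite-color survey and the two hypothetical survey sizes mentioned above (25 and 1,000 respondents):

$\frac{12}{25} = 0.48 = 48\%$ and $\frac{12}{1000} = 0.012 = 1.2\%$

The same frequency is nearly half of the small survey but barely one percent of the large one, which is why the proportion is easier for an outsider to interpret.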

Example 8.5

Finding Proportions

Recall Example 8.2, in which a teacher recorded the responses on the first question of a multiple choice quiz, with five possible responses (A, B, C, D, and E). The raw data was as follows:

A	C	A	B	A	E	A	C	A	C
E	A	B	A	C	A	B	E	A	C

We computed a frequency distribution that looked like this:

Response to First Question	Frequency
A	14
B	4
C	6
D	0
E	4

$\text{Proportion of a category} = \frac{\text{Category frequency}}{\text{Total number of data units}}$

Now, let's compute the proportions for each category.

Solution

Step 1: In order to compute a proportion, we need the frequency (which we have in the table above) and the total number of units that are represented in our data. We can find that by adding up the frequencies from all the categories: $14 + 4 + 6 + 0 + 4 = 28$ .

Step 2: To find the proportions, we divide the frequency by the total. For the first category (“A”), the proportion is $\frac{14}{28} = \frac{1}{2} = 0.5 = 50\%$. We can compute the other proportions similarly, filling in the rest of the table (rounded to one decimal place where necessary):

Response to First Question	Frequency	Proportion
A	14	$\frac{14}{28} = 50\%$
B	4	$\frac{4}{28} \approx 14.3\%$
C	6	$\frac{6}{28} \approx 21.4\%$
D	0	$\frac{0}{28} = 0\%$
E	4	$\frac{4}{28} \approx 14.3\%$

Step 3: Check your work: If you add up your proportions, you should get 1 (if you’re using fractions or decimals) or 100% (if you’re using percentages). In this case, $50\% + 14.3\% + 21.4\% + 0\% + 14.3\% = 100\%$.

Checkpoint

If you need to round off the results of the computations to get your percentages or decimals, then the sum might not be exactly equal to 1 or 100% in the end due to that rounding error.
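The computation in this example can also be sketched in a few lines of Python. This is only an illustration, not part of the text's Google Sheets workflow; the function name `proportions` is a hypothetical helper, and the frequencies are the ones from the table in Example 8.5. Exact fractions are used so that the "sum to 1" check is not disturbed by rounding.

```python
from fractions import Fraction

# Frequencies from the table in Example 8.5.
frequencies = {"A": 14, "B": 4, "C": 6, "D": 0, "E": 4}


def proportions(freqs):
    """Return each category's share of the total as an exact fraction."""
    total = sum(freqs.values())
    return {category: Fraction(count, total) for category, count in freqs.items()}


props = proportions(frequencies)

for category, share in props.items():
    # Show the exact fraction and the percentage rounded to one decimal place.
    print(f"{category}: {share} = {float(share):.1%}")

# Exact fractions always sum to 1; rounded percentages may be slightly off.
exact_total = sum(props.values())
rounded_total = sum(round(float(share) * 100, 1) for share in props.values())
print("Exact total:", exact_total)
print("Sum of rounded percentages:", round(rounded_total, 1))
```

Working with `Fraction` keeps the check in Step 3 exact; the rounded percentages are computed separately so that any rounding error, as described in the Checkpoint, is easy to see.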

Your Turn 8.5

1.

In Your Turn 8.2, students in a statistics class were asked to provide their majors. Those results are again listed below:

Undecided	Biology	Biology	Sociology
Political Science	Sociology	Undecided	Undecided
Undecided	Biology	Biology	Education
Biology	Biology	Political Science	Political Science

You created a frequency distribution:

Major	Frequency
Biology	6
Education	1
Political Science	3
Sociology	2
Undecided	4

Now, find the proportions associated with each category. Express your answers as percentages.

Now that we can compute proportions, let’s turn to visualizations. There are two primary visualizations that we’ll use for categorical data: bar charts and pie charts. Both of these data representations work on the same principle: If proportions are represented as areas, then it’s easy to compare two proportions by assessing the corresponding areas. Let’s look at bar charts first.

Bar Charts

A bar chart is a visualization of categorical data that consists of a series of rectangles arranged side-by-side (but not touching). Each rectangle corresponds to one of the categories. All of the rectangles have the same width. The height of each rectangle corresponds to either the number of units in the corresponding category or the proportion of the total units that fall into the category.

Example 8.6

Building a Bar Chart

In Example 8.5, we computed the following proportions:

Response to First Question	Frequency	Proportion
A	14	50%
B	4	14.3%
C	6	21.4%
D	0	0%
E	4	14.3%

Draw a bar chart to visualize this frequency distribution.

Solution

Step 1: To start, we’ll draw axes with the origin (the point where the axes meet) at the bottom left:

A blank bar chart with horizontal and vertical axes. — Figure 8.5

Step 2: Next, we’ll place our categories evenly spaced along the bottom of the horizontal axis. The order doesn’t really matter, but if the categories have some sort of natural order (like in this case, where the responses are labeled A to E), it’s best to maintain that order. We'll also label the horizontal axis:

A bar chart. The horizontal axis representing response on question 1 ranges from A to E. The vertical axis is blank. — Figure 8.6

Step 3: Now, we have a decision to make: Will we use frequencies to define the height of our rectangles, or will we use proportions? Let’s try it both ways. First, let’s use frequencies. Notice that our frequencies run from zero to 14; this will correspond to the scale we put on the vertical axis. If we put a tick mark for every whole number between 0 and 14, the result will be pretty crowded; let’s instead put a mark at each multiple of 3:

A bar chart. The horizontal axis representing response on question 1 ranges from A to E. The vertical axis representing frequency ranges from 3 to 15, in increments of 3. — Figure 8.7

Step 4: Now, let’s draw in the first rectangle. The frequency associated with “A” is 14. So we’ll go to 14 on the vertical axis, and place a mark at that height above the “A” label:

A bar chart plots a horizontal dashed line. The horizontal axis representing response on question 1 ranges from A to E. The vertical axis representing frequency ranges from 3 to 15, in increments of 3. The bar chart infers the following data. A: 14. — Figure 8.8

Step 5: Then, draw vertical lines straight down from the edges of your mark to make a rectangle:

A bar chart shows a vertical bar. The horizontal axis representing response on question 1 ranges from A to E. The vertical axis representing frequency ranges from 3 to 15, in increments of 3. The bar chart infers the following data. A: 14. — Figure 8.9

Step 6: Finally, we can build the rest of the rectangles, making sure that the bases all have the same length (the length of the base is the width of the rectangle) and that the rectangles don’t touch. Notice that, since the frequency for “D” is zero, that category has no rectangle (but we’ll leave a space there so the reader can see that there is a category with frequency zero). Here’s the result:

A bar chart plots frequency versus responses. The horizontal axis representing response on question 1 ranges from A to E. The vertical axis representing frequency ranges from 3 to 15, in increments of 3. The bar graph infers the following data. A: 14; B: 4; C: 6; E: 4. — Figure 8.10

Step 7: That’s it! Now, let’s use proportions instead of frequencies. We'll label the vertical axis with evenly spaced numbers that run the full range of the percentages in our table: 0% to 50%. We can divide that into five equal parts (so that each has width 10%), and use that to label our vertical axis:

A bar chart. The horizontal axis representing response on question 1 ranges from A to E. The vertical axis representing percent ranges from 10 percent to 50 percent, in increments of 10. — Figure 8.11

Step 8: Then, we can fill in the rectangles just as we did before. The height of the “A” rectangle is 50%, the “B” rectangle goes up to 14.3%, “C” goes to 21.4%, there is no rectangle for “D” (since its proportion is 0%), and the “E” rectangle also goes up to 14.3%:

A bar chart plots percent versus responses on question 1. The horizontal axis representing response on question 1 ranges from A to E. The vertical axis representing percent ranges from 10 percent to 50 percent, in increments of 10. The graph infers the following data. A: 50; B: 14.3; C: 21.4; E: 14.3. Note: all values are approximate. — Figure 8.12

Step 9: Notice that the rectangles are basically identical in our two final bar charts. That’s no coincidence! Bar charts that use proportions and those that use frequencies will always look identical (which is why it doesn’t really matter much which option you choose). Here’s why: look at the bars for “B” and “C”. The frequencies for these are 4 and 6 respectively. Notice that 6 is 50% bigger than 4 (since $6 = 1.5 \times 4$), which means that the “C” bar will be 50% higher than the “B” bar. Now look at the same bars using proportions: since $21.4\% \approx 1.5 \times 14.3\%$, the bar for “C” will again be 50% higher than the bar for “B.” The same relationships hold for the other bars, too.

A bar chart plots frequency and percent versus responses on question 1. The horizontal axis representing response on question 1 ranges from A to E. The vertical axis on the left representing frequency ranges from 3 to 15, in increments of 3. The vertical axis on the right representing percent ranges from 10 percent to 50 percent, in increments of 10. The graph infers the following data. A: 14, 50 percent; B: 4, 14.3 percent; C: 6, 21.4 percent; E: 4, 14.3 percent. Note: all values are approximate. — Figure 8.13
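The observation in Step 9 can be checked numerically. The sketch below is a hypothetical illustration in plain Python: it prints a rough sideways bar chart using `#` characters, then scales the bars so that the tallest has height 1 and confirms that the frequency version and the percentage version give the same relative heights. The helper name `relative_heights` is an assumption made for this example.

```python
# Frequencies from Example 8.6; percentages are derived from them.
frequencies = {"A": 14, "B": 4, "C": 6, "D": 0, "E": 4}
total = sum(frequencies.values())
percentages = {c: 100 * f / total for c, f in frequencies.items()}


def relative_heights(values):
    """Scale every bar so the tallest bar has height 1."""
    tallest = max(values.values())
    return {category: value / tallest for category, value in values.items()}


by_frequency = relative_heights(frequencies)
by_percentage = relative_heights(percentages)

for category in frequencies:
    # One '#' per unit of frequency gives a rough sideways bar chart.
    bar = "#" * frequencies[category]
    print(f"{category} | {bar} {frequencies[category]} ({percentages[category]:.1f}%)")

# The two charts have the same shape: relative bar heights agree.
same_shape = all(
    abs(by_frequency[c] - by_percentage[c]) < 1e-12 for c in frequencies
)
print("Same relative heights:", same_shape)
```

Dividing every frequency by the same total rescales the vertical axis but leaves the ratio between any two bars unchanged, which is all the check above relies on.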

Your Turn 8.6

1.

The students in a statistics class were asked to provide their majors. The computed proportions for each of the categories are as follows:

Major	Frequency	Proportion
Biology	6	37.5%
Education	1	6.3%
Political Science	3	18.8%
Sociology	2	12.5%
Undecided	4	25%

Create a bar graph to visualize these data. Use percentages to label the vertical axis.

In practice, most graphs are now made with computers. You can use Google Sheets, which is available for free from any web browser.

Video

Make a Simple Bar Graph in Google Sheets

Now that we’ve explored how bar graphs are made, let’s get some practice reading bar graphs.

Example 8.7

Reading Bar Graphs

The bar graph shown gives data on 2020 model year cars available in the United States. Analyze the graph to answer the following questions.

A bar graph titled, 2020 model year cars available in the US. The horizontal axis represents cars. The vertical axis representing percent ranges from 0 to 50, in increments of 5. The graph infers the following data. Hatchback: 4; Minivan: 6; Sedan: 34; Sports: 10; SUV: 44; Wagon: 4. Note: all values are approximate. — Figure 8.14 (data source: consumerreports.org/cars)

What proportion of available cars were sports cars?
What proportion of available cars were sedans?
Which categories of cars each made up less than 5% of the models available?

Solution

The bar for sports cars goes up to 10%, so the proportion of models that are considered sports cars is 10%.
The bar corresponding to sedan goes up past 30% but not quite to 35%. It looks like the proportion we want is between 33% and 34%.
We’re looking for the bars that don’t make it all the way to the 5% line. Those categories are hatchback and wagon.

Your Turn 8.7

The bar graph shows the region of every institution of higher learning in the United States (except for the service academies, like West Point).

A bar graph titled, regions of the institution of higher education in the US. The horizontal axis represents cities. The vertical axis representing percent ranges from 0 to 30, in increments of 5. The bar graph infers the following data. Far West: 14; Great Lakes: 14.5; Mid East: 17; New England: 5.5; Outlying Areas: 2.5; Plains: 8; Rocky Mountains: 4; Southeast: 25; Southwest: 10.5. Note: all values are approximate.

Analyze the bar chart to answer the following questions.

1.

Which region contains the largest number of institutions of higher learning?

2.

What proportion of all institutions of higher learning can be found in the Southwest?

3.

Which regions each have under 5% of the total number of institutions of higher learning?

WORK IT OUT

Candy Color: Frequency and Distribution

M&Ms, Skittles, and Reese’s Pieces are all candies that have pieces that are uniformly shaped, but which have different colors. Do the colors in each bag appear with the same frequency? Get a bag of one of these candies and make a bar chart to visualize the color distribution.

Pie Charts

A pie chart consists of a circle divided into wedges, with each wedge corresponding to a category. The proportion of the area of the entire circle that each wedge represents corresponds to the proportion of the data in that category. Pie charts are difficult to make without technology because they require careful measurements of angles and precise circles, both of which are tasks better left to computers.

Video

Create Pie Charts Using Google Sheets

Pie charts are sometimes embellished with features like labels in the slices (which might be the categories, the frequencies in each category, or the proportions in each category) or a legend that explains which colors correspond to which categories. When making your own pie chart, you can decide which of those to include. The only rule is that there has to be some way to connect the slices to the categories (either through labels or a legend).

Example 8.8

Making Pie Charts

Use the data that follows to generate a pie chart.

Type	Percent	Type	Percent
SUV	43.6%	Minivan	5.5%
Sedan	33.6%	Hatchback	3.6%
Sports	10.0%	Wagon	3.6%

Table 8.1 (data source: www.consumerreports.org/cars)

Solution

First, enter the table above into a new sheet in Google Sheets. Next, click and drag to select the full table (including the header row). Click on the “Insert” menu, then select “Chart.” The result may be a pie chart by default; if it isn’t, you can change it to a pie chart using the “Chart type” drop-down menu in the Chart Editor.

A pie chart titled, 2020 model year cars available in the US. The circle graph infers the following data. SUV: 43.6 percentage; Sedan: 33.6 percentage; Sports: 10.0 percentage; Minivan: 5.5 percentage; Hatchback: 3.6 percentage; Wagon: 3.6 percentage. — Figure 8.15 (data source: consumerreports.org/cars)

You can choose to use a legend to identify the categories, as well as label the slices with the relevant percentages.
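To see what the software is doing behind the scenes, the wedge angles can be computed directly: each wedge's central angle is its share of the full 360 degrees. The following is a minimal Python sketch under that assumption; `wedge_angles` is a hypothetical helper name, and the percentages are those of Table 8.1 (which add up to 99.9% because of rounding).

```python
# Percentages from Table 8.1.
percents = {
    "SUV": 43.6, "Sedan": 33.6, "Sports": 10.0,
    "Minivan": 5.5, "Hatchback": 3.6, "Wagon": 3.6,
}


def wedge_angles(shares):
    """Return the central angle (in degrees) of each wedge.

    Shares are divided by their own total, so rounded percentages that
    do not sum to exactly 100 still produce a full 360-degree circle.
    """
    total = sum(shares.values())
    return {label: 360 * share / total for label, share in shares.items()}


angles = wedge_angles(percents)
for label, angle in angles.items():
    print(f"{label}: {angle:.1f} degrees")
```

Measuring these angles by hand with a protractor is the step that makes pie charts tedious to draw without technology.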

Your Turn 8.8

1.

In Your Turn 8.6, you created a bar chart using data on reported majors from students in a class. Here are those proportions again (sorted from largest to smallest):

Major	Proportion
Biology	37.5%
Undecided	25.0%
Political Science	18.8%
Sociology	12.5%
Education	6.3%

Create a pie graph using those data.

People in Mathematics

Florence Nightingale

Florence Nightingale (1820–1910) is best remembered today for her contributions in the medical field; after witnessing the horrors of field hospitals that tended to the wounded during the Crimean War, she championed reforms that encouraged sanitary conditions in hospitals. For those efforts, she is today considered the founder of modern nursing.

Nightingale is also remembered for her contributions in statistics, especially in the ways we visualize data. She developed a version of the pie chart that is today known as a polar area diagram, which she used to visualize the causes of death among the soldiers in the war, highlighting the number of preventable deaths the British Army suffered in that conflict.

In 1859, the Royal Statistical Society honored her for her contributions to the discipline by electing her to join the organization. She was the first woman to be so honored. She was later named an honorary member of the American Statistical Association. Nightingale's status as a revered pioneer in both nursing and statistics is a complex one, because some of her writings and opinions demonstrate a colonialist mindset and disregard for those who lost their lives and lands at the hands of the British. Her core statistical writings indicated that she felt superior to the Indigenous people she was treating. Members of both fields continue to debate her near-iconic role.

Visualizing Quantitative Data

There are several good ways to visualize quantitative data. In this section, we’ll talk about two types: stem-and-leaf plots and histograms.

Stem-and-Leaf Plots

Stem-and-leaf plots are visualization tools that fall somewhere between a list of all the raw data and a graph. A stem-and-leaf plot consists of a list of stems on the left and the corresponding leaves on the right, separated by a line. The stems are the numbers that make up the data only up to the next-to-last digit, and the leaves are the final digits. There is one leaf for every data value (which means that leaves may be repeated), and the leaves should be evenly spaced across all stems. These plots are really nothing more than a fancy way of listing out all the raw data; as a result, they shouldn’t be used to visualize large datasets.

This concept can be difficult to understand without referencing an example, so let’s first look at how to read a stem-and-leaf plot.

Example 8.9

Reading a Stem-and-Leaf Plot

A collector of trading cards records the sale prices (in dollars) of a particular card on an online auction site, and puts the results in a stem-and-leaf plot:

0	5 8 9
1	0 0 0 3 4 4 5 5 5 5 6 9 9
2	0 0 0 0 5 5 9 9
3	0 0 0 5 5
4	0 0 5
5
6	0

Table 8.2

Answer the following questions about the data:

How many prices are represented?
What prices represent the five most expensive cards? The five least expensive?
What is the full set of data?

Solution

Each leaf (the numbers on the right side of the bar) represents one data value. So, on the first row (which looks like 0 | 5 8 9), there are three data values (one for each leaf: 5, 8, and 9). The next row has thirteen leaves, then eight, five, three, zero, and one. Adding those up, we get $3 + 13 + 8 + 5 + 3 + 0 + 1 = 33$ data points or prices.
The most expensive card is the last one listed. Its stem is 6 and its leaf is 0, so the price is $60. There are no leaves associated with the 5 stem, so there were no cards sold for $50 to $59. The next most expensive cards are then on the 4 stem: $45, $40, and $40 (remember, repeated leaves mean repeated values in the dataset). So, we have our four most expensive cards. The fifth would be on the next stem up. The biggest leaf on the 3 stem is a 5, so the fifth-most expensive card sold for $35.
As for the five least-expensive cards, the smallest stem is 0, with leaves 5, 8, and 9. So, the three least expensive cards sold for $5, $8, and $9 (notice that we don’t write down that leading 0 from the stem in the tens place). The next two least-expensive cards will be the two smallest leaves on the next stem: $10 and $10.
The full list of data is: 5, 8, 9, 10, 10, 10, 13, 14, 14, 15, 15, 15, 15, 16, 19, 19, 20, 20, 20, 20, 25, 25, 29, 29, 30, 30, 30, 35, 35, 40, 40, 45, 60.
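Reading a stem-and-leaf plot amounts to gluing each leaf back onto its stem. As a hedged illustration, the Python sketch below stores Table 8.2 as a mapping from stems to leaves and rebuilds the data; the name `expand` and the dictionary layout are assumptions made for this example, and the rule "value = 10 × stem + leaf" only applies when each leaf is a single final digit, as it is here.

```python
# Table 8.2 as a mapping from each stem to its list of leaves.
plot = {
    0: [5, 8, 9],
    1: [0, 0, 0, 3, 4, 4, 5, 5, 5, 5, 6, 9, 9],
    2: [0, 0, 0, 0, 5, 5, 9, 9],
    3: [0, 0, 0, 5, 5],
    4: [0, 0, 5],
    5: [],
    6: [0],
}


def expand(stem_and_leaf):
    """Rebuild the data values: each value is 10 * stem + leaf."""
    return [10 * stem + leaf
            for stem, leaves in sorted(stem_and_leaf.items())
            for leaf in leaves]


prices = expand(plot)
print("Number of prices:", len(prices))
print("Five least expensive:", prices[:5])
print("Five most expensive:", prices[-5:])
```

Because the leaves on each stem are already in increasing order, the rebuilt list comes out sorted, so the cheapest and most expensive cards sit at its two ends.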

Your Turn 8.9

The stem-and-leaf plot below shows data collected from a sample of employed people who were asked how far (in miles) they commute each day:

0	4 6 7
1	0 0 0 2 2 2 4 5 8 8
2	0 5 5 5
3	0 0 5 5 6
4
5	0
6	0

1.

How many data points are represented?

2.

What are the three longest and shortest commutes?

3.

What is the full list of data?

Stem-and-leaf plots are useful in that they give us a sense of the shape of the data. Are the data evenly spread out over the stems, or are some stems “heavier” with leaves? Are the heavy stems on the low side, the high side, or somewhere in the middle? These are questions about the distribution of the data, or how the data are spread out over the range of possible values.

Some words we use to describe distributions are uniform (data are equally distributed across the range), symmetric (data are bunched up in the middle, then taper off in the same way above and below the middle), left-skewed (data are bunched up at the high end or larger values, and taper off toward the low end or smaller values), and right-skewed (data are bunched up at the low end, and taper off toward the high end). See the figures below.

Four histograms. The first histogram is titled, Uniform. The horizontal axis ranges from 0 to 5, in increments of 1. The vertical axis ranges from 0 to 60, in increments of 10. The histogram infers the following data. 0 to 1: 38. 1 to 2: 35. 2 to 3: 51. 3 to 4: 39. 4 to 5: 37. The second histogram is titled, Right-skewed. The horizontal axis ranges from 1 to 21, in increments of 2. The vertical axis ranges from 0 to 50, in increments of 10. The histogram infers the following data. 1 to 3: 18. 3 to 5: 40. 5 to 7: 44. 7 to 9: 37. 9 to 11: 29. 11 to 13: 14. 13 to 15: 8. 15 to 17: 5. 17 to 19: 3. 19 to 21: 2. The third histogram is titled, Left-skewed. The horizontal axis ranges from 30 to 50, in increments of 2. The vertical axis ranges from 0 to 60, in increments of 10. The histogram infers the following data. 30 to 32: 3. 32 to 34: 1. 34 to 36: 6. 36 to 38: 11. 38 to 40: 13. 40 to 42: 31. 42 to 44: 39. 44 to 46: 52. 46 to 48: 35. 48 to 50: 8. The fourth histogram is titled, Symmetric. The horizontal axis ranges from 30 to 180, in increments of 15. The vertical axis ranges from 0 to 80, in increments of 10. The histogram infers the following data. 30 to 45: 2. 45 to 60: 7. 60 to 75: 20. 75 to 90: 31. 90 to 105: 70. 105 to 120: 30. 120 to 135: 29. 135 to 150: 10. 150 to 165: 2. 165 to 180: 1. Note: all values are approximate.

Looking back at the stem-and-leaf plot in the previous example, we can see that the data are bunched up at the low end and taper off toward the high end; that set of data is right-skewed. Knowing the distribution of a set of data gives us useful information about the property that the data are measuring.

Now that we have a better idea of how to read a stem-and-leaf plot, we’re ready to create our own.

Example 8.10

Constructing a Stem-and-Leaf Plot

An entomologist studying crickets recorded the number of times different crickets (of differing species, genders, etc.) chirped in a one-minute span. The raw data are as follows:

89	97	82	102	84	99	93	103	120	91
115	105	89	109	107	89	104	82	106	92
101	109	116	103	100	91	85	104	106	104

Construct a stem-and-leaf plot to visualize these results.

Solution

Step 1: Before we can create the plot, we need to sort the data in order from smallest to largest:

82	82	84	85	89	89	89	91	91	92
93	97	99	100	101	102	103	103	104	104
104	105	106	106	107	109	109	115	116	120

Step 2: Next, we identify the stems. To do that, we cut off the final digit of each number, which leaves us with stems of 8, 9, 10, 11, and 12. Arrange the stems vertically, and add the bar to separate these from the leaves:

8

9

10

11

12

Step 3: Write down the leaves on the right side of the bar, giving just the final digit (that we cut off to make the stems) of each data value. List these in order, and make sure they line up vertically:

8	2 2 4 5 9 9 9
9	1 1 2 3 7 9
10	0 1 2 3 3 4 4 4 5 6 6 7 9 9
11	5 6
12	0

Table 8.3
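The three steps above (sort, split off the final digit, list the leaves) can be sketched in Python as follows. This is an illustration rather than a prescribed method: `stem_and_leaf` is a hypothetical helper, and the list `chirps` is an assumption that writes out the 30 chirp counts, repeats included, as they appear on the stems and leaves of the plot reused in Example 8.11.

```python
from collections import defaultdict

# Chirp counts for this example, written out with repeated values included.
# (Assumption: 30 values, matching the leaves of the stem-and-leaf plot.)
chirps = [
    82, 82, 84, 85, 89, 89, 89, 91, 91, 92,
    93, 97, 99, 100, 101, 102, 103, 103, 104, 104,
    104, 105, 106, 106, 107, 109, 109, 115, 116, 120,
]


def stem_and_leaf(values):
    """Group sorted values by stem (all digits except the last one)."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    # Include empty stems so gaps in the data remain visible.
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}


for stem, leaves in stem_and_leaf(chirps).items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

`divmod(value, 10)` performs the "cut off the final digit" step: the quotient is the stem and the remainder is the leaf. Empty stems are kept on purpose, in the same way that the 5 stem stays in Table 8.2 even though it has no leaves.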

Your Turn 8.10

1.

This table gives the records of the Major League Baseball teams at the end of the 2019 season:

Team	Wins	Losses	Team	Wins	Losses
HOU	107	55	PHI	81	81
LAD	106	56	TEX	78	84
NYY	103	59	SFG	77	85
MIN	101	61	CIN	75	87
ATL	97	65	CHW	72	89
OAK	97	65	LAA	72	90
TBR	96	66	COL	71	91
CLE	93	69	SDP	70	92
WSN	93	69	PIT	69	93
STL	91	71	SEA	68	94
MIL	89	73	TOR	67	95
NYM	86	76	KCR	59	103
ARI	85	77	MIA	57	105
BOS	84	78	BAL	54	108
CHC	84	78	DET	47	114

Table 8.4 (source: http://www.mlb.com)

Create a stem-and-leaf plot for the number of wins.

As we mentioned above, stem-and-leaf plots aren’t always going to be useful. For example, if all the data in your dataset are between 20 and 29, then you’ll just have one stem, which isn’t terribly useful. (Although there are methods like stem splitting for addressing that particular problem, we won’t go into those at this time.) On the other end of the spectrum, the data may be so spread out that every stem has only one leaf. (This problem can sometimes be addressed by rounding off the data values to the tens, hundreds, or some other place value, then using that place for the leaves.) Finally, if you have dozens or hundreds (or more) of data values, then a stem-and-leaf plot becomes too unwieldy to be useful. Fortunately, we have other tools we can use.

Histograms

Histograms are visualizations that can be used for any set of quantitative data, no matter how big or spread out. They differ from a categorical bar chart in that the horizontal axis is labeled with numbers (not ranges of numbers), and the bars are drawn so that they touch each other. The heights of the bars reflect the frequencies in each bin. Unlike with stem-and-leaf plots, we cannot recreate the original dataset from a histogram. However, histograms are easy to make with technology and are great for identifying the distribution of our data. Let’s first create one histogram without technology to help us better understand how histograms work.

Example 8.11

Constructing a Histogram

In Example 8.10, we built a stem-and-leaf plot for the number of chirps made by crickets in one minute. Here are the raw data that we used then:

89	97	82	102	84	99	115	105	89	109
107	89	101	109	116	103	100	91	93	103
120	91	85	104	82	106	92	104	106	104

Construct a histogram to visualize these results.

Solution

Step 1: Add data to bins. Histograms are built on binned frequency distributions, so we’ll make that first. Luckily, the stem-and-leaf plot we made earlier can help us do this much more quickly:

8	2 2 4 5 9 9 9
9	1 1 2 3 7 9
10	0 1 2 3 3 4 4 4 5 6 6 7 9 9
11	5 6
12	0

If we’re using bins of width 10, we can compute the frequencies by counting the numbers of leaves associated with the corresponding stem:

Bin	Frequency
80-89	7
90-99	6
100-109	14
110-119	2
120-129	1

(Note that, when we made binned frequency diagrams in the last module, we noted that if the biggest data value was right on the border between two bins, it was OK to lump it in with the lower bin. That’s not recommended when building histograms, so the data value 120 is all alone in the 120-129 bin.)

Step 2: Create the axes. On the horizontal axis, start labeling with the lower end of the first bin (in this case, 80), and go up to the higher end of the last bin (130). Mark off the other bin boundaries, making sure they’re all evenly spaced. On the vertical axis, start with zero and go up at least to the greatest frequency you see in your bins (14 in this example), making sure that the labels are evenly spaced and that consecutive labels differ by the same amount. Let’s count off our vertical axis by threes:

A histogram. The horizontal axis ranges from 80 to 130, in increments of 10. The vertical axis ranges from 0 to 15, in increments of 3. — Figure 8.17

Step 3: Draw in the bars. Remember that the bars of a histogram touch, and that the heights are determined by the frequency. So, the first bar will cover 80 to 90 on the horizontal axis, and have a height of 7:

A histogram shows a vertical bar. The horizontal axis ranges from 80 to 130, in increments of 10. The vertical axis ranges from 0 to 15, in increments of 3.The histogram infers the following data. 80 to 90: 7. — Figure 8.18

Now, we can fill in the others:

A bar chart plots frequency versus number of chirps. The horizontal axis ranges from 80 to 130, in increments of 10. The vertical axis ranges from 0 to 15, in increments of 3.The histogram infers the following data. 80 to 90: 7. 90 to 100: 6. 100 to 110: 14. 110 to 120: 2. 120 to 130: 1. — Figure 8.19

Step 4: Let’s compare the histogram we just created to the stem-and-leaf plot we made earlier:

A bar chart and a stem-and-leaf plot. The stem-and-leaf plot infers the following data. 8: 2, 2, 4, 5, 9, 9, 9. 9: 1, 1, 2, 3, 7, 9. 10: 0, 1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 7, 9, 9. 11: 5, 6. 12: 0. The horizontal axis of the histogram ranges from 80 to 130, in increments of 10. The vertical axis ranges from 0 to 15, in increments of 3. The histogram infers the following data. 80 to 90: 7. 90 to 100: 6. 100 to 110: 14. 110 to 120: 2. 120 to 130: 1. — Figure 8.20

Notice that the leaves on the rotated stem-and-leaf plot match the bars on our histogram! We can view stem-and-leaf plots as sideways histograms. But, as we’ll see soon, we can do much more with histograms.
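The binning in Step 1 can also be done programmatically. The Python sketch below is a hypothetical illustration: it rebuilds the chirp counts from the stems and leaves shown in Step 1, counts how many values fall in each bin, and prints a sideways text histogram. The helper `bin_counts` and its parameters `start` and `width` are names invented for this example, and the function assumes whole-number data no smaller than `start`.

```python
# Leaves from the stem-and-leaf plot in Step 1, keyed by stem.
plot = {
    8: [2, 2, 4, 5, 9, 9, 9],
    9: [1, 1, 2, 3, 7, 9],
    10: [0, 1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 7, 9, 9],
    11: [5, 6],
    12: [0],
}
chirps = [10 * stem + leaf for stem, leaves in plot.items() for leaf in leaves]


def bin_counts(values, start, width):
    """Count values in consecutive bins: [start, start + width), and so on."""
    n_bins = (max(values) - start) // width + 1
    counts = [0] * n_bins
    for value in values:
        # Integer division finds the index of the bin the value falls into.
        counts[(value - start) // width] += 1
    return counts


counts = bin_counts(chirps, start=80, width=10)
for i, count in enumerate(counts):
    low = 80 + 10 * i
    print(f"{low:>3}-{low + 9:<3} | {'#' * count} {count}")
```

Changing `width` regroups the same data into wider or narrower bins, which is the same choice the bin-size setting controls when a histogram is built in a spreadsheet.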

Your Turn 8.11

1.

In Your Turn 8.10, you made a stem-and-leaf plot of the number of wins for each MLB team in 2019, using this set of data:

Team	Wins	Losses
HOU	107	55
LAD	106	56
NYY	103	59
MIN	101	61
ATL	97	65
OAK	97	65
TBR	96	66
CLE	93	69
WSN	93	69
STL	91	71
MIL	89	73
NYM	86	76
ARI	85	77
BOS	84	78
CHC	84	78
PHI	81	81
TEX	78	84
SFG	77	85
CIN	75	87
CHW	72	89
LAA	72	90
COL	71	91
SDP	70	92
PIT	69	93
SEA	68	94
TOR	67	95
KCR	59	103
MIA	57	105
BAL	54	108
DET	47	114

Table 8.5

Create a histogram for the number of wins. Use bins of width 10, starting with a bin for 40-49 (so that your histogram reflects the stem-and-leaf plot you made earlier).

Now that we’ve seen the connection between stem-and-leaf plots and histograms, we are ready to look at how we can use Google Sheets to build histograms.

Video

Make a Histogram Using Google Sheets

Let’s use Google Sheets to create a histogram for a large dataset.

Example 8.12

Creating a Histogram in Google Sheets

The file “AvgSAT” contains the average SAT score for students attending every institution of higher learning in the US for which data are available. Create a histogram in Google Sheets of the average SAT scores. Use bins of width 50. Are the data uniformly distributed, symmetric, left-skewed, or right-skewed?

Solution

Using the procedure described in the video above, we get this:

A histogram titled, average SAT scores at US institutions. The horizontal axis representing average SAT ranges from 750 to 1600, in increments of 50. The vertical axis representing frequency ranges from 0 to 300, in increments of 100. The histogram infers the following data. 750 to 800: 4. 800 to 850: 5. 850 to 900: 15. 900to 950: 25. 950 to 1000: 85. 1000 to 1050: 170. 1050 to 1100: 230. 1100 to 1150: 275. 1150 to 1200: 275. 1200 to 1250: 110. 1250 to 1300: 70. 1300 to 1350: 50. 1350 to 1400: 40. 1400 to 1450: 40. 1450 to 1500: 20. 1500 to 1550: 20. 1550 to 1600: 5. Note: all values are approximate. — Figure 8.21 (data source: https://data.ed.gov)

The data are fairly symmetric, but slightly right-skewed.

Your Turn 8.12

1.

The file “InState” contains in-state tuition costs (in dollars) for every institution of higher learning in the United States for which data is available (data from data.ed.gov). Create a histogram in Google Sheets of in-state tuition costs. Choose a bin size that you think works well. Are the data uniformly distributed, symmetric, left-skewed, or right-skewed?

Bar Charts for Labeled Data

Sometimes we have quantitative data where each value is labeled according to the source of the data. For example, in Your Turn 8.12, you looked at in-state tuition data. Every value you used to create that histogram was associated with a school; the schools are the labels. In Your Turn 8.11, you created a histogram of the wins of every Major League Baseball team in 2019. Each of those win totals had a label: the team. If we’re interested in visualizing differences among the different teams, or schools, or whatever the labels are, we create a different version of the bar graph known as a bar chart for labeled data.

These graphs are made in Google Sheets in exactly the same way as regular bar graphs. The only change is that the vertical axis will be labeled with the units for your quantitative data instead of just “Frequency.”

Example 8.13

Building a Bar Chart for Labeled Data

The following table shows the gross domestic product (GDP) for the United States for the years 2010 to 2019:

Year	GDP (in $ trillions)	Year	GDP (in $ trillions)
2010	14.992	2015	18.225
2011	15.543	2016	18.715
2012	16.197	2017	19.519
2013	16.785	2018	20.580
2014	17.527	2019	21.433

Table 8.6 (source: https://data.worldbank.org)

Construct a bar chart that represents these data.

Solution

In this case, the years are the labels, and the data we are interested in are the GDP numbers. Once you have the table above (including the labels) entered into a spreadsheet, click and drag to select the full table. Then, in the “Insert” menu, click “Chart.” The result may not be a bar chart; if it’s not, select “Column chart” in the drop-down menu “Chart type” in the Chart Editor. If you want, you can edit things like the chart title in the “Customize” tab in the Chart Editor.

A bar graph titled, United States GDP, 2010 to 2019. The horizontal axis representing the year ranges from 2010 to 2019, in increments of 1. The vertical axis represents GDP in trillion dollars. The bar graph infers the following data. 2010: 15. 2011: 15.5. 2012: 16. 2013: 17. 2014: 17.5. 2015: 18. 2016: 19. 2017: 19.5. 2018: 20.5. 2019: 21. Note: all values are approximate. — Figure 8.22 (data source: https://data.worldbank.org)
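The same idea can be sketched outside a spreadsheet. The short Python example below is only an illustration, not part of the Google Sheets procedure: it draws a text version of a bar chart for labeled data from Table 8.6. The function names and the choice of one character per $0.5 trillion are assumptions made for this sketch.

```python
# GDP values copied from Table 8.6 (in trillions of dollars); the years are the labels.
gdp_by_year = {
    2010: 14.992, 2011: 15.543, 2012: 16.197, 2013: 16.785, 2014: 17.527,
    2015: 18.225, 2016: 18.715, 2017: 19.519, 2018: 20.580, 2019: 21.433,
}


def text_bar(value, units_per_char=0.5):
    """Return a bar whose length is proportional to value, measured from zero.

    units_per_char is an arbitrary scale chosen for this sketch.
    """
    return "#" * round(value / units_per_char)


def bar_chart_lines(data):
    """Build one line per label: the label, its bar, and the exact value."""
    return [
        f"{label}  {text_bar(value):<44} {value:.3f}"
        for label, value in data.items()
    ]


for line in bar_chart_lines(gdp_by_year):
    print(line)
```

Because every bar is measured from zero, the lengths of the bars stay in the same proportion as the GDP values themselves.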

Your Turn 8.13

1.

The following table shows the world record times (as of February 2020) of the various 100m women’s swimming events in international competition:

Event	Time	Name	Nationality
Freestyle	51.71	Sarah Sjöström	Sweden
Backstroke	57.57	Regan Smith	United States
Breaststroke	64.10	Lilly King	United States
Butterfly	55.48	Sarah Sjöström	Sweden

Table 8.7 (source: https://swimswam.com/records/womens-world-records-lcm/)

Make a visualization of these times using the events as the labels.

Misleading Graphs

Graphical representations of data can be manipulated in ways that intentionally mislead the reader. There are two primary ways this can be done: by manipulating the scales on the axes and by manipulating or misrepresenting areas of bars. Let’s look at some examples of these.

Example 8.14

Misleading Graphs

The table below shows the teams, and their payrolls, in the English Premier League, the top soccer league in England.

Team	Salary (£1,000,000s)	Team	Salary (£1,000,000s)
Manchester United F.C.	175.7	Newcastle United F.C.	56.9
Manchester City F.C.	136.5	Aston Villa F.C.	52.3
Chelsea F.C.	132.8	Fulham F.C.	52.1
Arsenal F.C.	130.7	Southampton F.C.	49.6
Tottenham Hotspur F.C.	129.2	Wolverhampton Wanderers F.C.	49.5
Liverpool F.C.	118.6	Brighton & Hove Albion	43.7
Crystal Palace	85.0	Burnley F.C.	35.5
Everton F.C.	82.5	West Bromwich Albion F.C.	23.8
Leicester City	73.7	Leeds United F.C.	22.5
West Ham United F.C.	69.2	Sheffield United F.C.	19.7

Table 8.8 (source: www.spotrac.com)

How might someone present this data in a misleading way?

Solution

Step 1: Let’s focus on the top five teams. Here’s a bar chart of their payrolls:

A bar chart titled, top five EPL teams by payroll (1,000,000 pounds), January 2020. The horizontal axis represents teams. The vertical axis representing payroll (1,000,000 pounds) ranges from 0 to 200, in increments of 50. The bar graph infers the following data. Tottenham Hotspur: 130. Arsenal F.C.: 131. Chelsea F.C.: 133. Manchester City: 137. Manchester United: 176. Note: all values are approximate. — Figure 8.23 (data source: www.spotrac.com)

Step 2: Now, here’s another bar chart visualizing exactly the same data:

A bar graph titled, top five EPL teams by payroll (1,000,000 pounds), January 2020. The horizontal axis represents teams. The vertical axis representing payroll (1,000,000 pounds) ranges from 120 to 180, in increments of 20. The bar graph infers the following data. Tottenham Hotspur: 130. Arsenal F.C.: 131. Chelsea F.C.: 133. Manchester City: 137. Manchester United: 176. Note: all values are approximate. — Figure 8.24 (data source: www.spotrac.com)

Step 3: You should notice that despite using the same data, these two graphs look strikingly different. In the second graph, the gap between Manchester United and the other four teams looks significantly larger than in the first graph. The scale on the vertical axis has been manipulated here. The first graph's axis starts at zero, while the lowest value on the second graph's axis is 120. This trick has a strong impact on the viewer’s perception of the data.

Checkpoint

Beware of vertical axes that don’t start at zero! They overemphasize differences in heights.
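To put a number on this effect, we can compare how tall the bars are actually drawn in Figures 8.23 and 8.24. The Python sketch below is an illustration using the payrolls from Table 8.8; the function names are made up for this example. The drawn height of a bar is its value minus the lowest value on the vertical axis.

```python
# Payrolls (in millions of pounds) from Table 8.8.
manchester_united = 175.7
tottenham = 129.2


def drawn_height(value, axis_start):
    """Height of a bar as it appears on a chart whose vertical axis begins at axis_start."""
    return value - axis_start


def apparent_ratio(larger, smaller, axis_start):
    """How many times taller the larger bar looks than the smaller one."""
    return drawn_height(larger, axis_start) / drawn_height(smaller, axis_start)


honest = apparent_ratio(manchester_united, tottenham, axis_start=0)
truncated = apparent_ratio(manchester_united, tottenham, axis_start=120)

print(f"Axis starts at 0:   {honest:.2f} times as tall")     # 1.36
print(f"Axis starts at 120: {truncated:.2f} times as tall")  # 6.05
```

Manchester United's payroll is about 1.36 times Tottenham's, but on the axis that starts at 120 its bar is drawn about 6 times as tall.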

Step 4: To further emphasize the difference this creates in our perception, let's look at that data again, but this time using graphics instead of colored areas on our bar graph.

A bar graph titled, top five EPL teams by payroll (1,000,000 pounds), January 2020. Each vertical bar is represented by a 10 pound banknote. The horizontal axis represents teams. The vertical axis representing payroll (1,000,000 pounds) ranges from 120 to 180, in increments of 20. The bar graph infers the following data. Tottenham Hotspur: 130. Arsenal F.C.: 131. Chelsea F.C.: 133. Manchester City: 137. Manchester United: 176. Note: all values are approximate. — Figure 8.25 (data source: www.spotrac.com)

This graph uses an image of a £10 banknote in place of the bars. Using an image that evokes the context of the data in place of a standard, “boring” bar is a common tool that people use when creating infographics. However, this is generally not a good practice because it distorts the data. Notice that our “bars” (the banknotes) are just as tall here as they were in the previous figure. But, to maintain the right proportions, the widths had to be adjusted as well, which changes the area (height × width) of each bar. A key point is that when looking at rectangles, the human eye tends to process areas more easily than heights.

Checkpoint

Beware of infographics! Areas overemphasize a difference that should be measured with a height!
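A quick calculation shows how much the areas exaggerate. The Python sketch below assumes that each banknote image in Figure 8.25 is scaled uniformly, so its width grows in direct proportion to its drawn height; under that assumption, the area grows with the square of the drawn height. The variable names are illustrative only.

```python
# Payrolls (in millions of pounds) from Table 8.8, and the axis start used in Figure 8.25.
manchester_united = 175.7
tottenham = 129.2
axis_start = 120

# Ratio of the drawn heights of the two "bars".
height_ratio = (manchester_united - axis_start) / (tottenham - axis_start)

# Assumption: the image keeps its proportions, so width scales with height
# and the area ratio is the square of the height ratio.
area_ratio = height_ratio ** 2

true_ratio = manchester_united / tottenham

print(f"Actual payroll ratio: {true_ratio:.2f}")    # 1.36
print(f"Drawn height ratio:   {height_ratio:.2f}")  # 6.05
print(f"Drawn area ratio:     {area_ratio:.1f}")    # 36.7
```

Under this assumption, a payroll that is about 1.36 times as large ends up represented by an image with roughly 37 times the area.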

Step 5: Now, let’s look at all 20 teams. This histogram indicates that the data are right-skewed, with the highest number of teams having a payroll between £40 million and £80 million:

A histogram titled, total payrolls teams in the EPL (1,000,000 pounds), January 2020. The horizontal axis representing payroll (1,000,000 pounds) ranges from 0 to 200, in increments of 40. The vertical axis representing frequency ranges from 0 to 8, in increments of 2. The histogram infers the following data. 0 to 40: 4. 40 to 80: 8. 80 to 120: 3. 120 to 160: 4. 160 to 200: 1. — Figure 8.26 (data source: www.spotrac.com)

Step 6: Now let's view this same data in another chart:

A histogram titled frequency versus payrolls (1,000,000 pounds). The horizontal axis represents payroll (1,000,000 pounds) ranges from 0 to 80, in increments of 40. The vertical axis representing frequency ranges from 0 to 8, in increments of 2. The histogram infers the following data. 0 to 40: 4. 40 to 80: 8. Over 80: 8. — Figure 8.27 (data source: www.spotrac.com)

Step 7: Even though this chart uses the same data, the skew seems to be reversed. Why? Well, even though this graph looks like a histogram, it isn’t. Look closely at the labels on the horizontal axis; they don't correspond to spots on the axis, but instead each give a range, meaning this is a bar graph based on a binned frequency distribution.

When we review these ranges, we can see that the last one is misleading because it covers all data “over 80.” If the bins all had the same width, that last bin would run from 80 to 120. However, we can see from the histogram that the maximum value for this data is between 160 and 200. If the last bin in this bar graph were labeled honestly, it would read “80–200,” which would drive home the fact that the width of that bar is misleading.

Checkpoint

Always check the horizontal axis on histograms! The widths of all the bars should be equal.
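The two sets of bar heights in Figures 8.26 and 8.27 can be reproduced directly from Table 8.8. The Python sketch below is an illustration with made-up function names: it counts the payrolls in equal-width bins, then merges the last three bins into a single “over 80” bin.

```python
# Payrolls (in millions of pounds) for all 20 teams in Table 8.8.
payrolls = [
    175.7, 136.5, 132.8, 130.7, 129.2, 118.6, 85.0, 82.5, 73.7, 69.2,
    56.9, 52.3, 52.1, 49.6, 49.5, 43.7, 35.5, 23.8, 22.5, 19.7,
]


def equal_width_counts(values, width, start, stop):
    """Count how many values fall in each bin [start, start + width), and so on up to stop."""
    n_bins = (stop - start) // width
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        counts[index] += 1
    return counts


# Honest histogram: five bins, each 40 wide (0-40, 40-80, 80-120, 120-160, 160-200).
honest = equal_width_counts(payrolls, width=40, start=0, stop=200)

# Misleading version: keep the first two bins, lump everything over 80 into one bar.
misleading = honest[:2] + [sum(honest[2:])]

print(honest)      # [4, 8, 3, 4, 1]
print(misleading)  # [4, 8, 8]
```

The last bar in the misleading version is as tall as the 40 to 80 bar only because it covers a range three times as wide.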

Your Turn 8.14

1.

Take a look again at the win totals for teams in Major League Baseball in 2019:

Team	Wins	Team	Wins
HOU	107	PHI	81
LAD	106	TEX	78
NYY	103	SFG	77
MIN	101	CIN	75
ATL	97	CHW	72
OAK	97	LAA	72
TBR	96	COL	71
CLE	93	SDP	70
WSN	93	PIT	69
STL	91	SEA	68
MIL	89	TOR	67
NYM	86	KCR	59
ARI	85	MIA	57
BOS	84	BAL	54
CHC	84	DET	47

Table 8.9 (source: https://www.espn.com/mlb/standings/_/season/2019/view)

Make one good and one misleading chart showing the number of wins by the top ten teams. Then, looking at all the teams, make one good and one misleading histogram for the win totals.

Video

How to Spot a Misleading Graph

Who Knew?

Napoleon's Failed Invasion

One of the most famous data visualizations ever created is the cartographic depiction by Charles Joseph Minard of Napoleon’s disastrous attempted invasion of Russia.

A portrait of Minard’s Napoleon Map. — Figure 8.28 Minard’s Napoleon Map (credit: Carte de Charles Minard/Wikimedia, public domain)

Minard’s chart is remarkable in that it shows not just how the size of Napoleon’s army shrank drastically over time, but also the location on the map, the direction the army was traveling at the time, and the temperature during the retreat.

Check Your Understanding

The medical office at a zoo tracks the animals it treats each week. The table shows the classifications for a particular week:

Mammal	Mammal	Reptile	Bird	Mammal	Amphibian
Mammal	Mammal	Mammal	Reptile	Mammal	Bird
Mammal	Bird	Reptile	Reptile	Amphibian	Mammal
Bird	Mammal	Amphibian	Mammal	Mammal	Bird

7.

Create a bar graph of the data without technology.

8.

Create a pie chart of the data using technology.

Employees at a college help desk track the number of people who request assistance each week. The table gives a sample of the results:

142	153	158	156	141	143
139	158	156	146	137	153
136	127	157	148	132	139
155	167	143	168	133	157
138	156	164	130	148	136

9.

Make a stem-and-leaf plot of the data.

10.

Create a histogram of the data. Use bins of width 5.

The following are data on the admission rates of the different branch campuses in the University of California system, along with the out-of-state tuition and fee cost:

Campus	Admission Rate	Cost ($)
Berkeley	0.1484	43,176
Davis	0.4107	43,394
Irvine	0.2876	42,692
Los Angeles	0.1404	42,218
Merced	0.6617	42,530
Riverside	0.5057	42,819
San Diego	0.3006	43,159
Santa Barbara	0.322	43,383
Santa Cruz	0.4737	42,952

(source: https://data.ed.gov)

11.

Create a bar graph that illustrates the differences in admission rates among the different campuses.

12.

Create two bar graphs for the out-of-state tuition. One should give an unbiased perception of the differences among them, and the other should overemphasize those differences.

Section 8.2 Exercises

The table below shows the answers to the question, “Which social media platform, if any, do you use most frequently?”

None	Twitter	Snapchat	Snapchat	Twitter	Facebook
Instagram	Snapchat	Twitter	None	Snapchat	Instagram
Instagram	Facebook	None	Instagram	Snapchat	Twitter
Snapchat	Instagram	Instagram	Twitter	Snapchat	Twitter
Facebook	None	Instagram	Instagram	Twitter	Instagram

1 .

Make a bar chart to visualize these responses.

2 .

Make a pie chart to visualize these responses.

A sample of students at a large university were asked whether they were full-time students living on campus (Full-Time Residential, FTR), full-time students who commuted (FTC), or part-time students (PT). The raw data are in the table below:

FTR	FTR	FTC	PT	FTR	PT	FTR	FTC
FTR	FTC	FTC	FTR	FTR	PT	FTC	FTC
FTC	PT	FTC	FTC	PT	FTR	FTC	PT
FTC	PT	FTR	PT	FTC	FTC	FTR	PT

3 .

Make a bar chart to visualize these responses.

4 .

Make a pie chart to visualize these responses.

Students in a statistics class were asked how many countries (besides their home countries) they had visited; the table below gives the raw responses:

2	1	1	3	2	0	2	0	1
2	0	1	0	1	1	0	1	0
0	0	2	1	1	0	1	1	0

5 .

Create a bar graph visualizing these data (treating the responses as categorical).

6 .

Create a pie chart visualizing these data.

The purchasing department for a chain of bookstores wants to make sure they’re buying the right types of books to put on the shelves, so they take a sample of 20 books that customers bought in the last five days and record the genres:

Nonfiction	Young Adult	Romance	Cooking	Young Adult
Young Adult	Thriller	Young Adult	Nonfiction	True Crime
Romance	Nonfiction	Thriller	True Crime	Romance
True Crime	Thriller	Romance	Young Adult	Young Adult

7 .

Create a bar graph to visualize these data.

8 .

Create a pie chart to visualize these data.

An elementary school class is administered a standardized test for which scores range from 0 to 100, as shown below:

60	54	71	80	63
72	70	88	88	67
74	79	50	99	64
98	55	64	86	92
72	65	88	80	65

9 .

Make a stem-and-leaf plot to visualize these results.

10 .

Make a histogram to visualize these results. Use bins of width 10.

The following table gives the final results for the 2021 National Women’s Soccer League season. The columns are standings points (PTS; teams earn three points for a win and one point for a tie), wins (W), losses (L), ties (T), goals scored by that team (GF), and goals scored against that team (GA).

Team	PTS	W	L	T	GF	GA
Portland Thorns FC	44	13	6	5	33	17
OL Reign	42	13	8	3	37	24
Washington Spirit	39	11	7	6	29	26
Chicago Red Stars	38	11	8	5	28	28
NJ/NY Gotham FC	35	8	5	11	29	21
North Carolina Courage	33	9	9	6	28	23
Houston Dash	32	9	10	5	31	31
Orlando Pride	28	7	10	7	27	32
Racing Louisville FC	22	5	12	7	21	40
Kansas City Current	16	3	14	7	15	36

(source: http://www.nwslsoccer.com)

11 .

Make a stem-and-leaf plot for PTS.

12 .

Make a histogram for PTS, using bins of width 5.

13 .

Make a histogram for GF, using bins of width 5.

14 .

Make a histogram for GA, using bins of width 5.

For the following exercises, use the "CUNY" dataset (which gives the location (borough) of each college in the City University of New York (CUNY) system, the highest degree offered, and the proportions of total degrees awarded in a partial list of disciplines) to identify the right visualization to address each question. Then, create those visualizations.

15 .

What is the highest degree offered in colleges across the CUNY system?

16 .

What is the distribution of the proportion of degrees awarded in Information Science across the CUNY system?

17 .

In which boroughs are the CUNY colleges located?

18 .

What are the proportions of degrees awarded across the listed humanities fields (Foreign Language, English, Humanities, Philosophy & Religion, History) at City College?

19 .

What proportions of degrees are awarded in Social Service at the different institutions located in Manhattan?

For the following exercises, use the data found in the "Receivers" dataset on the top 25 receivers (by number of receptions; data collected from pro-football-reference.com) in the NFL during the 2020 season.

20 .

Make a stem-and-leaf plot for the longest receptions (“Long”).

21 .

Make a stem-and-leaf plot for receptions.

22 .

Make a histogram for yards.

23 .

Make a histogram for yards per reception (“Yds/Rec”).

24 .

Make a histogram for the longest receptions (“Long”).

25 .

Make a histogram for receptions.

26 .

Make a histogram for age.

27 .

Describe the distribution of age as left-skewed, symmetric, or right-skewed.

28 .

Describe the distribution of receptions as left-skewed, symmetric, or right-skewed.

29 .

Describe the distribution of yards as left-skewed, symmetric, or right-skewed.

30 .

Describe the distribution of touchdowns (“TD”) as left-skewed, symmetric, or right-skewed.

31 .

Describe the distribution of longest receptions as left-skewed, symmetric, or right-skewed.