Contemporary Mathematics

# 8.1 Gathering and Organizing Data


Figure 8.2 Surveys are commonly used to gather data. (credit: “survey” by Donnell King/Flickr, CC0 1.0 Public Domain)

## Learning Objectives

After completing this section, you should be able to:

1. Distinguish among sampling techniques.
2. Organize data using an appropriate method.
3. Create frequency distributions.

When a polling organization wants to try to establish which candidate will win an upcoming election, the first steps are to write questions for the survey and to choose which people will be asked to respond to the survey. These can seem like simple steps, but they have far-reaching implications in the analysis the pollsters will later carry out. The process by which samples (or groups of units from which we collect data) are chosen can strongly affect the data that are collected. Units are anything that can be measured or surveyed (such as people, animals, objects, or experiments) and data are observations made on units.

One of the most famous failures of good sampling occurred in the first half of the twentieth century. The Literary Digest was among the most respected magazines of the early twentieth century. Despite the name, the Digest was a weekly newsmagazine. Starting in 1916, the Digest conducted a poll to try to predict the winner of each US Presidential election. For the most part, their results were good; they correctly predicted the outcome of all five elections between 1916 and 1932. In 1936, the incumbent President Franklin Delano Roosevelt faced Kansas governor Alf Landon, and once again the Digest ran their famous poll, with results published the week before the election. Their conclusion? Landon would win in a landslide, 57% to 43%. Once the actual votes had been counted, though, Roosevelt ended up with 61% of the popular vote, 18 percentage points more than the poll predicted. What went wrong?

The short answer is that the people who were chosen to receive the survey (over ten million of them!) were not a good representation of the population of voting adults. The sample was chosen using the Digest's own base of subscribers as well as publicly available lists of people who were likely adults (and therefore eligible to vote), mostly phone books and vehicle registration records. The pollsters then mailed every single person on these lists a survey. Around a quarter of those surveys were returned; this constituted the sample that was used to make the Digest’s disastrously incorrect prediction. However, the Digest failed to consider that the election was happening during the Great Depression, when only the wealthy had disposable income to spend on telephone lines, automobiles, and magazine subscriptions. Thus, the Digest’s survey went mostly to wealthier people. Since Roosevelt was extremely popular among poorer voters, many of Roosevelt’s supporters were excluded from the Digest’s sample.

Another more complicated factor was the low response rate; only around 25% of the surveys were returned. This created what’s called a non-response bias.

## Sampling and Gathering Data

The Digest's failure highlights the need for what is now considered the most important criterion for sampling: randomness. This randomness can be achieved in several ways. Here we cover some of the most common.

A simple random sample is chosen in a way that every unit in the population has an equal chance of being selected, and the chances of a unit being selected do not depend on the units already chosen. An example of this is choosing a group of people by drawing names out of a hat (assuming the names are well-mixed in the hat).
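The "names in a hat" idea can be sketched in a few lines of Python. This is only an illustration: the population of 500 numbered units, the sample size of 10, and the seed are all assumptions made for the example, not values from the text.

```python
import random

# Hypothetical population: 500 units identified by the numbers 1..500
# (an assumption for illustration only).
population = list(range(1, 501))

# A fixed seed is used only so the sketch gives the same result each run.
rng = random.Random(2024)

# random.sample draws without replacement, like pulling well-mixed names
# from a hat: every group of 10 units is equally likely to be chosen.
sample = rng.sample(population, k=10)

print(sorted(sample))
```

In practice the seed would be left out so that the draw is not predictable in advance.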

A systematic random sample is selected from an ordered list of the population (for example, names sorted alphabetically or students listed by student ID). First, we decide what proportion of the population will be in our sample. We want to express that proportion as a fraction with 1 in the numerator; let’s call the denominator of that fraction D. Next, we’ll choose a random number between 1 and D. The unit at that position will go into our sample. We’ll find the rest of our sample by choosing every Dth unit in the list, starting with our random number.

To walk through an example, let’s say we want to sample 2% of the population: $2\% = \frac{2}{100} = \frac{1}{50}$. (Note: If the number in the denominator isn’t a whole number, we can just round it off. This part of the process doesn’t have to be precise.) We can then use a random number generator to find a random number between 1 and 50; let's use 31. In our example, our sample would then be the units in the list at positions 31, 81 (31 + 50), 131 (81 + 50), and so forth.
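The procedure just described can be sketched as a short Python function. The function name `systematic_sample`, the list of 1,000 made-up units, and the seed are assumptions for illustration; only the value D = 50 comes from the example above.

```python
import random

def systematic_sample(units, d, rng):
    """Pick a random start between 1 and d, then take every d-th unit.

    Positions are 1-based, matching the description in the text.
    """
    start = rng.randint(1, d)                     # inclusive on both ends
    positions = range(start, len(units) + 1, d)   # start, start + d, start + 2d, ...
    return start, [units[p - 1] for p in positions]

# Hypothetical ordered list of 1,000 units (an assumption for illustration).
units = [f"unit-{i}" for i in range(1, 1001)]

# Sampling 2% of the population corresponds to D = 50.
start, sample = systematic_sample(units, d=50, rng=random.Random(7))

print(start, len(sample), sample[:3])
```

With 1,000 units and D = 50, every possible starting position yields a sample of 20 units, which is 2% of the list.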

A stratified sample is one chosen so that particular groups in the population are certain to be represented. Let’s say you are studying the population of students in a large high school (where the grades run from 9th to 12th), and you want to choose a sample of 12 students. If you use a simple or systematic random sample, there’s a pretty good chance that you’ll miss one grade completely. In a stratified sample, you would first divide the population into groups (the strata), then take a random sample within each stratum (that’s the singular form of “strata”). In the high school example, we could divide the population into grades, then take a random sample of three students within each grade. That would get us to the 12 students we need while ensuring coverage of each grade.
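A minimal sketch of the high school example, assuming a made-up roster of 200 students per grade (the roster, the student labels, and the helper name `stratified_sample` are hypothetical):

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Take a simple random sample of `per_stratum` units from each stratum."""
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical roster: 200 students in each of grades 9 through 12.
strata = {grade: [f"G{grade}-S{i}" for i in range(1, 201)]
          for grade in (9, 10, 11, 12)}

chosen = stratified_sample(strata, per_stratum=3, rng=random.Random(11))
total = sum(len(students) for students in chosen.values())

for grade, students in chosen.items():
    print(grade, students)
print("sample size:", total)
```

Because the random draw happens inside each grade, all four grades are guaranteed to appear in the sample of 12.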

A cluster sample is a sample where clusters of units are chosen at random, instead of choosing individual units. For example, if we need a sample of college students, we may take a list of all the course sections being offered at the college, choose three of them at random (the sections are the clusters), and then survey all the students in those sections. A sample like this one has the advantage of convenience: If the survey needs to be administered in person, many of your sample units will be located in one place at the same time.
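The course-section example might look like the following in Python. The 40 sections of 25 students each are invented for the sketch, as is the helper name `cluster_sample`; the only detail taken from the text is that three sections are chosen at random and everyone in them is surveyed.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random, then keep every unit in them."""
    picked = rng.sample(sorted(clusters), n_clusters)
    sample = [unit for name in picked for unit in clusters[name]]
    return picked, sample

# Hypothetical college: 40 course sections with 25 students each.
clusters = {
    f"SEC-{i:02d}": [f"SEC-{i:02d}/student-{j}" for j in range(1, 26)]
    for i in range(1, 41)
}

picked, sample = cluster_sample(clusters, n_clusters=3, rng=random.Random(5))

print(picked)
print("students surveyed:", len(sample))
```

Note that the randomness applies to the sections, not to individual students, which is what distinguishes this from a stratified sample.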

## Example 8.1

### Random Sampling

For each of the following situations, identify whether the sample is a simple random sample, a systematic random sample, a stratified random sample, a cluster random sample, or none of these.

1. A postal inspector wants to check on the performance of a new mail carrier, so she chooses four streets at random among those that the carrier serves. Each household on the selected streets receives a survey.
2. A hospital wants to survey past patients to see if they were satisfied with the care they received. The administrator sorts the patients into groups based on the department of the hospital where they were treated (ICU, pediatrics, or general), and selects patients at random from each of those groups.
3. A quality control engineer at a factory that makes smartphones wants to figure out the proportion of devices that are faulty before they are shipped out. The phones are currently packed in boxes for shipping, each of which holds 20 devices. The engineer wants to sample 100 phones, so he selects five boxes at random and tests every phone in those five boxes.
4. A newspaper reporter wants to write a story on public perceptions on a project that will widen a congested street. She stands on the side of the street in question and interviews the first five people she sees there.
5. An executive at a streaming video service wants to know if her subscribers would support a second season of a new show. She gets a list of all the subscribers who have watched at least one episode of the show, and uses a random number generator to select a sample of 50 people from the list.
6. An agent for a state’s Department of Revenue is in charge of selecting 100 tax returns for audit. He has a list of all of the returns eligible for audit (about 12,000 in all), sorted by the taxpayer’s ID number. He asks a computer to give him a random number between 1 and 120; it gives him 15. The agent chooses the 15th, 135th, 255th, 375th, and every 120th return after that to be audited.

For each of the following situations, identify whether the sample is a simple random sample, a systematic random sample, a stratified random sample, a cluster random sample, or none of these.
1. The chairperson of the University Chess Club is trying to decide on a time for the club’s regular meetings, so she emails all of the members of the club to find their preferences.
2. The registrar at a small college wants to use a survey to determine if their office could do a better job of serving students. They choose three students at random from each major to take the survey.
3. A civic club is organizing a raffle as a fundraiser. To determine the three winners, each of the tickets is put into a large drum, then the tickets are thoroughly mixed. A blindfolded club member pulls three tickets out of the drum.

## People in Mathematics

### George Gallup

Figure 8.3 George Gallup was a founder of survey sampling techniques, and his legacy lives on to this day. (credit: "George Gallup at the National Press Club, Washington, D.C., 1969" by Bernard Gotfryd/Library of Congress Prints & Photographs Division, public domain)

George Gallup (1901–1984) rose to fame in 1936 when his prediction of the percentage of the vote going to each candidate in that year’s U.S. Presidential election was more accurate than the one published in Literary Digest, and he did so using a sample that was much smaller than the Digest’s. He even took it one step further, predicting with high accuracy the erroneous results of the poll that the Literary Digest would end up publishing! Gallup’s theories on public opinion polling essentially created that field. In 1948, Gallup’s reputation took a bit of a hit when he famously, but incorrectly, predicted that Thomas Dewey would beat incumbent Harry Truman in that year’s Presidential election. Over the following decades, however, public trust in Gallup’s polls recovered and even steadily increased. The company Gallup founded continues to conduct daily public opinion polls and provides consulting services for businesses.

## Organizing Data

Once data have been collected, we turn our attention to analysis. Before we analyze, though, it’s useful to reorganize the data into a format that makes the analysis easier. For example, if our data were collected using a paper survey, our raw data are all broken down by respondent (represented by an individual response sheet). To perform an analysis on all the responses to an individual question, we need to first group all the responses to each question together. The way we organize the data depends on the type of data we’ve collected.

There are two broad types of data: categorical and quantitative. Categorical data classify each unit into a group (or category). Examples of categorical data include a response to a yes-or-no question, or the color of a person’s eyes. Quantitative data are numerical measures of a property of a unit. Examples of quantitative data include the time it takes for a rat to run through a maze or a person’s daily calorie intake. We’ll look at each type of data in turn when considering how best to organize.

### Categorical Data Organization

The best way to organize categorical data is using a categorical frequency distribution. A categorical frequency distribution is a table with two columns. The first contains all the categories present in the data, each listed once. The second contains the frequencies of each category, which are just a count of how often each category appears in the data.

## Example 8.2

### Creating a Categorical Frequency Distribution

A teacher records the responses of the class (28 students) on the first question of a multiple choice quiz, with five possible responses (A, B, C, D, and E):

 A A C A B B A E A C A A A C E A B A A C A B E E A A C C

Create a categorical frequency distribution that organizes the responses.
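One way to tally these responses is with Python's `collections.Counter`, which counts how often each distinct value appears. The sketch below is an illustration rather than the only approach; the responses are copied from the example, and the column headings are an assumed layout.

```python
from collections import Counter

# The 28 quiz responses from the example above.
responses = "A A C A B B A E A C A A A C E A B A A C A B E E A A C C".split()

# Counter maps each category present in the data to its frequency.
counts = Counter(responses)

print("Response  Frequency")
for category in sorted(counts):
    print(f"{category:<9} {counts[category]}")

print("Total    ", sum(counts.values()))
```

Only categories that actually occur are listed, so response D (which nobody chose) does not get a row. A quick check is that the frequencies add up to the class size of 28.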

1. Students in a statistics class who were asked to provide their majors provided the data below:

   Undecided, Biology, Biology, Sociology, Political Science, Sociology, Undecided, Undecided, Undecided, Biology, Biology, Education, Biology, Biology, Political Science, Political Science

   Create a categorical frequency distribution to organize these responses.

### Quantitative Data

We have a couple of options available for organizing quantitative data. If there are just a few possible responses, we can create a frequency distribution just like the ones we made for categorical data above. For example, if we’re surveying a group of high school students and we ask for each student’s age, we’ll likely only get whole-number responses between 13 and 19. Since there are only around seven (and likely fewer) possible responses, we can treat the data as if they’re categorical and create a frequency distribution as before.

## Example 8.3

### Creating a Quantitative Frequency Distribution

Attendees of a conflict resolution workshop are asked how many siblings they have. The responses are as follows:

 1 0 1 1 2 0 3 1 1 4 1 2 0 1 3 1 2 1 2 4 1 0 1 3 0 1 2 2 1 5

Create a frequency distribution to organize the responses.
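Because only a handful of distinct values occur, the same counting approach used for categorical data applies here. The following sketch (an illustration, with the responses copied from the example) lists every whole number from the smallest to the largest response so that the rows appear in numerical order.

```python
from collections import Counter

# The 30 responses from the example above.
siblings = [1, 0, 1, 1, 2, 0, 3, 1, 1, 4, 1, 2, 0, 1, 3,
            1, 2, 1, 2, 4, 1, 0, 1, 3, 0, 1, 2, 2, 1, 5]

counts = Counter(siblings)

print("Siblings  Frequency")
for value in range(min(siblings), max(siblings) + 1):
    # Counter returns 0 for a value that never occurs.
    print(f"{value:<9} {counts[value]}")

print("Total    ", sum(counts.values()))
```

As before, the frequencies should add up to the number of responses, which is 30 here.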

1. A question on a community survey asked each respondent to give the number of people who shared their residence, and the data from the responses were as follows:

   1 3 2 2 1 3 3 4 2 2 2 4 1 1 2 3 1 1 5 2 1 4 3 2 1 2 2 1 3 1 3 3 4 1 4 2 2 2 1 4

   Create a frequency distribution to organize the responses.

If there are many possible responses, a frequency distribution table like the ones we’ve seen so far isn’t really useful; there will likely be many responses with a frequency of one, which means the table will be no better than looking at the raw data. In these cases, we can create a binned frequency distribution. A binned frequency distribution groups the data into ranges of values called bins, then records the number of responses in each bin.

For example, if we have height data for individuals measured in centimeters, we might create bins like 150–155 cm, 155–160 cm, and so forth (making sure that every data value falls into a bin). We must be careful, though; in this scenario, it’s not clear which bin would contain a response of 155 cm. Usually, responses on the edge of a bin are placed in the higher bin, but it’s good practice to make that clear. In cases where responses are rounded off, you can avoid this issue by leaving a gap between the bins that couldn’t contain any responses. In our example, if the measurements were all rounded off to the nearest centimeter, we could make bins like 150–154 cm, 155–159 cm, etc. (since a response like 154.2 isn’t possible). We’ll use this method going forward. How do we decide what the boundaries of our bins should be? There’s no one right way to do that, but there are some guidelines that can be helpful.

1. Every data value should fall into exactly one bin. For example, if the lowest value in our data is 42, the lowest bin should not be 45–49.
2. Every bin should have the same width. Note that if we shift the upper limits of our bins down a bit to avoid ambiguity (as described above), we can’t simply subtract the lower limit from the upper limit to get the bin width; instead, we subtract the lower limit of the bin from the lower limit of the next bin. For example, if we’re looking at GPAs rounded to the nearest hundredth, we might choose bins like 2.00–2.24, 2.25–2.49, 2.50–2.74, etc. These bins all have a width of 0.25.
3. If the minimum or maximum value of the data falls right on the boundary between two bins, then it’s OK to bend the rule just a little in order to avoid having an additional bin containing just that one value. We’ll see an example of this in just a moment.
4. If we have too many or too few bins, it can be difficult to get a good sense of the distribution. Seven or eight bins is ideal, but that’s not a firm rule; anything between five and twelve is fine. We often choose the number of bins so that the widths are round numbers.
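
The width calculation in the second guideline can be written out explicitly. For the GPA bins 2.00–2.24 and 2.25–2.49, the width is the difference between consecutive lower limits, not the difference between the two limits of a single bin:

$$
\text{bin width} = 2.25 - 2.00 = 0.25 \qquad \text{rather than} \qquad 2.24 - 2.00 = 0.24
$$

The same reasoning applies to the height bins 150–154 cm and 155–159 cm, whose width is $155 - 150 = 5$ cm.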

## Example 8.4

### Creating a Binned Frequency Distribution

The GPAs of students enrolled in an advanced sociology class are listed in the following table. At this institution, 4.00 is the maximum possible GPA.

 3.93 3.43 2.87 2.51 2.7 1.91 2.32 2.85 3.06 3.03 3.49 1.84 3.72 2.56 1.99 3.4 3.74 3.23 1.98 3.05 1.43 2.9 1.2 3.72 3.56 3.07 2.58 4 2.79 3.81 2.6 3.69 2.88 3.34 1.51 3.63 3.45 1.89 2.3 2.98 3.04 2.7

Create a binned frequency distribution for the data.
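The sketch below shows one way to carry out the binning in Python. The choice of bins is an assumption made for illustration, not the only valid answer: six bins of width 0.50 starting at 1.00, with the single maximum value of 4.00 folded into the top bin (3.50–4.00) rather than given a bin of its own, as the third guideline allows. Values are converted to whole hundredths so that the bin arithmetic uses integers.

```python
# GPAs copied from the example above.
gpas = [3.93, 3.43, 2.87, 2.51, 2.7, 1.91, 2.32, 2.85, 3.06, 3.03, 3.49,
        1.84, 3.72, 2.56, 1.99, 3.4, 3.74, 3.23, 1.98, 3.05, 1.43, 2.9,
        1.2, 3.72, 3.56, 3.07, 2.58, 4, 2.79, 3.81, 2.6, 3.69, 2.88, 3.34,
        1.51, 3.63, 3.45, 1.89, 2.3, 2.98, 3.04, 2.7]

# Assumed bin layout, in hundredths of a grade point:
# lowest lower limit 1.00, width 0.50, six bins, maximum GPA 4.00.
LOW, WIDTH, N_BINS, MAX = 100, 50, 6, 400

counts = [0] * N_BINS
for gpa in gpas:
    hundredths = round(gpa * 100)
    # Integer division finds the bin; min() puts 4.00 into the top bin.
    index = min((hundredths - LOW) // WIDTH, N_BINS - 1)
    counts[index] += 1

print("GPA range   Frequency")
for i, count in enumerate(counts):
    lower = LOW + i * WIDTH
    upper = lower + WIDTH - 1 if i < N_BINS - 1 else MAX
    print(f"{lower / 100:.2f}-{upper / 100:.2f}   {count}")
```

Changing `WIDTH` and `N_BINS` (for example, to 25 and 12) gives a finer table, which is a quick way to compare how different bin choices summarize the same data.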

1. The following table displays the ages of a sample of customers who have shopped at a new boutique.

   56 39 35 32 26 53 55 47 70 43 33 33 43 41 26 40 31 34 33 53

   Create a binned frequency distribution to summarize these data.

For the following problems, decide whether randomization is being used in the selection of these samples. If it is, identify the type of random sample (simple, systematic, cluster, or stratified).
1. High school guidance counselors want to know the proportion of the school’s seniors who intend to apply for college. They choose four senior homerooms at random, then visit each one and ask every student in those homerooms whether they intend to apply.
2. A quality control technician wants to ensure that the sandals being made in their factory are up to specifications, so they check the first five pairs they see coming off the line.
3. A college athletic department wants to check up on the mental wellness of its student-athletes. The department wants to ensure every varsity sport is represented, so they survey three randomly selected members of each team.
4. The purchasing manager for a chain of bookstores wants to make sure they’re buying the right types of books to put on the shelves, so they take a sample of 20 books that customers bought in the last five days and record the genres. Use the raw data below to create a categorical frequency distribution.

5. A survey of college students asked how many courses those students were currently taking. Create a quantitative frequency distribution to summarize the raw data given below:

   3 4 4 3 5 4 4 3 2 3 5 5 3 3 4 3 2 4 3 3 4 3 5 3 3 3 2 3 1 3 4 3
6. The World Bank provides data on every country in the world. The following is a sample of twenty-five countries, along with the number of cell phone subscriptions registered in that country per hundred residents. Create a binned frequency distribution for the cell phone data.

| Country | Cell subscriptions per 100 residents | Country | Cell subscriptions per 100 residents |
| --- | --- | --- | --- |
| Cameroon | 83.7 | Benin | 78.5 |
| Vanuatu | 82.5 | Eritrea | 13.7 |
| Georgia | 140.7 | Mauritania | 92.2 |
| Kazakhstan | 146.6 | Czech Republic | 119 |
| Bermuda | 105.9 | Qatar | 151.1 |
| Russia | 157.9 | Pakistan | 73.4 |
| Hungary | 113.5 | Egypt | 105.5 |
| Costa Rica | 180.2 | Nepal | 123.2 |
| Algeria | 111 | Turkey | 96.4 |
| Somalia | 48.3 | Congo | 43.5 |
| Fiji | 114.2 | Venezuela | 78.5 |
| Angola | 44.7 | | |

(source)

## Section 8.1 Exercises

For the following exercises, data are collected on a sample of items found in a grocery store. Classify each of these datasets obtained from that sample as being categorical or quantitative.
1. Price
2. Calories per serving
3. Whether the product is gluten-free
4. Package weight
5. Country of origin

For the following exercises, decide whether random samples are being selected. If they are, decide whether they are simple, systematic, cluster, or stratified.
6.
7. An electronics retailer uses a computer to randomly select customers in its rewards club to take a survey about their interest in a new product.
8. The student affairs office at a university wants to make sure students who live on campus are satisfied with their access to laundry facilities. They select five students at random from each residence hall to take the survey.
9. A professor wants to gauge how much time her students spend on homework, so she asks that question of each student who comes to her office hours that day.
10. The management at a restaurant wants feedback about its new menu. They choose ten tables at random, and survey each person seated at that table.
11. The transit authority in a large city wants to know about usage on a particular train route. They choose a number between 1 and 5 at random, and get 4. They then count the number of people on the fourth train to pass through the station, and then count every fifth train after that.
12. A candidate for a seat in the U.S. Congress wants to learn which issues are most important to her potential constituents. She chooses 50 people at random from each zip code in her district to survey.

For the following exercises, you have been tasked with surveying a sample of 100 registered voters who live in your town. You have access to a spreadsheet containing the following data on every registered voter: name, address, age, phone number. The spreadsheet also can generate a unique random number for each person.
13. Describe how you might choose a simple random sample from this population.
14. Describe how you might choose a stratified random sample from this population to ensure that all age groups are represented.
15. Assume that there are 50,000 registered voters on your list. Describe how you might choose a systematic random sample from this population.
16. A sample of students was asked, “Which social media platform, if any, do you use most frequently?” The raw responses are given in this table:

Create a categorical frequency distribution to summarize these data.

17. A sample of students at a large university were asked whether they were full-time students living on campus (Full-Time Residential, FTR), full-time students who commuted (FTC), or part-time students (PT). The raw data are in the table below:
 FTR FTR FTC PT FTR PT FTR FTC FTC PT FTC FTC PT FTR FTC PT FTR FTC FTC FTR FTR PT FTC FTC FTC PT FTR PT FTC FTC FTR PT

Give the categorical frequency distribution for these data.

18. A survey of students in a math class asked for the respondents’ birth months. The table below lists the responses:
 Dec Feb Apr Sep Nov Dec Aug Feb Feb Sep Oct Feb Jun Jan Jul May May Jan Mar Feb Nov Oct Apr Oct Aug Jan May Jan

Give the categorical frequency distribution of the birth months.

19. Students in a statistics class were asked how many countries (besides their home countries) they had visited. The table below gives the raw responses:
 0 2 1 1 3 2 0 0 0 2 1 1 0 1 1 0 2 0 1 0 1 0 2 0 1 1 0 0 1 0

Create a frequency distribution to summarize the data.

20. The following table contains the top 25 receivers (by number of receptions) in the NFL during the 2020 season, along with their teams and the number of fumbles each made over the course of the season:

| Player | Team | Fumbles | Player | Team | Fumbles |
| --- | --- | --- | --- | --- | --- |
| Stefon Diggs | BUF | 0 | Calvin Ridley | ATL | 1 |
| Davante Adams | GNB | 1 | Robert Woods | LAR | 2 |
| DeAndre Hopkins | ARI | 3 | Justin Jefferson | MIN | 1 |
| Darren Waller | LVR | 2 | Diontae Johnson | PIT | 2 |
| Travis Kelce | KAN | 1 | Tyreek Hill | KAN | 1 |
| Allen Robinson | CHI | 0 | Terry McLaurin | WAS | 1 |
| Keenan Allen | LAC | 3 | Alvin Kamara | NOR | 1 |
| Tyler Lockett | SEA | 1 | D.K. Metcalf | SEA | 1 |
| JuJu Smith-Schuster | PIT | 3 | Cole Beasley | BUF | 0 |
| Robby Anderson | CAR | 1 | Brandin Cooks | HOU | 0 |
| Amari Cooper | DAL | 0 | J.D. McKissic | WAS | 3 |
| Cooper Kupp | LAR | 1 | Tyler Boyd | CIN | 1 |
| Curtis Samuel | CAR | 1 | | | |

(source)
Create a frequency distribution for the number of fumbles made by these players.
21. A public opinion poll about an upcoming election asked respondents, “How many political advertisements do you recall seeing on television in the last 24 hours?” The responses were as follows:
 6 2 5 5 2 2 4 1 3 0 1 2 1 6 2 5 2 4 8 6 3 3 4 2 5 3 4 2 2 3

Create a frequency distribution for these data.

For the following exercises, use the following table of data on the top 15 receivers (by number of receptions) in the NFL during the 2020 season:

| Player | Team | Age | Receptions | Yards | Yds/Rec | TD | Long |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Stefon Diggs | BUF | 27 | 127 | 1535 | 12.1 | 8 | 55 |
| Davante Adams | GNB | 28 | 115 | 1374 | 11.9 | 18 | 56 |
| DeAndre Hopkins | ARI | 28 | 115 | 1407 | 12.2 | 6 | 60 |
| Darren Waller | LVR | 28 | 107 | 1196 | 11.2 | 9 | 38 |
| Travis Kelce | KAN | 31 | 105 | 1416 | 13.5 | 11 | 45 |
| Allen Robinson | CHI | 27 | 102 | 1250 | 12.3 | 6 | 42 |
| Keenan Allen | LAC | 28 | 100 | 992 | 9.9 | 8 | 28 |
| Tyler Lockett | SEA | 28 | 100 | 1054 | 10.5 | 10 | 47 |
| JuJu Smith-Schuster | PIT | 24 | 97 | 831 | 8.6 | 9 | 31 |
| Robby Anderson | CAR | 27 | 95 | 1096 | 11.5 | 3 | 75 |
| Amari Cooper | DAL | 26 | 92 | 1114 | 12.1 | 5 | 69 |
| Cooper Kupp | LAR | 27 | 92 | 974 | 10.6 | 3 | 55 |
| Calvin Ridley | ATL | 26 | 90 | 1374 | 15.3 | 9 | 63 |
| Robert Woods | LAR | 28 | 90 | 936 | 10.4 | 6 | 56 |
| Justin Jefferson | MIN | 21 | 88 | 1400 | 15.9 | 7 | 71 |

(source)
22. Make a binned frequency distribution for receiving yards (“Yards”) using bins of width 200.
23. Make another binned frequency distribution for receiving yards (“Yards”), but this time use bins of width 250.
24. Make a binned frequency distribution for number of yards per reception (“Yds/Rec”), using bins of width 1.
25. Make a binned frequency distribution for longest reception (“Long”), using bins of width 10.