Contemporary Mathematics

8.8 Scatter Plots, Correlation, and Regression Lines

A scatter plot. The x-axis ranges from 20 to 120, in increments of 20. The y-axis ranges from 0 to 160, in increments of 40. The points are scattered along a line from 40 to 110 on the horizontal axis and 20 to 120 on the vertical axis. The line represents the best line of fit and it passes through the following points: (20, 20), (40, 40), (60, 60), (80, 80), (100, 100), and (120, 122). Note: all values are approximate.
Figure 8.65 A scatter plot is a visualization of the relationship between two quantitative datasets.

Learning Objectives

After completing this section, you should be able to:

  1. Construct a scatter plot for a dataset.
  2. Interpret a scatter plot.
  3. Distinguish among positive, negative and no correlation.
  4. Compute the correlation coefficient.
  5. Estimate and interpret regression lines.

One of the most powerful tools statistics gives us is the ability to explore relationships between two datasets containing quantitative values, and then use that relationship to make predictions. For example, a student who wants to know how well they can expect to score on an upcoming final exam may consider reviewing the data on midterm and final exam scores for students who have previously taken the class. It seems reasonable to expect that there is a relationship between those two datasets: If a student did well on the midterm, they were probably more likely to do well on the final than the average student. Similarly, if a student did poorly on the midterm, they probably also did poorly on the final exam.

Of course, that relationship isn’t set in stone; a student’s performance on a midterm exam doesn’t cement their performance on the final! A student might use a poor result on the midterm as motivation to study more for the final. A student with a really good grade on the midterm might be overconfident going into the final, and as a result doesn’t prepare adequately.

The statistical method of regression can find a formula that does the best job of predicting a score on the final exam based on the student’s score on the midterm, as well as give a measure of the confidence of that prediction! In this section, we’ll discover how to use regression to make these predictions. First, though, we need to lay some graphical groundwork.

Relationships Between Quantitative Datasets

Before we can evaluate a relationship between two datasets, we must first decide if we feel that one might depend on the other. In our exam example, it is appropriate to say that the score on the final depends on the score on the midterm, rather than the other way around: if the midterm depended on the final, then we’d need to know the final score first, which doesn’t make sense.

Here’s another example: if we collected data on home purchases in a certain area, and noted both the sale price of the house and the annual household income of the purchaser, we might expect a relationship between those two. Which depends on the other? In this case, sale price depends on income: people who have a higher income can afford a more expensive house. If it were the other way around, people could buy a new, more expensive house and then expect a raise! (This is very bad advice.)

It's worth noting that not every pair of related datasets has clear dependence. For example, consider the percent of a country’s budget devoted to the military and the percent earmarked for public health. These datasets are generally related: as one goes up, the other goes down. However, in this case, there’s not a preferred choice for dependence, as each could be seen as depending on the other. When exploring the relationship between two datasets, if one set seems to depend on the other, we’ll say that dataset contains values of the response variable (or dependent variable). The dataset that the response variable depends on contains values of what we call the explanatory variable (or independent variable). If no dependence relationship can be identified, then we can assign either dataset to either role.

Example 8.44

Identifying Explanatory and Response Variables

For each of the following pairs of related datasets, identify which (if any) should be assigned the role of response variable and which should be assigned to be the explanatory variable.

  1. A person’s height and weight
  2. A professional basketball player’s salary and their average points scored per game (which is a measure of how good they are at basketball)
  3. The length and width of leaves on a tree

Your Turn 8.44

Given these pairs of datasets, identify which (if either) would be the best choice for the response variable.

1.
A person’s age and their annual income
2.
A student’s GPA and their score on the SAT
3.
A student’s GPA and the number of hours they spend studying per week

Once we’ve assigned roles to our two datasets, we can take the first step in visualizing the relationship between them: creating a scatter plot.

Creating Scatter Plots

A scatter plot is a visualization of the relationship between two quantitative sets of data. The scatter plot is created by turning the datasets into ordered pairs: the first coordinate contains data values from the explanatory dataset, and the second coordinate contains the corresponding data values from the response dataset. These ordered pairs are then plotted in the $xy$-plane. Let's return to our exam example to put this into practice.

Example 8.45

Creating Scatter Plots Without Technology

Students are exploring the relationship between scores on the midterm exam and final exam in their math course. Here are some of the scores reported by their classmates:

Name        Midterm grade   Final grade
Student 1   88              84
Student 2   71              80
Student 3   75              77
Student 4   94              95
Student 5   68              73

Create a scatter plot to visualize the data.
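
The first step described above, turning the two datasets into ordered pairs, can be sketched in a few lines of Python. This is only an illustration using the five scores in the table; the variable names are our own, and the plotting itself is left to graph paper or a spreadsheet.

```python
# Exam scores from the table above.
# Midterm is the explanatory dataset; final is the response dataset.
midterm = [88, 71, 75, 94, 68]
final = [84, 80, 77, 95, 73]

# Pair each explanatory value with its corresponding response value.
points = list(zip(midterm, final))

# Each pair is one point (x, y) to plot.
for x, y in points:
    print(f"({x}, {y})")
```

Each printed pair gives the horizontal and vertical position of one student's point.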

Your Turn 8.45

1.
Create a scatter plot to visualize the following data, showing the top five NFL receivers by number of receptions for the 2019 season. Treat Yards as the response:
Player            Receptions   Yards
Stefon Diggs      127          1535
Davante Adams     115          1374
DeAndre Hopkins   115          1407
Darren Waller     107          1196
Travis Kelce      105          1416
Table 8.12 (source: https://www.pro-football-reference.com/years/2019/)

For large datasets, it’s impractical to create scatter plots manually. Luckily, Google Sheets automates this process for us.

Example 8.46

Creating Scatter Plots in Google Sheets

The dataset “NHL19” gives the results of the 2018–2019 National Hockey League season. The columns are team, wins (W), losses (L), overtime losses (OTL), total points (PTS), goals scored by the team (GF), goals scored against the team (GA), and goal differential (the difference in GF and GA). Use Google Sheets to create a scatter plot for GF vs. GA.

Checkpoint

When we talk about plotting one set versus another, the first is the response and the second is explanatory.

Your Turn 8.46

1.
With the data in "NHL19," use Google Sheets to create a scatter plot for points (PTS) vs. wins (W).

Reading and Interpreting Scatter Plots

Scatter plots give us information about the existence and strength of a relationship between two datasets. To break that information down, there is a series of questions we can ask. First: Is there a curved pattern in the data? If the answer is “yes,” then we can stop; none of the linear regression techniques from here to the end of this section are appropriate. Figure 8.71 through Figure 8.74 show several examples of scatter plots that can help us identify these curved patterns.

A scatter plot represents a curved pattern. The x-axis ranges from 20 to 120, in increments of 20. The y-axis ranges from negative 100 to 500, in increments of 100. The scatter plot shows points scattered along a curve that passes through the following points: (20, 90), (40, 20), (50, 0), (80, 100), (100, 250), and (120, 500). Note: all values are approximate.
Figure 8.71 Curved pattern
A scatter plot represents no curved pattern. The x-axis ranges from 30 to 110, in increments of 10. The y-axis ranges from 0 to 140, in increments of 20. The points are scattered throughout, lying between 30 and 100 on the horizontal axis and between 10 and 120 on the vertical axis.
Figure 8.72 No curved pattern
A scatter plot represents a curved pattern. The x-axis ranges from 30 to 100, in increments of 10. The y-axis ranges from 40 to 120, in increments of 10. The scatter plot shows points scattered along a curve that passes through the following points: (30, 80), (40, 60), (55, 50), (70, 65), (90, 102), (100, 110), and (100, 105). Note: all values are approximate.
Figure 8.73 Curved pattern
A scatter plot represents no curved pattern. The x-axis ranges from 30 to 110, in increments of 10. The y-axis ranges from 0 to 300, in increments of 50. The points are scattered in linear decreasing order. Some of the points are as follows: (35, 250), (40, 220), (50, 200), (60, 175), (70, 150), (80, 125), (90, 75), and (100, 50). Note: all values are approximate.
Figure 8.74 No curved pattern

Once we have confirmed that there is no curved pattern in our data, we can move to the next question: Is there a linear relationship? To answer this, we must look at different values of the explanatory variable and determine whether the corresponding response values are different, on average. It's important to look at the values “on average” because, in general, our scatter plots won’t include just one corresponding response point for each value of the explanatory variable (i.e., there may be multiple response values for each explanatory value). So, we try to look for the center of those points. Let’s look again at Figure 8.74, but consider some different values for the explanatory variable. Let’s highlight the points whose $x$-values are around 50 and those that are around 80:

A scatter plot. The x-axis ranges from 30 to 110, in increments of 10. The y-axis ranges from 0 to 300, in increments of 50. The points are scattered in linear decreasing order. Some of the points are highlighted in red and the points are as follows: (50, 290), (48, 195), (48, 200), (49.5, 225), (50, 203), (78, 145), (79, 130), (80, 120), (81, 120), (81, 70), (82, 90), (82.5, 90), (82, 100), and (82, 120). Note: all values are approximate.
Figure 8.75

Now, we can estimate the middle of each group of points. Let's add our estimated averages to the plot as starred points:

A scatter plot. The x-axis ranges from 30 to 110, in increments of 10. The y-axis ranges from 0 to 300, in increments of 50. The points are scattered in linear decreasing order. Some of the points are highlighted in red and the points are as follows: (50, 290), (48, 195), (48, 200), (49.5, 225), (50, 203), (78, 145), (79, 130), (80, 120), (81, 120), (81, 70), (82, 90), (82.5, 90), (82, 100), and (82, 120). Asterisks are marked at (50, 200) and (80, 120). Note: all values are approximate.
Figure 8.76

Since those two starred points occur at different heights, we can conclude that there’s likely a relationship worth exploring.

Here’s another example using a different set of data:

A scatter plot. The x-axis ranges from 30 to 110, in increments of 10. The y-axis ranges from 0 to 250, in increments of 50. The points are scattered throughout, lying between 30 and 105 on the horizontal axis and between 50 and 250 on the vertical axis.
Figure 8.77

Let’s look again at the points near 50 and near 80, and estimate the middles of those clusters:

A scatter plot. The x-axis ranges from 30 to 110, in increments of 10. The y-axis ranges from 0 to 250, in increments of 50. The points are scattered throughout, lying between 30 and 105 on the horizontal axis and between 50 and 250 on the vertical axis. Some of the points are marked in red: (48, 120), (48, 140), (50, 150), (50, 155), (48, 190), (51, 195), (82, 75), (80, 125), (81, 130), (80, 160), (81, 160), (82, 140), (81, 195), (81, 200), and (78, 210). Asterisks are marked at (50, 150) and (80, 170). Note: all values are approximate.
Figure 8.78

Notice that there’s not much vertical distance between our two starred points. This tells us that there’s not a strong relationship between these two datasets.

Positive and Negative Linear Relationships

Another way to assess whether there is a relationship between two datasets in a scatter plot is to see if the points seem to be clustered around a line (specifically, a line that’s not horizontal). The stronger the clustering around that line is, the stronger the relationship.

Once we’ve established that there’s a relationship worth exploring, it’s time to start quantifying that relationship. Two datasets have a positive linear relationship if the values of the response tend to increase, on average, as the values of the explanatory variable increase. If the values of the response decrease with increasing values of the explanatory variable, then there is a negative linear relationship between the two datasets. The strength of the relationship is determined by how closely the scatter plot follows a single straight line: the closer the points are to that line, the stronger the relationship. The scatter plots in Figure 8.79 to Figure 8.85 depict varying strengths and directions of linear relationships.

A scatter plot. The x-axis ranges from 30 to 110, in increments of 10. The y-axis ranges from 210 to 290, in increments of 10. The points are arranged in linear decreasing order in a single row. Some of the points are as follows: (32, 288), (40, 280), (60, 260), (70, 250), (90, 230), and (100, 220). Note: all values are approximate.
Figure 8.79 Perfect negative relationship
A scatter plot. The x-axis ranges from 30 to 110, in increments of 10. The y-axis ranges from 200 to 300, in increments of 20. The points are arranged in linear decreasing order in multiple rows. Some of the points are as follows: (32, 287), (40, 281), (60, 260), (70, 245), (90, 230), and (100, 225). Note: all values are approximate.
Figure 8.80 Strong negative relationship
A scatter plot. The x-axis ranges from 30 to 110, in increments of 10. The y-axis ranges from 150 to 350, in increments of 50. The points are scattered throughout. Most points lie from 30 to 100 on the horizontal axis and 200 to 300 on the vertical axis.
Figure 8.81 Weak negative relationship
A scatter plot. The x-axis ranges from 30 to 110, in increments of 10. The y-axis ranges from 0 to 250, in increments of 50. The points are scattered throughout. Most points lie from 30 to 105 on the horizontal axis and 50 to 250 on the vertical axis.
Figure 8.82 No relationship
A scatter plot. The x-axis ranges from 20 to 110, in increments of 10. The y-axis ranges from 300 to 500, in increments of 20. The points are scattered at the center of the graph. Most points lie from 50 to 90 on the horizontal axis and 340 to 440 on the vertical axis.
Figure 8.83 Weak positive relationship
A scatter plot. The x-axis ranges from 20 to 110, in increments of 10. The y-axis ranges from 300 to 500, in increments of 20. The points are scattered in linear increasing order in a single row. Some of the points are as follows: (27, 345), (40, 360), (60, 380), (70, 390), (80, 400), and (100, 420). Note: all values are approximate.
Figure 8.84 Strong positive relationship
A scatter plot. The x-axis ranges from 20 to 110, in increments of 10. The y-axis ranges from 300 to 500, in increments of 20. The points are scattered in linear increasing order in a single row. Some of the points are as follows: (27, 345), (40, 360), (60, 380), (70, 390), (80, 400), and (100, 420). Note: all values are approximate.
Figure 8.85 Perfect positive relationship

The strength and direction (positive or negative) of a linear relationship can also be measured with a statistic called the correlation coefficient (denoted $r$). Positive values of $r$ indicate a positive relationship, while negative values of $r$ indicate a negative relationship. Values of $r$ close to 0 indicate a weak relationship, while values close to $\pm 1$ correspond to a very strong relationship. Looking again at Figure 8.79 to Figure 8.85, the correlation coefficients for each, in sequential order, are: −1, −0.97, −0.55, −0.03, 0.61, 0.97, and 1.

There’s no firm rule that establishes a cutoff value of $r$ to divide strong relationships from weak ones, but $\pm 0.7$ is often given as the dividing line (i.e., if $r > 0.7$ or $r < -0.7$ the relationship is strong, and if $-0.7 < r < 0.7$ the relationship is weak).
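
The rule of thumb above can be written as a small Python function. This is a sketch, not a standard: the function name `classify` is our own, and because the rule doesn't say what to do when $r$ is exactly $\pm 0.7$, we assume here that such values count as weak.

```python
def classify(r, cutoff=0.7):
    """Describe a correlation coefficient using the +/-0.7 rule of thumb.

    Assumption: a value exactly at the cutoff is treated as weak.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    strength = "strong" if abs(r) > cutoff else "weak"
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "no direction"
    return f"{strength}, {direction}"

# The seven correlation coefficients listed for the figures above.
for r in [-1, -0.97, -0.55, -0.03, 0.61, 0.97, 1]:
    print(r, "->", classify(r))
```

Note that this rule labels −0.03 as "weak, negative" even though its scatter plot is captioned "No relationship"; the cutoff only separates strong from weak.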

The formula for computing $r$ is very complicated; it’s almost never done without technology. Google Sheets will do the computation for you using the CORREL function. The syntax works like this: if your explanatory values are in cells A2 to A50 and the corresponding response values are in B2 to B50, then you can find the correlation coefficient by entering “=CORREL(A2:A50, B2:B50)”. (Note that the order doesn’t matter for correlation coefficients; “=CORREL(B2:B50, A2:A50)” will give the same result.)
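
For readers curious about what the spreadsheet is doing, here is a minimal Python sketch of the computation. It assumes the standard formula for the correlation coefficient, which this section does not spell out; the function name `correlation` is our own. The data are the five midterm and final exam scores from Example 8.45.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation coefficient r for two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sums of squared deviations for each dataset.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

midterm = [88, 71, 75, 94, 68]
final = [84, 80, 77, 95, 73]

r = correlation(midterm, final)
print(round(r, 2))                            # 0.92
print(round(correlation(final, midterm), 2))  # 0.92 (order doesn't matter)
```

The value 0.92 indicates a strong positive relationship between midterm and final scores for these five students.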

Let’s put all of this together in an example.

Example 8.47

Interpreting Scatter Plots

Consider the four scatter plots below:

  1. A scatter plot shows points arranged in a parabolic path. The x-axis ranges from negative 30 to 20, in increments of 10. The y-axis ranges from negative 10 to 60, in increments of 10. The points are scattered in the form of an open upward parabola. Some of the points are as follows: (negative 20, 45), (negative 10, 10), (0, 0), (10, 10), and (15, 22). Note: all values are approximate.
    Figure 8.86
  2. A scatter plot shows points arranged in increasing order. The x-axis ranges from negative 30 to 20, in increments of 10. The y-axis ranges from negative 100 to 150, in increments of 50. The points are scattered in increasing order. Some of the points are as follows: (negative 20, negative 50), (negative 10, 0), (0, 40), (10, 75), and (15, 100). Note: all values are approximate.
    Figure 8.87
  3. A scatter plot shows points arranged in decreasing order. The x-axis ranges from negative 30 to 20, in increments of 10. The y-axis ranges from negative 20 to 20, in increments of 10. The points are scattered in decreasing order. Some of the points are as follows: (negative 20, 10), (negative 10, 5), (0, 0), (10, negative 5), and (15, negative 15). Note: all values are approximate.
    Figure 8.88
  4. A scatter plot. The x-axis ranges from negative 30 to 20, in increments of 10. The y-axis ranges from 0 to 40, in increments of 5. The points are scattered throughout, lying between negative 20 and 15 on the horizontal axis and between 0 and 30 on the vertical axis.
    Figure 8.89

For each of these, answer the following questions:

  1. Is there a curved pattern in the data? If yes, stop here. If no, continue to question 2.
  2. Classify the strength and direction of the relationship. Make a guess at the value of $r$.

Your Turn 8.47

For each of the plots below, answer the following questions:

  1. Is there a curved pattern in the data? If yes, stop here. If no, continue to question 2.
  2. Classify the strength and direction of the relationship. Make a guess at the value of $r$.
1.
A scatter plot shows points arranged in decreasing order. The x-axis ranges from negative 30 to 30, in increments of 10. The y-axis ranges from negative 50 to 250, in increments of 50. The points are scattered in decreasing order. Some of the points are as follows: (negative 20, 150), (negative 10, 100), (0, 100), (10, 50), and (23, 0). Note: all values are approximate.
2.
A scatter plot shows points arranged in decreasing order. The x-axis ranges from negative 30 to 30, in increments of 10. The y-axis ranges from negative 2 to 12, in increments of 2. The points are scattered in decreasing order along a curved path. Some of the points are as follows: (negative 20, 9), (negative 10, 4), (0, 2), (10, 0), and (22, 0). Note: all values are approximate.
3.
A scatter plot. The x-axis ranges from negative 30 to 30, in increments of 10. The y-axis ranges from negative 20 to 120, in increments of 20. The points are scattered throughout. The points lie from negative 20 to 20 on the horizontal axis and 20 to 100 on the vertical axis.
4.
A scatter plot shows points arranged in increasing order. The x-axis ranges from negative 30 to 30, in increments of 10. The y-axis ranges from negative 20 to 120, in increments of 20. The points are scattered throughout. The points lie from negative 20 to 20 on the horizontal axis and 20 to 100 on the vertical axis.

Example 8.48

Finding the Correlation Coefficient

The data that were plotted in the previous example can be found in the dataset “correlationcoefficient1”. All of them share the same values for the explanatory variable $x$. The four responses are labeled $y_1$ through $y_4$. Compute the correlation coefficients for each, if appropriate, using Google Sheets. Round to the nearest hundredth.

Your Turn 8.48

1.
The data that were plotted in Your Turn 8.47 can be found in the dataset “correlationcoefficient2”. All of them share the same values for the explanatory variable $x$. The four responses are labeled $y_1$ through $y_4$. Compute the correlation coefficients for each, if appropriate, using Google Sheets. Round to the nearest hundredth.

WORK IT OUT

Winning with Statistics

Billy Beane, the former general manager of the Oakland A’s baseball team, famously took his low budget team to unprecedented heights by using statistics to identify undervalued players; his story is recounted in the book Moneyball (which was later made into a movie, with Brad Pitt playing Beane). You can do the same thing: Take a look at team statistics in the sport of your choice and try to identify a statistic that’s most closely related to winning (meaning that it has the highest correlation coefficient with team wins).

Linear Regression

The final step in our analysis of the relationship between two datasets is to find and use the equation of the regression line. For a given set of explanatory and response data, the regression line (also called the least-squares line or line of best fit) is the line that does the best job of approximating the data.

What does it mean to say that a particular line does the “best job” of approximating the data? The way that statisticians characterize this “best line” is rather technical, but we’ll include it for the sake of satisfying your curiosity (and backing up the claim of "best"). Imagine drawing a line that looks like it does a pretty good job of approximating the data. Most of the points in the scatter plot will probably not fall exactly on the line; the distance above or below the line a given point falls is called that point’s residual. We could compute the residuals for every point in the scatter plot. If you take all those residuals and square them, then add the results together, you get a statistic called the sum of squared errors for the line (the name tells you what it is: “sum” because we’re adding, “squared” because we’re squaring, and “errors” is another word for “residuals”). The line that we choose to be the “best” is the one that has the smallest possible sum of squared errors. The implied minimization (“smallest”) is where the “least” in “least squares” comes from; the “squares” comes from the fact that we’re minimizing the sum of squared errors. This is very similar to the process we outlined in the "game" that we used to introduce the mean. Both the regression line and the mean are designed to minimize a sum of squared errors. Here ends the super technical part.
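
To make the idea concrete, here is a short Python sketch that computes the sum of squared errors for two candidate lines on the five exam scores from Example 8.45. The first line, $y = 0.687x + 27.4$, is the regression line reported later in this section for that data; the second, $y = x$, is just a hypothetical guess ("the final score equals the midterm score") chosen for comparison. The function name `sse` is our own.

```python
def sse(slope, intercept, xs, ys):
    """Sum of squared errors (residuals) for the line y = slope * x + intercept."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (slope * x + intercept)  # distance above/below the line
        total += residual ** 2
    return total

midterm = [88, 71, 75, 94, 68]
final = [84, 80, 77, 95, 73]

regression_sse = sse(0.687, 27.4, midterm, final)
guess_sse = sse(1.0, 0.0, midterm, final)  # hypothetical guess: y = x

print(round(regression_sse, 1))  # 43.6
print(round(guess_sse, 1))       # 127.0
```

The regression line's sum of squared errors is much smaller than the guess's, which is what "best" means here. (The coefficients 0.687 and 27.4 are rounded, so the true minimum is very slightly lower still.)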

Finding the Equation of the Regression Line

So, how do we find the equation of the regression line? Recall the point-slope form of the equation of a line:

FORMULA

If a line has slope $m$ and passes through a point $(x_0, y_0)$, then the point-slope form of the equation of the line is:

$$y = m(x - x_0) + y_0$$

The regression line has two properties that we can use to find its equation. First, it always passes through the point of means. If $\bar{x}$ and $\bar{y}$ are the means of the explanatory and response datasets, respectively, then the point of means is $(\bar{x}, \bar{y})$. We’ll use that as the point in the point-slope form of the equation. Second, if $s_x$ and $s_y$ are the standard deviations of the explanatory and response datasets, respectively, and if $r$ is the correlation coefficient, then the slope is $m = r \times \frac{s_y}{s_x}$. Putting all that together with the point-slope formula gives us this:

FORMULA

Suppose $x$ and $y$ are explanatory and response datasets that have a linear relationship. If their means are $\bar{x}$ and $\bar{y}$ respectively, their standard deviations are $s_x$ and $s_y$ respectively, and their correlation coefficient is $r$, then the equation of the regression line is:

$$y = r\left(\frac{s_y}{s_x}\right)(x - \bar{x}) + \bar{y}$$

Let's walk through an example.

Example 8.49

Finding the Equation of the Regression Line from Statistics

Suppose you have datasets $x$ and $y$ with the following statistics: $x$ has mean 21 and standard deviation 4, $y$ has mean 8 and standard deviation 2, and their correlation coefficient is −0.4. What’s the equation of the regression line?
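
One way to check your work on this example is with a short Python sketch of the formula. The function name `regression_line` is our own; it computes the slope as the correlation coefficient times the ratio of standard deviations, then rearranges the point-slope form into slope-intercept form.

```python
def regression_line(x_mean, x_sd, y_mean, y_sd, r):
    """Return (slope, intercept) of the regression line from summary statistics."""
    if x_sd == 0:
        raise ValueError("the explanatory data must have nonzero standard deviation")
    slope = r * y_sd / x_sd
    # y = slope * (x - x_mean) + y_mean  =>  intercept = y_mean - slope * x_mean
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Statistics from this example: x has mean 21 and SD 4, y has mean 8 and SD 2, r = -0.4.
slope, intercept = regression_line(21, 4, 8, 2, -0.4)
print(f"y = {slope:.1f}x + {intercept:.1f}")  # y = -0.2x + 12.2
```

In point-slope form, the same line is $y = -0.2(x - 21) + 8$.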

Your Turn 8.49

1.
Suppose you have datasets x and y with the following statistics: x has mean 100 and standard deviation 5, y has mean 200 and standard deviation 20, and their correlation coefficient is 0.75. What’s the equation of the regression line?

As you can see, finding the equation of the regression line involves a lot of steps if you have to find all of the values of the needed quantities yourself. But, as usual, technology comes to our rescue. This video (which you actually watched earlier when learning how to create scatter plots) covers the regression line at around the 3:30 mark. Note that Google Sheets calls it the "trendline."

Let's put this into practice.

Example 8.50

Finding the Equation of the Regression Line Using Google Sheets

In Example 8.46, we considered the relationship between goals scored (GF) and goals against (GA) using the dataset “NHL19”. Recreate the scatter plot in Google Sheets, and use it to find the equation of the regression line.

Your Turn 8.50

1.
In Your Turn 8.46, you created a scatter plot for points (PTS) vs. wins (W) using the dataset “NHL19”. Recreate the scatter plot in Google Sheets, and use it to find the equation of the regression line.

Using the Equation of the Regression Line

Once we’ve found the equation of the regression line, what do we do with it? We’ll look at two possible applications: making predictions and interpreting the slope.

We can use the equation of the regression line to predict the response value $y$ for a given explanatory value $x$. All we have to do is plug that explanatory value into the formula and see what response value results. This is useful in two ways: first, it can be used to make a guess about an unknown data value (like one that hasn’t been observed yet). Second, it can be used to evaluate performance (meaning, we can predict an outcome given a particular event). In Example 8.45, we created a scatter plot of final exam scores vs. midterm exam scores using this data:

Name       Midterm Grade   Final Grade
Allison    88              84
Benjamin   71              80
Carly      75              77
Daniel     94              95
Elmo       68              73

The equation of the regression line is $y = 0.687x + 27.4$, where $y$ is the final exam score and $x$ is the midterm exam score. If Frank scored 85 on the midterm, then our prediction for his final exam score is $0.687 \times 85 + 27.4 = 85.795$. To use the regression line to evaluate performance, we use a data value we’ve already observed. For example, Allison scored 88 on the midterm. The regression line predicts that someone who scores an 88 on the midterm will get $0.687 \times 88 + 27.4 = 87.856$ on the final. Allison actually scored 84 on the final, meaning she underperformed expectations by almost 4 points ($87.856 - 84 = 3.856$).
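
Both uses of the regression line, predicting and evaluating, amount to plugging a midterm score into the equation. Here is a brief Python sketch of the two calculations above; the function name `predict_final` is our own.

```python
def predict_final(midterm_score):
    """Predicted final exam score from the regression line y = 0.687x + 27.4."""
    return 0.687 * midterm_score + 27.4

# Prediction for an unobserved value: Frank scored 85 on the midterm.
frank_prediction = predict_final(85)
print(round(frank_prediction, 3))  # 85.795

# Evaluating performance: Allison scored 88 on the midterm and 84 on the final.
allison_prediction = predict_final(88)
allison_difference = 84 - allison_prediction  # actual minus predicted
print(round(allison_prediction, 3))  # 87.856
print(round(allison_difference, 3))  # -3.856
```

A negative difference means the actual score fell below the prediction (underperformance); a positive difference would mean overperformance.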

The second application of the equation of the regression line is interpreting the slope of the line to describe the relationship between the explanatory and response datasets. For the exam data in the previous paragraph, the slope of the regression line is 0.687. Recall that the slope of a line can be computed by finding two points on the line and dividing the difference in the $y$-values of those points by the difference in the $x$-values. Keeping that in mind, we can interpret our slope as $0.687 = \frac{\text{difference in final scores}}{\text{difference in midterm scores}}$. Multiplying both sides of that equation by the denominator of the fraction, we get $0.687 \times \text{difference in midterm scores} = \text{difference in final scores}$. Thus, a one-point increase in the midterm score would result in a predicted increase in the final score of 0.687 points. A ten-point drop in the midterm score would give us a decrease in the predicted final score of 6.87 points. In general, the slope gives us the predicted change in the response that corresponds to a one unit increase in the explanatory variable.

Example 8.51

Applying the Equation of the Regression Line

The data in “MLB2019Off” gives offensive team stats for the 2019 Major League Baseball season. Use that dataset to answer the following questions:

  1. What is the equation of the regression line for runs (R) vs. hits (H)?
  2. How many runs would we expect a team to score if the team got 1500 hits in a season?
  3. Did the Kansas City Royals (KCR) overperform or underperform in terms of runs scored, based on their hit total? By how much?
  4. Write a sentence to interpret the slope of the regression line.

Your Turn 8.51

Using the “MLB2019Off” dataset, answer the following:
1.
What is the equation of the regression line for the number of times a runner is caught stealing a base (CS) vs. the number of successful stolen bases (SB)?
2.
How many times would we expect a team to be caught stealing if the team steals 70 bases in a season?
3.
Did the Philadelphia Phillies (PHI) overperform or underperform in terms of getting caught stealing, based on their stolen base total? By how much?
4.
Write a sentence to interpret the slope of the regression line.

Who Knew?

Math and the Movies

Statistics and regression are used by Hollywood movie producers to decide what movies to make, and to predict how much money they’ll earn at the box office. According to the American Statistical Association, not only do producers use statistics to identify the next potential blockbuster, but they’ve also pinned down how much money awards add to the bottom line. (An Academy Award is worth about $3 million!) In addition, studios use their streaming services to gather data about their customers and the types of movies they watch; this data helps them learn what kinds of entertainment their customers want more of.

WORK IT OUT

Collecting and Analyzing Your Own Data

This section has demonstrated many pairs of related quantitative datasets. Think about some quantitative variables that you can ask your classmates about, which might be related. Once you have some ideas, collect the data from your classmates. Then analyze the data by creating a scatter plot, finding the equation of the regression line (if appropriate), and interpreting it.

Extrapolation

A very common misuse of regression techniques is extrapolation: making a prediction for a value that falls outside the range of the data used to build the regression line.
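
A simple safeguard is to compare the value you want to predict for against the range of explanatory values you actually observed. The sketch below assumes that reading of extrapolation (a value outside the observed range) and uses the midterm scores from Example 8.45; the function name `is_extrapolation` is our own.

```python
def is_extrapolation(x, observed_xs):
    """True if x lies outside the range of the observed explanatory values."""
    return x < min(observed_xs) or x > max(observed_xs)

# Midterm scores from Example 8.45 (observed range: 68 to 94).
midterm = [88, 71, 75, 94, 68]

print(is_extrapolation(85, midterm))  # False: 85 is inside the observed range
print(is_extrapolation(30, midterm))  # True: no one in the data scored near 30
```

A prediction for a midterm score of 30 would rely on the line continuing far beyond anything in the data, so it should not be trusted.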

Example 8.52

More Applying the Equation of the Regression Line

The data in “WNBA2019” gives team statistics from the 2019 WNBA season. Use that dataset to answer these questions about team wins (W) and the proportion of team field goals made (FG%, the number of shots made divided by the number of shots attempted; even though this column is labeled with a percent sign, the values are not expressed as percentages):

  1. What is the equation of the regression line for wins vs. proportion of made field goals?
  2. How many wins would we expect for a team that makes 42% of its shots?
  3. Did the New York Liberty overperform or underperform in terms of wins, based on the team’s proportion of made field goals?
  4. Write a sentence to interpret the slope of the regression line.

Your Turn 8.52

Use the data in “WNBA2019” to answer these questions about the relationship between the proportion of made field goals (FG%) and the proportion of made three-point field goals (3P%):
  1. What is the equation of the regression line for proportion of made three-point field goals vs. proportion of made field goals?
  2. What proportion of made three-point field goals would we expect for a team that makes 44% of its field goals?
  3. Did the Dallas Wings overperform or underperform in terms of proportion of made three-point field goals, based on the team’s proportion of made field goals?
  4. Write a sentence to interpret the slope of the regression line.

Correlation Does Not Imply Causation

One of the most common fallacies in statistics is assuming that a strong relationship between two datasets means that one causes the other. In the dataset “Public”, we find that the correlation coefficient between the 75th percentile math SAT score and the 75th percentile verbal SAT score is 0.92, which indicates a very strong positive relationship. The slope of the regression line that predicts the verbal score from the math score is 0.729, which we might interpret as follows: “If the 75th percentile math SAT score goes up by 10 points, we’d expect the corresponding verbal SAT score to go up by just over 7 points.”

Does the increasing math score cause the increase in the verbal score? Probably not. What’s really going on is that there’s a third variable that’s affecting them both: To raise the SAT math score by 10 points, a school will recruit students who do better on the SAT in general; these students will also naturally have higher SAT verbal scores. This third variable is sometimes called a lurking variable or a confounding variable. Unless all possible lurking variables are ruled out, we cannot conclude that one thing causes another.
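The effect of a lurking variable can be sketched with a small simulation. In the Python code below, a hidden "overall strength" value drives both a simulated math score and a simulated verbal score, and neither score is computed from the other. Every number here (the baselines of 600 and 590, the multipliers, the noise levels, the sample size) is an invented assumption chosen only for illustration; none comes from the “Public” dataset.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable

math_scores, verbal_scores = [], []
for _ in range(500):
    strength = random.gauss(0, 1)  # lurking variable: hidden overall strength
    # Each score depends on the hidden variable plus its own noise,
    # but never on the other score.
    math_scores.append(600 + 80 * strength + random.gauss(0, 30))
    verbal_scores.append(590 + 60 * strength + random.gauss(0, 30))

r = pearson_r(math_scores, verbal_scores)
print(f"r between simulated math and verbal scores: {r:.2f}")
```

Under these assumptions the two simulated scores come out strongly correlated even though, by construction, changing one would do nothing to the other.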

People in Mathematics

Dr. Talithia Williams

Figure 8.90 A photo of Dr. Williams (credit: Used by permission of Talithia Williams)

Dr. Talithia Williams is a statistician on the faculty of Harvey Mudd College, and the first Black woman to achieve tenure at the college. She advocates for more women to become involved in the fields of engineering and science, and is on the board of directors for the EDGE Foundation, an organization that helps women obtain advanced degrees in mathematics (EDGE stands for Enhancing Diversity in Graduate Education). In 2018, Dr. Williams published the book Power in Numbers: The Rebel Women of Mathematics, a retrospective look at historical female figures who have contributed to the development of the field of mathematics.

Dr. Williams earned a Master’s degree in Mathematics from Howard University and a Master’s in Statistics from Rice University, and went on to earn her Ph.D. in Statistics from Rice. She has held research appointments at the Jet Propulsion Laboratory, the National Security Agency, and NASA. Her research focuses on the environmental and medical applications of statistics. In 2014, she gave a popular TED talk titled “Own Your Body’s Data” that discussed the potential insights to be gained from collecting personal health data. She has also served as a host for the NOVA Wonders documentary series and as a narrator for the NOVA Universe series on PBS.

To stay up to date on Dr. Williams’s accomplishments, you can follow her on Twitter or Facebook.

Who Knew?

Statistics and Eugenics

Some of the brightest minds in the history of statistics unfortunately decided to use their considerable intellects to further a pseudoscience known as eugenics. Eugenicists took Charles Darwin’s theories of evolution and ruthlessly applied them to the human race. Francis Galton (1822–1911), a cousin of Darwin and also the mathematician who invented the formula for standard deviation, claimed that people in the British upper classes possessed higher intelligence due to their superior breeding. Karl Pearson (1857–1936), who derived the formula for the correlation coefficient, argued in National Life from the Standpoint of Science that, instead of providing social welfare programs, nations could better improve the fortunes of the poor by waging “war with inferior races.” Ronald Fisher (1890–1962) was possibly the most important statistician of the 20th century, having invented several new techniques (including the ubiquitous analysis of variance), and yet he also founded the Cambridge University Undergraduates Eugenics Society, whose self-prescribed goal was to evangelize “not by precept only, but by example, the doctrine of a new natural ability of worth and blood.”

When eugenics took hold in the United States, it was used to justify terrible acts by the government, including the forced sterilization of individuals with mental illness, epilepsy, a physical impairment (like blindness), or a criminal history. The Nazi regime took these ideas to their ultimate, terrible conclusion: killing people who had mental or physical disabilities, or who were born into an “inferior” race. Over six million people died in the Holocaust, one of the darkest events in human history. To learn more, watch this video about Francis Galton and the legacy of eugenics.

Check Your Understanding

59.
Make a scatter plot for the following data without technology:
x 20 11 8 22 25
y 13 15 17 13 10

For the following problems, answer these questions:

  1. Is there a curved pattern in the data? If yes, stop here.
  2. Classify the strength and direction of the relationship. Make a guess at the value of r.

60.
 A scatter plot. The x-axis ranges from 20 to 70, in increments of 10. The y-axis ranges from 92 to 104, in increments of 2. The scatter plot shows the points arranged in the form of an open downward parabola and some of the points are as follows: (30, 94), (35, 97), (40, 99), (45, 100), (50, 101), (55, 100), and (95, 95). The region between 40 and 60 on the horizontal axis has more points. Note: all values are approximate.
61.
 A scatter plot. The x-axis ranges from 20 to 70, in increments of 10. The y-axis ranges from negative 20 to 120, in increments of 20. The points are scattered throughout. The points lie from 30 to 70 on the horizontal axis and 0 to 80 on the vertical axis.

Use the data in “MLB2019Off” to investigate the relationship between slugging percentage (SLG, explanatory) and runs scored (R, response).

62.
What’s the correlation coefficient? Round to the nearest hundredth.
63.
What’s the equation of the regression line? Round the slope and intercept to the nearest whole number.

The regression equation used to predict average monthly faculty salary (FacSal) from out-of-state tuition (OutState) using the data in “TNSchools” is y = 0.161x + 2645.

64.
Predict the average monthly faculty salary for a school that charges $30,000 in out-of-state tuition.
65.
Maryville College charges $34,880 for out-of-state students, and their average monthly faculty salary is $6,765. Do they pay faculty more or less than expected? By how much?
66.
Write a sentence to interpret the slope.

Section 8.8 Exercises

1 .
This table contains data for the first five schools (alphabetically) that fielded an NCAA Division I men’s basketball team in the 2018–2019 season. It shows the total number of points each team scored (PF) and the total number of points their opponents scored against them (PA). Create a scatter plot without technology of PA vs. PF.
School PF PA
Abilene Christian 2502 2161
Air Force 2179 2294
Akron 2271 2107
Alabama A&M 1938 2285
Alabama-Birmingham 2470 2370
(source: www.sports-reference.com)
2 .
This table contains data for the first five schools (alphabetically) that fielded an NCAA Division I men’s basketball team in the 2018–2019 season. It shows the total number of field goals each team scored (FG) and the total number of three-point field goals they scored (3P). Create a scatter plot without technology of 3P vs. FG.
School FG 3P
Abilene Christian 897 251
Air Force 802 234
Akron 797 297
Alabama A&M 736 182
Alabama-Birmingham 906 234
(source: www.sports-reference.com)
For the following exercises, use the data in “MBB2019”, on every school that fielded an NCAA Division I men’s basketball team in the 2018–2019 season.
3 .
Use Google Sheets to create a scatter plot of points scored against a team (PA) vs. points scored by the team (PF).
4 .
Use Google Sheets to create a scatter plot of number of three-point field goals made (3P) vs. total field goals made (FG).
5 .
Use Google Sheets to create a scatter plot of number of fouls (Fouls) vs. number of blocks (BLK).
6 .
Use Google Sheets to create a scatter plot of points scored (PF) vs. percent of three-point shots made (3P%).

For the following exercises, answer the following questions:

  1. Is there a curved pattern in the data? If yes, stop here. If no, continue to part 2.
  2. Classify the strength and direction of the relationship. Make a guess at the value of r.

7 .
A scatter plot shows points arranged in decreasing order. The x-axis ranges from 10 to 90, in increments of 10. The y-axis ranges from 0 to 200, in increments of 50. The points are scattered in decreasing order. Some of the points are as follows: (20, 170), (30, 150), (50, 100), (70, 75), and (75, 50). Note: all values are approximate.
8 .
A scatter plot shows points scattered in increasing order. The x-axis ranges from 10 to 90, in increments of 10. The y-axis ranges from negative 40 to 60, in increments of 20. Most of the points are scattered above the x-axis and some of the points are scattered below the axis. The points are in increasing order and scattered almost throughout the graph. Most points lie from 30 to 80 on the horizontal axis and 0 to 50 on the vertical axis.
9 .
A scatter plot. The x-axis ranges from 10 to 100, in increments of 10. The y-axis ranges from negative 60 to 60, in increments of 20. Most of the points are scattered at the center of the graph. The points lie above and below the horizontal axis. Most points lie from 20 to 80 on the horizontal axis and negative 20 to 40 on the vertical axis.
10 .
A scatter plot shows a curved pattern. The x-axis ranges from 10 to 90, in increments of 10. The y-axis ranges from negative 100 to 200, in increments of 50. The points follow a curved pattern and the points are arranged in increasing order. Some of the points are as follows: (15, 50), (30, 25), (40, 0), (50, 0), (60, 25), (70, 50), and (75, 100). Note: all values are approximate.
11 .
A scatter plot. The x-axis ranges from 7 to 13, in increments of 1. The y-axis ranges from 0 to 10, in increments of 2. Most of the points are scattered at the center of the graph. The points lie from 8 to 12 on the horizontal axis and 2 to 9 on the vertical axis.
12 .
A scatter plot shows a curved pattern. The x-axis ranges from 7 to 13, in increments of 1. The y-axis ranges from negative 50 to 200, in increments of 50. The points are scattered in decreasing order and it takes a curved path. Some of the points are as follows: (8, 150), (9, 125), (10, 125), (11, 100), (11.5, 75), and (12, 0). Note: all values are approximate.
13 .
A scatter plot shows points arranged in increasing order. The x-axis ranges from 7 to 13, in increments of 1. The y-axis ranges from 6 to 14, in increments of 1. The points are scattered in increasing order in multiple rows. Some of the points are as follows: (8, 8), (9, 8), (10, 9), (11, 11), and (12, 13). Note: all values are approximate.
14 .
A scatter plot shows points scattered throughout. The x-axis ranges from 7 to 13, in increments of 1. The y-axis ranges from negative 4 to 8, in increments of 2. Most of the points are scattered at the center of the graph, above and below the horizontal axis. Most points lie from 8.5 to 11 on the horizontal axis and negative 2 to 4 on the vertical axis.
For the following exercises, use the data in “MBB2019” on every school that fielded an NCAA Division I men’s basketball team in the 2018–2019 season.
15 .
What is the correlation coefficient for points scored against a team (PA) vs. points scored by the team (PF)? Round to the nearest hundredth.
16 .
What is the equation of the regression line for PA vs. PF?
17 .
Predict the total number of points scored against a team that itself scores 2200 points.
18 .
Georgia Tech scored 2091 points, and had 2130 points scored against them. Is their PA higher or lower than expected? By how much?
19 .
Write a sentence that interprets the slope of the regression line for PA vs. PF.
20 .
What is the correlation coefficient for three-point field goals made (3P) vs. total field goals made (FG)? Round to the nearest hundredth.
21 .
What is the equation of the regression line for 3P vs. FG?
22 .
How many three-point field goals made would you expect for a team that made 1000 total field goals?
23 .
Seton Hall made 888 field goals; of those, 240 were three-point field goals. Did they make more or fewer three-point field goals than expected? How many more or fewer?
24 .
Write a sentence to interpret the slope of the regression line for 3P vs. FG.
For the following exercises, use the datasets “Public” and “Private”, which cover many institutions of higher learning in the United States (public institutions in “Public” and private, non-profit institutions in “Private”). For each school, the datasets give the 75th percentile score on the math section of the SAT (SATM75), the verbal section of the SAT (SATV75), the math section of the ACT (ACTM75), and the English section of the ACT (ACTE75). They also give each school’s admission rate (AdmRate) and total annual cost of attendance (Cost).
25 .
It might seem reasonable to expect the cost to attend a school to go down as the proportion of applicants admitted goes up. Create two scatter plots (one for private schools, one for public) to investigate that hunch. Can we use linear regression to describe that relationship for these? Why or why not?
26 .
Find the correlation between the 75th percentiles of the two sections of the SAT at public schools and at private schools. Which has a stronger relationship?
27 .
What score would we predict falls at the 75th percentile on the verbal section of the SAT at a public school where the 75th percentile on the math section of the SAT is 500?
28 .
What score would we predict falls at the 75th percentile on the verbal section of the SAT at a private school where the 75th percentile on the math section of the SAT is 500?
29 .
Find the slope of the regression line that we would use to predict the 75th percentile SAT math score from the 75th percentile ACT English score at public schools, and write a sentence to interpret that slope.
30 .
Predict the cost of attendance at a public school whose 75th percentile on the SAT verbal section is 700.
31 .
The cost of attendance at DePauw University, a private school, is $62,567. The 75th percentile on the SAT math section is 680. Is DePauw more or less expensive than we would predict based on the SAT math score? By how much?
32 .
The cost of attendance at Coastal Carolina University, a public school, is $24,599. The 75th percentile of ACT English scores at Coastal Carolina is 24. Is the cost higher or lower than we would expect based on the ACT English score? By how much?
33 .
Find the equation of the regression line that we would use to predict the 75th percentile ACT English score from the 75th percentile ACT math score at public institutions.
34 .
Find the equation of the regression line that we would use to predict cost of attendance at public schools using the 75th percentile ACT math score.
35 .
Does the University of Hawai’i at Hilo have a higher or lower 75th percentile verbal SAT score (590) than we’d expect based on its 75th percentile math SAT score (580)? By how much?
36 .
Find the slope of the regression line we would use to estimate cost from the 75th percentile SAT math scores at public institutions. Write a sentence to interpret that slope.
37 .
Find the slope of the regression line we would use to estimate cost from the 75th percentile SAT math scores at private institutions. Write a sentence to interpret that slope.
38 .
Look at the scatter plots that show the relationship between cost and the 75th percentiles of the various test scores at private institutions. Which (if any) of the four exhibit a pattern that rules out analysis using linear regression?
39 .
Look at the scatter plots that show the relationship between cost and the 75th percentiles of the various test scores at public institutions. Which (if any) of the four exhibit a pattern that rules out analysis using linear regression?
40 .
Looking at public institutions, rank the four test scores from highest to lowest in terms of the strength of their relationships to cost.
Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/contemporary-mathematics/pages/1-introduction
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/contemporary-mathematics/pages/1-introduction
Citation information

© Jul 25, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License. The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.