College Algebra

4.3 Fitting Linear Models to Data

Learning Objectives

In this section, you will:

  • Draw and interpret scatter diagrams.
  • Use a graphing utility to find the line of best fit.
  • Distinguish between linear and nonlinear relations.
  • Fit a regression line to a set of data and use the linear model to make predictions.

A professor is attempting to identify trends among final exam scores. His class has a mixture of students, so he wonders if there is any relationship between age and final exam scores. One way for him to analyze the scores is by creating a diagram that relates the age of each student to the exam score received. In this section, we will examine one such diagram known as a scatter plot.

Drawing and Interpreting Scatter Plots

A scatter plot is a graph of plotted points that may show a relationship between two sets of data. If the relationship is from a linear model, or a model that is nearly linear, the professor can draw conclusions using his knowledge of linear functions. Figure 1 shows a sample scatter plot.

Scatter plot, titled 'Final Exam Score VS Age'. The x-axis is the age, and the y-axis is the final exam score. The range of ages are between 20s - 50s, and the range for scores are between upper 50s and 90s.
Figure 1 A scatter plot of age and final exam score variables

Notice this scatter plot does not indicate a linear relationship. The points do not appear to follow a trend. In other words, there does not appear to be a relationship between the age of the student and the score on the final exam.

Example 1

Using a Scatter Plot to Investigate Cricket Chirps

Table 1 shows the number of cricket chirps in 15 seconds, for several different air temperatures, in degrees Fahrenheit.[5] Plot this data, and determine whether the data appears to be linearly related.

Chirps 44 35 20.4 33 31 35 18.5 37 26
Temperature 80.5 70.5 57 66 68 72 52 73.5 53
Table 1 Cricket Chirps vs Air Temperature

Finding the Line of Best Fit

Once we recognize a need for a linear function to model the data, the natural follow-up question is “what is that linear function?” One way to approximate our linear function is to sketch the line that seems to best fit the data. Then we can extend the line until we can verify the y-intercept. We can approximate the slope of the line by extending it until we can estimate the $\frac{\text{rise}}{\text{run}}$.

Example 2

Finding a Line of Best Fit

Find a linear function that fits the data in Table 1 by “eyeballing” a line that seems to fit.

Analysis

This linear equation can then be used to approximate answers to various questions we might ask about the trend.
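
The eyeballed equation itself is not reproduced above, so the sketch below assumes the line T(c) = 30 + 1.2c. That is an inference rather than a quoted result: it is the line consistent with the two eyeballed predictions quoted in this section (66 degrees at 30 chirps, and 8.33 chirps at 40 degrees). The function and constant names are illustrative only.

```python
# Assumed eyeballed model (inferred, not stated in the text): T(c) = 30 + 1.2c
INTERCEPT = 30.0
SLOPE = 1.2

def eyeballed_temperature(chirps):
    """Estimate temperature (degrees F) from chirps counted in 15 seconds."""
    return INTERCEPT + SLOPE * chirps

# Table 1 data
chirps = [44, 35, 20.4, 33, 31, 35, 18.5, 37, 26]
temps = [80.5, 70.5, 57, 66, 68, 72, 52, 73.5, 53]

# Compare each observed temperature with the model's estimate.
for c, t in zip(chirps, temps):
    estimate = eyeballed_temperature(c)
    print(f"chirps={c:5}  observed={t:5}  model={estimate:6.2f}  error={t - estimate:6.2f}")
```

Listing the errors point by point is a quick way to see where an eyeballed line sits above or below the data.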

Recognizing Interpolation or Extrapolation

While the data for most examples does not fall perfectly on the line, the equation is our best guess as to how the relationship will behave outside of the values for which we have data. We use a process known as interpolation when we predict a value inside the domain and range of the data. The process of extrapolation is used when we predict a value outside the domain and range of the data.

Figure 4 compares the two processes for the cricket-chirp data addressed in Example 2. We can see that interpolation would occur if we used our model to predict temperature when the values for chirps are between 18.5 and 44. Extrapolation would occur if we used our model to predict temperature when the values for chirps are less than 18.5 or greater than 44.

There is a difference between making predictions inside the domain and range of values for which we have data and outside that domain and range. Predicting a value outside of the domain and range has its limitations. When our model no longer applies after a certain point, it is sometimes called model breakdown. For example, predicting a cost function for a period of two years may involve examining the data where the input is the time in years and the output is the cost. But if we try to extrapolate a cost when $x = 50$, that is, in 50 years, the model would not apply because we could not account for factors fifty years in the future.

Scatter plot, showing the line of best fit. It is titled 'Cricket Chirps Vs Air Temperature'. The x-axis is 'c, Number of Chirps', and the y-axis is 'T(c), Temperature (F)'.  The area around the scattered points is enclosed in a box labeled: Interpolation.  The area outside of this box is labeled: Extrapolation.
Figure 4 Interpolation occurs within the domain and range of the provided data whereas extrapolation occurs outside.

Interpolation and Extrapolation

Different methods of making predictions are used to analyze data.

The method of interpolation involves predicting a value inside the domain and/or range of the data.
The method of extrapolation involves predicting a value outside the domain and/or range of the data.
Model breakdown occurs at the point when the model no longer applies.
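
As a minimal sketch of these definitions, the helper below labels a prediction by checking whether the known value falls within the span of the observed values for that variable. The function name `classify_prediction` is hypothetical, and the check covers only the variable being supplied, not both domain and range at once.

```python
def classify_prediction(observed_values, known_value):
    """Label a prediction by whether known_value lies within the observed values."""
    if min(observed_values) <= known_value <= max(observed_values):
        return "interpolation"
    return "extrapolation"

# Table 1 data
chirps = [44, 35, 20.4, 33, 31, 35, 18.5, 37, 26]
temps = [80.5, 70.5, 57, 66, 68, 72, 52, 73.5, 53]

# Predicting temperature from 30 chirps: 30 lies between 18.5 and 44.
print(classify_prediction(chirps, 30))
# Predicting chirps from 40 degrees: 40 lies below the lowest observed temperature.
print(classify_prediction(temps, 40))
```

A label of "extrapolation" does not by itself mean the model has broken down. It only flags that the data offers no direct support for the prediction.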

Example 3

Understanding Interpolation and Extrapolation

Use the cricket data from Table 1 to answer the following questions:

  1. Would predicting the temperature when crickets are chirping 30 times in 15 seconds be interpolation or extrapolation? Make the prediction, and discuss whether it is reasonable.
  2. Would predicting the number of chirps crickets will make at 40 degrees be interpolation or extrapolation? Make the prediction, and discuss whether it is reasonable.

Analysis

Our model predicts the crickets would chirp 8.33 times in 15 seconds. While this might be possible, we have no reason to believe our model is valid outside the domain and range. In fact, generally crickets stop chirping altogether below around 50 degrees.
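
The value 8.33 can be recovered by solving for the number of chirps. The working below assumes the eyeballed model is T(c) = 30 + 1.2c, which is an inference from the predictions quoted in this section, since the equation itself is not shown above.

```latex
\begin{aligned}
40 &= 30 + 1.2c \\
10 &= 1.2c \\
c &= \frac{10}{1.2} \approx 8.33
\end{aligned}
```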

Try It #1

According to the data from Table 1, what temperature can we predict it is if we counted 20 chirps in 15 seconds?

Finding the Line of Best Fit Using a Graphing Utility

While eyeballing a line works reasonably well, there are statistical techniques for fitting a line to data that minimize the differences between the line and data values.[6] One such technique is called least squares regression and can be computed by many graphing calculators, spreadsheet software, statistical software, and many web-based calculators.[7] Least squares regression is one means to determine the line that best fits the data, and here we will refer to this method as linear regression.

How To

Given data of input and corresponding outputs from a linear function, find the best fit line using linear regression.

  1. Enter the input in List 1 (L1).
  2. Enter the output in List 2 (L2).
  3. On a graphing utility, select Linear Regression (LinReg).
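
For readers without a graphing utility, the same steps can be sketched in a few lines of Python. This uses the standard closed-form least squares formulas for slope and intercept. It is not a description of how any particular calculator implements LinReg, and the function name `linear_regression` is illustrative.

```python
def linear_regression(xs, ys):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Step 1: the inputs (List 1), chirps in 15 seconds from Table 1.
chirps = [44, 35, 20.4, 33, 31, 35, 18.5, 37, 26]
# Step 2: the outputs (List 2), temperature in degrees Fahrenheit from Table 1.
temps = [80.5, 70.5, 57, 66, 68, 72, 52, 73.5, 53]
# Step 3: run the regression.
intercept, slope = linear_regression(chirps, temps)
print(f"T(c) = {intercept:.3f} + {slope:.3f}c")
```

With the Table 1 data, the result should agree, to three decimal places, with the regression line T(c) = 30.281 + 1.143c reported in this section.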

Example 4

Finding a Least Squares Regression Line

Find the least squares regression line using the cricket-chirp data in Table 2.

Analysis

Notice that this line is quite similar to the equation we “eyeballed” but should fit the data better. Notice also that using this equation would change our prediction for the temperature when hearing 30 chirps in 15 seconds from 66 degrees to:

$$
\begin{aligned}
T(30) &= 30.281 + 1.143(30) \\
&= 64.571 \\
&\approx 64.6 \text{ degrees}
\end{aligned}
$$

The graph of the scatter plot with the least squares regression line is shown in Figure 6.

Scatter plot, showing the line of best fit: T(c) = 30.281 + 1.143c. It is titled 'Cricket Chirps vs. Air Temperature'. The x-axis is 'c, Number of Chirps', and the y-axis is 'T(c), Temperature (F)'.
Figure 6
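
One way to check the claim that the regression line fits better is to compare the sum of squared vertical errors for the two lines. The sketch below assumes the eyeballed line is T(c) = 30 + 1.2c (inferred from the predictions quoted in this section, not stated explicitly) and uses the rounded regression coefficients given above. The helper name is illustrative.

```python
# Table 1 data
chirps = [44, 35, 20.4, 33, 31, 35, 18.5, 37, 26]
temps = [80.5, 70.5, 57, 66, 68, 72, 52, 73.5, 53]

def sum_squared_errors(intercept, slope, xs, ys):
    """Sum of squared vertical distances between the data and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Assumed eyeballed line (an inference): T(c) = 30 + 1.2c
sse_eyeballed = sum_squared_errors(30.0, 1.2, chirps, temps)
# Regression line from this section: T(c) = 30.281 + 1.143c
sse_regression = sum_squared_errors(30.281, 1.143, chirps, temps)

print(f"eyeballed:  {sse_eyeballed:.2f}")
print(f"regression: {sse_regression:.2f}")
```

The smaller total belongs to the regression line, which is what "least squares" means: no other line gives a smaller sum for this data.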

Q&A

Will there ever be a case where two different lines will serve as the best fit for the data?

No. There is only one best fit line.

Distinguishing Between Linear and Nonlinear Models

As we saw above with the cricket-chirp model, some data exhibit strong linear trends, but other data, like the final exam scores plotted by age, are clearly nonlinear. Most calculators and computer software can also provide us with the correlation coefficient, which is a measure of how closely the line fits the data. Many graphing calculators require the user to turn on a "diagnostic on" setting to find the correlation coefficient, which mathematicians label as $r$. The correlation coefficient provides an easy way to get an idea of how close to a line the data falls.

We should compute the correlation coefficient only for data that follows a linear pattern or to determine the degree to which a data set is linear. If the data exhibits a nonlinear pattern, the correlation coefficient for a linear regression is meaningless. To get a sense for the relationship between the value of $r$ and the graph of the data, Figure 7 shows some large data sets with their correlation coefficients. Remember, for all plots, the horizontal axis shows the input and the vertical axis shows the output.

Correlation coefficients values range from -1.0 - 1.0.  Collections of dots representing an example of each kind of correlation coefficient are plotted underneath them.  The closer to 1.0 the more the points are grouped tightly to form a line in the positive direction.  The closer to -1.0 the more the points are grouped tightly to form a line in the negative direction.  The closer to 0 the points are very scattered and do not form a line.  Several shapes are displayed at the bottom row, none of which are lines, but all of them have values of 0.
Figure 7 Plotted data and related correlation coefficients. (credit: “DenisBoigelot,” Wikimedia Commons)
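
To illustrate why a correlation coefficient near 0 does not rule out a relationship, the sketch below computes r for a made-up data set that lies exactly on a parabola. The data and the function name are hypothetical and are not taken from Figure 7. The formula is the usual Pearson correlation coefficient.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: every point lies exactly on y = x^2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(f"r = {r:.3f}")
```

Here the output depends perfectly on the input, yet r is 0 because the pattern is symmetric rather than linear.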

Correlation Coefficient

The correlation coefficient is a value, $r$, between –1 and 1.

  • $r > 0$ suggests a positive (increasing) relationship.
  • $r < 0$ suggests a negative (decreasing) relationship.
  • The closer the value is to 0, the more scattered the data.
  • The closer the value is to 1 or –1, the less scattered the data.

Example 5

Finding a Correlation Coefficient

Calculate the correlation coefficient for cricket-chirp data in Table 1.
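
A calculator reports r directly once diagnostics are on. As an alternative sketch, the calculation can be written out with the usual Pearson formula, assuming nothing beyond the Table 1 data. The function name is illustrative.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Table 1 data
chirps = [44, 35, 20.4, 33, 31, 35, 18.5, 37, 26]
temps = [80.5, 70.5, 57, 66, 68, 72, 52, 73.5, 53]

r = correlation(chirps, temps)
print(f"r = {r:.4f}")
```

A value this close to 1 indicates a strong positive linear trend in the cricket-chirp data.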

Fitting a Regression Line to a Set of Data

Once we determine that a set of data is linear using the correlation coefficient, we can use the regression line to make predictions. As we learned above, a regression line is a line that is closest to the data in the scatter plot, which means that only one such line is a best fit for the data.

Example 6

Using a Regression Line to Make Predictions

Gasoline consumption in the United States has been steadily increasing. Consumption data from 1994 to 2004 is shown in Table 3.[8] Determine whether the trend is linear, and if so, find a model for the data. Use the model to predict the consumption in 2008.

Year '94 '95 '96 '97 '98 '99 '00 '01 '02 '03 '04
Consumption (billions of gallons) 113 116 118 119 123 125 126 128 131 133 136
Table 3

The scatter plot of the data, including the least squares regression line, is shown in Figure 8.

Scatter plot, showing the line of best fit. It is titled 'Gas Consumption VS Year'. The x-axis is 'Year After 1994', and the y-axis is 'Gas Consumption (billions of gallons)'. The points are strongly positively correlated and the line of best fit goes through most of the points completely.
Figure 8
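
The regression and the 2008 prediction can be sketched as follows. As in Figure 8, the input is taken to be years after 1994. The closed-form least squares formulas are used here as a stand-in for a graphing utility, and the names are illustrative.

```python
def linear_regression(xs, ys):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Table 3 data: years after 1994 (0 through 10) and consumption in billions of gallons.
years_after_1994 = list(range(11))
consumption = [113, 116, 118, 119, 123, 125, 126, 128, 131, 133, 136]

intercept, slope = linear_regression(years_after_1994, consumption)
# 2008 is 14 years after 1994, which lies outside the data, so this is an extrapolation.
prediction_2008 = intercept + slope * (2008 - 1994)

print(f"C(t) = {intercept:.3f} + {slope:.3f}t")
print(f"Predicted consumption in 2008: {prediction_2008:.1f} billion gallons")
```

Because 2008 lies beyond the last recorded year, the prediction holds only if the trend continues unchanged.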

Try It #2

Use the model we created using technology in Example 6 to predict the gas consumption in 2011. Is this an interpolation or an extrapolation?

4.3 Section Exercises

Verbal

1.

Describe what it means if there is a model breakdown when using a linear model.

2.

What is interpolation when using a linear model?

3.

What is extrapolation when using a linear model?

4.

Explain the difference between a positive and a negative correlation coefficient.

5.

Explain how to interpret the absolute value of a correlation coefficient.

Algebraic

6.

A regression was run to determine whether there is a relationship between hours of TV watched per day ($x$) and number of sit-ups a person can do ($y$). The results of the regression are given below. Use this to predict the number of sit-ups a person who watches 11 hours of TV can do.

$y = ax + b$, $a = -1.341$, $b = 32.234$, $r = -0.896$
7.

A regression was run to determine whether there is a relationship between the diameter of a tree ($x$, in inches) and the tree’s age ($y$, in years). The results of the regression are given below. Use this to predict the age of a tree with diameter 10 inches.

$y = ax + b$, $a = 6.301$, $b = -1.044$, $r = -0.970$

For the following exercises, draw a scatter plot for the data provided. Does the data appear to be linearly related?

8.
0 2 4 6 8 10
–22 –19 –15 –11 –6 –2
9.
1 2 3 4 5 6
46 50 59 75 100 136
10.
100 250 300 450 600 750
12 12.6 13.1 14 14.5 15.2
11.
1 3 5 7 9 11
1 9 28 65 125 216
12.

For the following data, draw a scatter plot. If we wanted to know when the population would reach 15,000, would the answer involve interpolation or extrapolation? Eyeball the line, and estimate the answer.

Year Population
1990 11,500
1995 12,100
2000 12,700
2005 13,000
2010 13,750
13.

For the following data, draw a scatter plot. If we wanted to know when the temperature would reach 28°F, would the answer involve interpolation or extrapolation? Eyeball the line and estimate the answer.

Temperature,°F 16 18 20 25 30
Time, seconds 46 50 54 55 62

Graphical

For the following exercises, match each scatterplot with one of the four specified correlations in Figure 9 and Figure 10.

Side-by-side scatter plots.  The first is a scattered correlation in the positive direction.  The second is a scattered correlation in the negative direction
Figure 9
Side-by-side scatter plots.  The first has a strong negative correlation with all the points spaced out evenly near the top and center, but more spread out near the bottom.  The second has a strong positive correlation, with the points more spread out near the bottom and closer together near the center and top.
Figure 10
14.

$r = 0.95$

15.

$r = -0.89$

16.

$r = -0.26$

17.

$r = -0.39$

For the following exercises, draw a best-fit line for the plotted data.

18.
Scatter plot with a domain of 0 to 10 and a range of 4 to 9.  The points are at (0,5); (2.1,4.2); (3.5,6); (4.5,6.5); (5.5,6.8); (7,7.4); (8,8.5); (9,8); and (10,9).
19.
Scatter plot with a domain of 0 to 10 and a range of -1 to 4.  The points are at (0,1.5); (1.5, -0.1); (2.1,1.9); (3.4, 1.5); (4.5,2.5); (5.8,2.2); (6.8,3.8); (7.8,3.6); (8.8,2); and (10,2.4).
20.
Scatter plot with a domain of 0 to 10 and range of 0 to 7 with the points: (0,7.3); (1,7); (2.2,6); (3.6,7); (4.8,6.2); (5.8,4); (6.6,3.8); (7.9,2.4); (8.8,2); and (10,0.1).
21.
Scatter plot with a domain of 0 to 10 and a range of 2 to 6 with the points: (0,2.1); (1,3.9); (2.1,3.6); (3.6,3.9); (4.4,4); (5.6,4.2); (6.8,5); (7.8,5); (9,5.6); and (10,6).

Numeric

22.

The U.S. Census tracks the percentage of persons 25 years or older who are college graduates. That data for several years is given in Table 4.[9] Determine whether the trend appears linear. If so, and assuming the trend continues, in what year will the percentage exceed 35%?

Year Percent Graduates
1990 21.3
1992 21.4
1994 22.2
1996 23.6
1998 24.4
2000 25.6
2002 26.7
2004 27.7
2006 28
2008 29.4
Table 4
23.

The U.S. import of wine (in hectoliters) for several years is given in Table 5. Determine whether the trend appears linear. If so, and assuming the trend continues, in what year will imports exceed 12,000 hectoliters?

Year Imports
1992 2665
1994 2688
1996 3565
1998 4129
2000 4584
2002 5655
2004 6549
2006 7950
2008 8487
2009 9462
Table 5
24.

Table 6 shows the year and the number of people unemployed in a particular city for several years. Determine whether the trend appears linear. If so, and assuming the trend continues, in what year will the number of unemployed reach 5?

Year Number Unemployed
1990 750
1992 670
1994 650
1996 605
1998 550
2000 510
2002 460
2004 420
2006 380
2008 320
Table 6

Technology

For the following exercises, use each set of data to calculate the regression line using a calculator or other technology tool, and determine the correlation coefficient to 3 decimal places of accuracy.

25.
x 8 15 26 31 56
y 23 41 53 72 103
26.
x 5 7 10 12 15
y 4 12 17 22 24
27.
x y x y
3 21.9 10 18.54
4 22.22 11 15.76
5 22.74 12 13.68
6 22.26 13 14.1
7 20.78 14 14.02
8 17.6 15 11.94
9 16.52 16 12.76
28.
x y
4 44.8
5 43.1
6 38.8
7 39
8 38
9 32.7
10 30.1
11 29.3
12 27
13 25.8
29.
x 21 25 30 31 40 50
y 17 11 2 –1 –18 –40
30.
x y
100 2000
80 1798
60 1589
55 1580
40 1390
20 1202
31.
x 900 988 1000 1010 1200 1205
y 70 80 82 84 105 108

Extensions

32.

Graph $f(x) = 0.5x + 10$. Pick a set of five ordered pairs using inputs $x = -2, 1, 5, 6, 9$ and use linear regression to verify that the function is a good fit for the data.

33.

Graph $f(x) = -2x - 10$. Pick a set of five ordered pairs using inputs $x = -2, 1, 5, 6, 9$ and use linear regression to verify the function.

For the following exercises, consider this scenario: The profit of a company decreased steadily over a ten-year span. The following ordered pairs show the number of units sold in hundreds and the profit in thousands of dollars over the ten-year span, (number of units sold, profit), for specific recorded years:

$(46, 1600), (48, 1550), (50, 1505), (52, 1540), (54, 1495)$.

34.

Use linear regression to determine a function $P$ where the profit in thousands of dollars depends on the number of units sold in hundreds.

35.

Find to the nearest tenth and interpret the x-intercept.

36.

Find to the nearest tenth and interpret the y-intercept.

Real-World Applications

For the following exercises, consider this scenario: The population of a city increased steadily over a ten-year span. The following ordered pairs show the population and the year over the ten-year span, (population, year), for specific recorded years:

$(2500, 2000), (2650, 2001), (3000, 2003), (3500, 2006), (4200, 2010)$

37.

Use linear regression to determine a function $y$, where the year depends on the population. Round to three decimal places of accuracy.

38.

Predict when the population will hit 8,000.

For the following exercises, consider this scenario: The profit of a company increased steadily over a ten-year span. The following ordered pairs show the number of units sold in hundreds and the profit in thousands of dollars over the ten-year span, (number of units sold, profit), for specific recorded years:

$(46, 250), (48, 305), (50, 350), (52, 390), (54, 410)$.

39.

Use linear regression to determine a function y, where the profit in thousands of dollars depends on the number of units sold in hundreds.

40.

Predict when the profit will exceed one million dollars.

For the following exercises, consider this scenario: The profit of a company decreased steadily over a ten-year span. The following ordered pairs show the number of units sold in hundreds and the profit in thousands of dollars over the ten-year span, (number of units sold, profit), for specific recorded years:

$(46, 250), (48, 225), (50, 205), (52, 180), (54, 165)$.

41.

Use linear regression to determine a function y, where the profit in thousands of dollars depends on the number of units sold in hundreds.

42.

Predict when the profit will dip below the $25,000 threshold.

Footnotes

  • [5] Selected data from http://classic.globe.gov/fsl/scientistsblog/2007/10/. Retrieved Aug 3, 2010.
  • [6] Technically, the method minimizes the sum of the squared differences in the vertical direction between the line and the data values.
  • [7] For example, http://www.shodor.org/unchem/math/lls/leastsq.html
  • [8] http://www.bts.gov/publications/national_transportation_statistics/2005/html/table_04_10.html
  • [9] Based on data from http://www.census.gov/hhes/socdemo/education/data/cps/historical/index.html. Accessed 5/1/2014.
Citation/Attribution

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Attribution information
  • If you are redistributing all or part of this book in a print format, then you must include on every physical page the following attribution:
    Access for free at https://openstax.org/books/college-algebra/pages/1-introduction-to-prerequisites
  • If you are redistributing all or part of this book in a digital format, then you must include on every digital page view the following attribution:
    Access for free at https://openstax.org/books/college-algebra/pages/1-introduction-to-prerequisites
Citation information

© Dec 8, 2021 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License. The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.