Barbara Illowsky; Susan Dean

Data rarely fit a straight line exactly. Usually, you must be satisfied with rough predictions. Typically, you have a set of data whose scatter plot appears to fit a straight line. The line that best fits such data is called a line of best fit or least-squares regression line.

Collaborative Exercise

If you know a person’s pinky (smallest) finger length, do you think you could predict that person’s height? Collect data from your class (pinky finger length and height, both in inches). The independent variable, x, is pinky finger length and the dependent variable, y, is height. For each set of data, plot the points on graph paper. Make your graph big enough and use a ruler. Then, by eye, draw a line that appears to fit the data. For your line, pick two convenient points and use them to find the slope of the line. Find the y-intercept of the line by extending your line so it crosses the y-axis. Using the slope and the y-intercept, write your equation of best fit. Do you think everyone will have the same equation? Why or why not? According to your equation, what is the predicted height for a pinky length of 2.5 inches?

Example 12.5

A random sample of 11 statistics students produced the data in Table 12.1, where x is the third exam score out of 80 and y is the final exam score out of 200. Can you predict the final exam score of a random student if you know the third exam score?

x (third exam score)	y (final exam score)
65	175
67	133
71	185
71	163
66	126
75	198
67	153
70	163
71	159
69	151
69	159

Table 12.1

This is a scatter plot of the data provided. The third exam score is plotted on the x-axis, and the final exam score is plotted on the y-axis. The points form a strong, positive, linear pattern.

Figure 12.5 Using the x- and y-coordinates in the table, we plot the points on a graph to create the scatter plot showing the scores on the final exam based on scores from the third exam.

Try It 12.5

SCUBA divers have maximum dive times they cannot exceed when going to different depths. The data in Table 12.2 show different depths in feet, with the maximum dive times in minutes. Use your calculator to find the least squares regression line and predict the maximum dive time for 110 feet.

x (depth)	y (maximum dive time)
50	80
60	55
70	45
80	35
90	25
100	22

Table 12.2

The third exam score, x, is the independent variable, and the final exam score, y, is the dependent variable. We will plot a regression line that best fits the data. If each of you were to fit a line by eye, you would draw different lines. We can obtain a line of best fit using either the median–median line approach or by calculating the least-squares regression line.

Let’s first find the line of best fit for the relationship between the third exam score and the final exam score using the median–median line approach. The data below are from Example 12.5, with the ordered pairs listed in order of their x values. If multiple data points have the same x value, they are listed in order from least to greatest y (see the data values where x = 71). We first divide our scores into three groups of approximately equal numbers of x values per group. The first and third groups have the same number of x values. We must remember first to put the x values in ascending order. The corresponding y values are then recorded. However, to find the median, we first must rearrange the y values in each group from the least value to the greatest value. Table 12.3 shows the correct ordering of the x values but does not show a reordering of the y values.

x (third exam score)	y (final exam score)
65	175
66	126
67	133
67	153
69	151
69	159
70	163
71	159
71	163
71	185
75	198

Table 12.3

With this set of data, the first and last groups each have four x values and four corresponding y values. The second group has three x values and three corresponding y values. We need to organize the x and y values per group and find the median x and y values for each group. Let’s now write out our y values for each group in ascending order. For group 1, the y values in order are 126, 133, 153, and 175. For group 2, the y values are already in order. For group 3, the y values are also already in order. We can represent these data as shown in Table 12.4, but notice that we have broken the ordered pairs; (65, 126) is not a data point in our original set:

Group	x (third exam score)	y (final exam score)	Median x value	Median y value
1	65 66 67 67	126 133 153 175	66.5	143
2	69 69 70	151 159 163	69	159
3	71 71 71 75	159 163 185 198	71	174

Table 12.4

When this is completed, we can write the ordered pairs for the median values. This allows us to find the slope and y-intercept of the median–median line.

The ordered pairs are (66.5, 143), (69, 159), and (71, 174).

The slope can be calculated using the formula $m = \frac{y_{2} - y_{1}}{x_{2} - x_{1}} .$ Substituting the median x and y values from the first and third groups gives $m = \frac{174 - 143}{71 - 66.5},$ which simplifies to $m \approx 6.9 .$

The y-intercept may be found using the formula $b = \frac{Σ y - m Σ x}{3}$ , which means the quantity of the sum of the median y values minus the slope times the sum of the median x values divided by three.

The sum of the median x values is 206.5, and the sum of the median y values is 476. Substituting these sums and the slope into the formula gives $b = \frac{476 - 6.9 (206.5)}{3}$ , which simplifies to $b \approx - 316.3 .$

The line of best fit is represented as $y = m x + b .$

Thus, the equation can be written as y = 6.9x − 316.3.
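The manual steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `median_median_line` and the grouping rule for sample sizes not divisible by three are choices made here, not part of the text. Because the sketch keeps the slope unrounded (about 6.89), its intercept comes out near −315.5 rather than −316.3.

```python
from statistics import median

# (third exam score, final exam score) pairs from Table 12.1
data = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
        (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]

def median_median_line(points):
    """Return (slope, intercept) of the median-median line (illustrative helper)."""
    pts = sorted(points)                      # order by x, then by y for tied x
    n = len(pts)
    base, rem = divmod(n, 3)
    # Assumption: the first and third groups always have the same size.
    outer = base + (1 if rem == 2 else 0)
    middle = n - 2 * outer
    groups = [pts[:outer], pts[outer:outer + middle], pts[outer + middle:]]
    # Medians of x and y are taken separately within each group.
    medians = [(median(x for x, _ in g), median(y for _, y in g)) for g in groups]
    (x1, y1), _, (x3, y3) = medians
    slope = (y3 - y1) / (x3 - x1)             # uses the first and third groups
    sum_x = sum(mx for mx, _ in medians)
    sum_y = sum(my for _, my in medians)
    intercept = (sum_y - slope * sum_x) / 3   # b = (sum y - m * sum x) / 3
    return slope, intercept

slope, intercept = median_median_line(data)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Rounding the slope to 6.9 before computing the intercept, as in the hand calculation, is what produces −316.3.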

The median–median line may also be found using your graphing calculator. You can enter the x and y values into two separate lists; choose Stat, Calc, Med-Med, and press Enter. The slope, a, and y-intercept, b, will be provided. The calculator shows a slight deviation from the previous manual calculation as a result of rounding. Rounding to the nearest tenth, the calculator gives the median–median line of $y = 6.9 x - 315.5 .$

Each data point is of the form (x, y), and each point of the line of best fit using least-squares linear regression has the form (x, ŷ).

The ŷ is read y hat and is the estimated value of y. It is the value of y obtained using the regression line. It is not generally equal to y from data, but it is still important because it can help make predictions for other values.

The scatter plot of exam scores with a line of best fit. One data point is highlighted along with the corresponding point on the line of best fit. Both points have the same x-coordinate. The distance between these two points illustrates how to compute the sum of squared errors.

Figure 12.6

The term y₀ – ŷ₀ = ε₀ is called the error or residual. It is not an error in the sense of a mistake. The absolute value of a residual measures the vertical distance between the actual value of y and the estimated value of y. In other words, it measures the vertical distance between the actual data point and the predicted point on the line, or it measures how far the estimate is from the actual data value.

If the observed data point lies above the line, the residual is positive and the line underestimates the actual data value for y. If the observed data point lies below the line, the residual is negative and the line overestimates that actual data value for y.

In Figure 12.6, y₀ – ŷ₀ = ε₀ is the residual for the point shown. Here the point lies above the line and the residual is positive.

ε = the Greek letter epsilon

For each data point, you can calculate the residuals or errors, y_i – ŷ_i = ε_i for i = 1, 2, 3, . . . , 11.

Each |ε| is a vertical distance.

For the example about the third exam scores and the final exam scores for the 11 statistics students, there are 11 data points. Therefore, there are 11 ε values. If you square each ε and add them, you get the sum of ε squared from i = 1 to i = 11, as shown below.

${(ε_{1})}^{2} + {(ε_{2})}^{2} + \cdots + {(ε_{11})}^{2} = \sum_{i = 1}^{11} {(ε_{i})}^{2} .$

This is called the sum of squared errors (SSE).

Using calculus, you can determine the values of a and b that make the SSE a minimum. When you make the SSE a minimum, you have determined the points that are on the line of best fit. It turns out that the line of best fit has the equation

ŷ = a + b x

where

$a = \bar{y} - b \bar{x}$

and $b = \frac{\sum (x - \bar{x}) (y - \bar{y})}{\sum {(x - \bar{x})}^{2}}$ .

The sample means of the x values and the y values are $\bar{x}$ and $\bar{y}$ , respectively. The best-fit line always passes through the point $(\bar{x}, \bar{y})$ .

The slope (b) can be written as $b = r (\frac{s_{y}}{s_{x}})$ where s_y = the standard deviation of the y values and s_x = the standard deviation of the x values. r is the correlation coefficient, which shows the relationship between the x and y values. This will be discussed in more detail in the next section.
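As a rough check of these formulas, the following Python sketch computes a and b for the third exam/final exam data directly from the definitions, then confirms that the line passes through $(\bar{x}, \bar{y})$ and that $b = r (\frac{s_{y}}{s_{x}})$ gives the same slope. The variable names are illustrative choices, not notation from the text.

```python
from math import sqrt
from statistics import mean, stdev

# Data from Table 12.1
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

x_bar, y_bar = mean(x), mean(y)

# Sums of cross-products and squares of deviations from the means
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b = s_xy / s_xx            # slope
a = y_bar - b * x_bar      # y-intercept

# Alternative slope formula: b = r * (s_y / s_x)
r = s_xy / sqrt(s_xx * s_yy)
b_alt = r * stdev(y) / stdev(x)

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"line at x_bar: {a + b * x_bar:.4f}, y_bar: {y_bar:.4f}")
```

The values agree with the calculator output shown later in this section (a ≈ −173.51, b ≈ 4.83).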

Least-Squares Criteria for Best Fit

The process of fitting the best-fit line is called linear regression. We assume that the data are scattered about a straight line. To find that line, we minimize the sum of the squared errors (SSE), or make it as small as possible. Any other line you might choose would have a higher SSE than the best-fit line. This best-fit line is called the least-squares regression line.
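To illustrate the least-squares criterion, the sketch below computes the SSE for the least-squares line and compares it with the SSE of a few other candidate lines, including the median–median line y = 6.9x − 316.3 found earlier. The helper names `sse` and `least_squares` are hypothetical, introduced only for this example.

```python
# Data from Table 12.1
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

def sse(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
         / sum((xi - x_bar) ** 2 for xi in xs))
    return y_bar - b * x_bar, b

a, b = least_squares(x, y)
sse_ls = sse(a, b)               # least-squares line
sse_mm = sse(-316.3, 6.9)        # median-median line from the hand calculation
sse_shifted = sse(a + 1, b)      # least-squares line shifted up by 1 point
sse_tilted = sse(a, b + 0.1)     # least-squares line with a steeper slope

print(f"least squares: {sse_ls:.1f}")
print(f"median-median: {sse_mm:.1f}")
print(f"shifted:       {sse_shifted:.1f}")
print(f"tilted:        {sse_tilted:.1f}")
```

Each of the alternative lines gives a larger SSE than the least-squares line, which is what the criterion requires.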

Note

Computer spreadsheets, statistical software, and many calculators can quickly calculate the best-fit line and create the graphs. The calculations tend to be tedious if done by hand. Instructions to use the TI-83, TI-83+, and TI-84+ calculators to find the best-fit line and create a scatter plot are shown at the end of this section.

Third Exam vs. Final Exam Example

The graph of the line of best fit for the third exam/final exam example is as follows:

The scatter plot of exam scores with a line of best fit. One data point is highlighted along with the corresponding point on the line of best fit.

Figure 12.7

The least-squares regression line (best-fit line) for the third exam/final exam example has the equation

ŷ = −173.51 + 4.83 x .

Understanding and Interpreting the y-intercept

The y-intercept, a, of the line describes where the plot line crosses the y-axis. The y-intercept of the best-fit line tells us the best value of the relationship when x is zero. In some cases, it does not make sense to figure out what y is when x = 0. For example, in the third exam vs. final exam example, the y-intercept occurs when the third exam score, or x, is zero. Since all the scores are grouped around a passing grade, there is no need to figure out what the final exam score, or y, would be when the third exam was zero.

However, the y-intercept is very useful in many cases. For many examples in science, the y-intercept gives the baseline reading when the experimental conditions aren’t applied to an experimental system. This baseline indicates how much the experimental condition affects the system. It could also be used to ensure that equipment and measurements are calibrated properly before starting the experiment.

In biology, the concentration of proteins in a sample can be measured using a chemical assay that changes color depending on how much protein is present. The more protein present, the darker the color. The amount of color can be measured by the absorbance reading. Table 12.5 shows the expected absorbance readings at different protein concentrations. This is called a standard curve for the assay.

Concentration (mM)	Absorbance (mAU)
125	0.021
250	0.023
500	0.068
750	0.086
1,000	0.105
1,500	0.124
2,000	0.146

Table 12.5

The scatter plot in Figure 12.8 includes the line of best fit.

This shows a scatter plot with a line of best fit. The scatter plot has points plotted at (125, 0.021), (250, 0.023), (500, 0.068), (750, 0.086), (1000, 0.105), (1500, 0.124), and (2000, 0.146), and is labeled y = 7E-05x + 0.0226.

Figure 12.8

The y-intercept of this line occurs at 0.0226 mAU. This means the assay gives a reading of 0.0226 mAU when there is no protein present. That is, it is the baseline reading that can be attributed to something else, which, in this case, is some other non-protein chemicals that are absorbing light. We can tell that this line of best fit is reasonable because the y-intercept is small, close to zero. When there is no protein present in the sample, we expect the absorbance to be very small, or close to zero, as well.
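The line shown in Figure 12.8 can be reproduced from Table 12.5 with the same least-squares formulas used for the exam data. The sketch below is one way to do it; the helper `fit_line` and the variable names are assumptions made for this illustration.

```python
# Standard curve data from Table 12.5
concentration = [125, 250, 500, 750, 1000, 1500, 2000]             # x
absorbance = [0.021, 0.023, 0.068, 0.086, 0.105, 0.124, 0.146]     # y

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
             / sum((xi - x_bar) ** 2 for xi in xs))
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line(concentration, absorbance)

# Format the equation in the same style as the label in Figure 12.8
print(f"y = {slope:.0e}x + {intercept:.4f}")
```

The intercept, about 0.0226, is the baseline absorbance the line predicts when the concentration is zero.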

Understanding Slope

The slope of the line, b, describes how changes in the variables are related. It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English.

Interpretation of the Slope: The slope of the best-fit line tells us how the dependent variable (y) changes for every one unit increase in the independent (x) variable, on average.

Third Exam vs. Final Exam Example
Slope: The slope of the line is b = 4.83.
Interpretation: For a 1-point increase in the score on the third exam, the final exam score increases by 4.83 points, on average.

Using the TI-83, 83+, 84, 84+ Calculator

Using the Linear Regression T Test: LinRegTTest

In the STAT list editor, enter the x data in list L1 and the y data in list L2, paired so that the corresponding (x, y) values are next to each other in the lists. (If a particular pair of values is repeated, enter it as many times as it appears in the data.)
On the STAT TESTS menu, scroll down and select LinRegTTest. (Be careful to select LinRegTTest. Some calculators may also have a different item called LinRegTInt.)
On the LinRegTTest input screen, enter Xlist: L1, Ylist: L2, and Freq: 1.
On the next line, at the prompt β or ρ, highlight ≠ 0 and press ENTER.
Leave the line for RegEQ: blank.
Highlight Calculate and press ENTER.

1. Image of the calculator input screen for LinRegTTest, with input matching the instructions above. 2. Image of the corresponding calculator output screen for LinRegTTest. The output screen shows: Line 1. LinRegTTest; Line 2. y = a + bx; Line 3. beta does not equal 0 and rho does not equal 0; Line 4. t = 2.657560155; Line 5. df = 9; Line 6. a = –173.513363; Line 7. b = 4.827394209; Line 8. s = 16.41237711; Line 9. r squared = .4396931104; Line 10. r = .663093591

Figure 12.9

The output screen contains a lot of information. For now, let’s focus on a few items from the output and return to the other items later.
The second line says y = a + bx. Scroll down to find the values a = –173.513 and b = 4.8273.

The equation of the best-fit line is ŷ = –173.51 + 4.83x.
The two items at the bottom are r² = .43969 and r = .663. For now, just note where to find these values; we examine them in the next two sections.

Graphing the Scatter Plot and Regression Line

We are assuming the x data are already entered in list L1 and the y data are in list L2.
Press 2nd STATPLOT ENTER to use Plot 1.
On the input screen for PLOT 1, highlight On, and press ENTER.
For TYPE, highlight the first icon, which is the scatter plot, and press ENTER.
Indicate Xlist: L1 and Ylist: L2.
For Mark, it does not matter which symbol you highlight.
Press the ZOOM key and then the number 9 (for menu item ZoomStat); the calculator fits the window to the data.
To graph the best-fit line, press the Y= key and type the equation –173.5 + 4.83X into equation Y1. (The X key is immediately left of the STAT key.) Press ZOOM 9 again to graph it.
Optional: If you want to change the viewing window, press the WINDOW key. Enter your desired window using Xmin, Xmax, Ymin, and Ymax.

NOTE

Another way to graph the line after you create a scatter plot is to use LinRegTTest.

Make sure you have done the scatter plot. Check it on your screen.
Go to LinRegTTest and enter the lists.
At RegEq, press VARS and arrow over to Y-VARS. Press 1 for 1:Function. Press 1 for 1:Y1. Then, arrow down to Calculate and do the calculation for the line of best fit.
Press Y= (you will see the regression equation).
Press GRAPH, and the line will be drawn.

The Correlation Coefficient r

Besides looking at the scatter plot and seeing that a line seems reasonable, how can you determine whether the line is a good predictor? Use the correlation coefficient as another indicator (besides the scatter plot) of the strength of the relationship between x and y.

The correlation coefficient, r, developed by Karl Pearson during the early 1900s, is numeric and provides a measure of the strength and direction of the linear association between the independent variable x and the dependent variable y.

If you suspect a linear relationship between x and y, then r can measure the strength of the linear relationship.

What the Value of r Tells Us

The value of r is always between –1 and +1. In other words, –1 ≤ r ≤ 1.
The size of the correlation r indicates the strength of the linear relationship between x and y. Values of r close to –1 or to +1 indicate a stronger linear relationship between x and y.
If r = 0, there is absolutely no linear relationship between x and y (no linear correlation).
If r = 1, there is perfect positive correlation. If r = –1, there is perfect negative correlation. In both these cases, all the original data points lie on a straight line. Of course, in the real world, this does not generally happen.

What the Sign of r Tells Us

A positive value of r means that when x increases, y tends to increase and when x decreases, y tends to decrease (positive correlation).
A negative value of r means that when x increases, y tends to decrease and when x decreases, y tends to increase (negative correlation).
The sign of r is the same as the sign of the slope, b, of the best-fit line.

Note

A strong correlation does not suggest that x causes y or y causes x. We say correlation does not imply causation.

The correlation coefficient is calculated as the quantity of data points times the sum of the quantity of the x-coordinates times the y-coordinates, minus the quantity of the sum of the x-coordinates times the sum of the y-coordinates, all divided by the square root of the quantity of data points times the sum of the x-coordinates squared minus the square of the sum of the x-coordinates, times the number of data points times the sum of the y-coordinates squared minus the square of the sum of the y-coordinates. It can be summarized by the following equation:

$r = \frac{n Σ (x y) - (Σ x) (Σ y)}{\sqrt{[n Σ x^{2} - {(Σ x)}^{2}] [n Σ y^{2} - {(Σ y)}^{2}]}}$

where n is the number of data points.
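The formula can be evaluated step by step in Python. The sketch below applies it to the third exam/final exam data from Table 12.1; the function name `pearson_r` is an illustrative choice.

```python
from math import sqrt

# Data from Table 12.1
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

def pearson_r(xs, ys):
    """Correlation coefficient from the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi ** 2 for xi in xs)
    sum_y2 = sum(yi ** 2 for yi in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

For these data the numerator is 4,335 and the two bracketed terms in the denominator are 898 and 47,594, which gives r ≈ .6631, matching the calculator output.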

Three scatter plots with lines of best fit. The first scatter plot shows points ascending from the lower left to the upper right. The line of best fit has positive slope. The second scatter plot shows points descending from the upper left to the lower right. The line of best fit has negative slope. In the third scatter plot, the points form a horizontal pattern. The line of best fit is a horizontal line.

Figure 12.10 (a) A scatter plot showing data with a positive correlation: 0 < r < 1. (b) A scatter plot showing data with a negative correlation: –1 < r < 0. (c) A scatter plot showing data with zero correlation: r = 0.

The formula for r looks formidable. However, computer spreadsheets, statistical software, and many calculators can calculate r quickly. The correlation coefficient, r, is the bottom item in the output screens for the LinRegTTest on the TI-83, TI-83+, or TI-84+ calculator (see previous section for instructions).

The Coefficient of Determination

The variable r² is called the coefficient of determination and it is the square of the correlation coefficient, but it is usually stated as a percentage, rather than in decimal form. It has an interpretation in the context of the data:

$r^{2},$ when expressed as a percent, represents the percentage of variation in the dependent (predicted) variable y that can be explained by variation in the independent (explanatory) variable x using the regression (best-fit) line.
1 – $r^{2},$ when expressed as a percentage, represents the percentage of variation in y that is not explained by variation in x using the regression line. This can be seen as the scattering of the observed data points about the regression line.

Consider the third exam/final exam example introduced in the previous section.

The line of best fit is: ŷ = –173.51 + 4.83x.
The correlation coefficient is r = .6631.
The coefficient of determination is r² = .6631² = .4397.

Interpret r² in the context of this example.

Approximately 44 percent of the variation (0.4397 is approximately 0.44) in the final exam grades can be explained by the variation in the grades on the third exam, using the best-fit regression line.
Therefore, the rest of the variation (1 – 0.44 = 0.56 or 56 percent) in the final exam grades cannot be explained by the variation of the grades on the third exam with the best-fit regression line. This unexplained variation shows up as the scatter of the observed data points about the regression line.
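One way to see the "explained variation" interpretation numerically is to compare the sum of squared errors about the regression line (SSE) with the total variation of y about its mean. The sketch below does this for the exam data. The name `sst` (total sum of squares) is introduced here for illustration and does not appear in the text.

```python
# Data from Table 12.1
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

# Variation left over about the line, and total variation about the mean
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)

r_squared = 1 - sse / sst        # share of variation explained by the line
unexplained = sse / sst          # share not explained, equal to 1 - r^2

print(f"r^2 = {r_squared:.4f}, unexplained = {unexplained:.4f}")
```

The result, about .4397 explained and .5603 unexplained, agrees with squaring r = .6631.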

12.2 The Regression Equation