12.1 Linear Equations
The most basic type of association is a linear association. This type of relationship can be described algebraically by an equation, numerically with actual or predicted data values, or graphically with a plotted curve. (A line is classified as a straight curve.) Algebraically, a linear equation typically takes the form y = mx + b, where m and b are constants, x is the independent variable, and y is the dependent variable. In a statistical context, a linear equation is written in the form y = a + bx, where a and b are the constants. This form is used to help you distinguish the statistical context from the algebraic context. In the equation y = a + bx, the constant b, the coefficient that multiplies the x variable, is called the slope. The slope describes the rate of change between the independent and dependent variables; in other words, it describes the change that occurs in the dependent variable as the independent variable changes. In the equation y = a + bx, the constant a is called the y-intercept. Graphically, the y-intercept is the y-coordinate of the point where the graph of the line crosses the y-axis. At this point, x = 0.
The slope of a line is a value that describes the rate of change between the independent and dependent variables. The slope tells us how the dependent variable (y) changes, on average, for every one-unit increase in the independent variable (x). The y-intercept is the value of the dependent variable when the independent variable equals zero. Graphically, a line takes one of three forms depending on the sign of its slope: it rises from left to right when b > 0, is horizontal when b = 0, and falls from left to right when b < 0.
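As a small illustration of these definitions, the sketch below evaluates y = a + bx in Python. The coefficients a = 2.2 and b = 0.6 and the function names are hypothetical, chosen only to show that the value at x = 0 is the y-intercept and that a one-unit increase in x changes y by the slope.

```python
def line_value(a: float, b: float, x: float) -> float:
    """Evaluate the statistical form of a line, y = a + bx."""
    return a + b * x


def slope_direction(b: float) -> str:
    """Describe the line's direction from the sign of its slope."""
    if b > 0:
        return "rises from left to right"
    if b < 0:
        return "falls from left to right"
    return "horizontal"


# Hypothetical coefficients, chosen only for illustration.
a, b = 2.2, 0.6

# At x = 0 the value of the line is the y-intercept a.
y_at_zero = line_value(a, b, 0)

# A one-unit increase in x changes y by the slope b.
change_per_unit = line_value(a, b, 4) - line_value(a, b, 3)

print(y_at_zero, round(change_per_unit, 10), slope_direction(b))
```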
12.2 The Regression Equation
A regression line, or line of best fit, can be drawn on a scatter plot and used to predict the value of y for a given value of x in a data set or sample. There are several ways to find a regression line, but usually the least-squares regression line is used. Residuals, also called errors, measure the vertical distance between the actual value of y and the estimated value of y. The least-squares line is the line for which the sum of squared errors, or SSE, is at its minimum. Regression lines can be used to predict values within the range of the given data but should not be used to make predictions for x values outside that range.
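The least-squares idea can be sketched in a few lines of Python. The data set below is made up for illustration, and the function names are hypothetical. The slope is computed with the standard least-squares formulas (sum of cross-deviations divided by sum of squared x-deviations), and the SSE of the fitted line is compared with the SSE of a nearby, arbitrarily chosen line.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + bx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


def sum_squared_errors(xs, ys, a, b):
    """SSE: the sum of squared residuals y - (a + bx)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
best_sse = sum_squared_errors(xs, ys, a, b)

# Any other line, such as y = 2.0 + 0.7x, gives a larger SSE.
other_sse = sum_squared_errors(xs, ys, 2.0, 0.7)

print(round(a, 4), round(b, 4), round(best_sse, 4), round(other_sse, 4))
```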
The correlation coefficient, r, measures the strength of the linear association between x and y. The value of r is always between –1 and +1. When r is positive, x and y tend to increase and decrease together. When r is negative, y tends to decrease as x increases, and y tends to increase as x decreases. The coefficient of determination, r², is equal to the square of the correlation coefficient. When expressed as a percentage, r² represents the percentage of variation in the dependent variable, y, that can be explained by variation in the independent variable, x, using the regression line.
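A minimal sketch of computing r and r² follows, using the standard formula for the sample correlation coefficient written in terms of sums of deviations from the means. The data are made up for illustration and the function name is hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_squared = r ** 2

# r is positive here, so x and y tend to increase together;
# r_squared * 100 is the percentage of variation in y explained by x.
print(round(r, 4), round(r_squared * 100, 1))
```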
12.3 Testing the Significance of the Correlation Coefficient (Optional)
Linear regression is a procedure for fitting a straight line of the form ŷ = a + bx to data. The conditions for regression are as follows:
- Linear: In the population, there is a linear relationship that models the average value of y for different values of x.
- Independent: The residuals are assumed to be independent.
- Normal: The y values are distributed normally for any value of x.
- Equal variance: The standard deviation of the y values is equal for each x value.
- Random: The data are produced from a well-designed random sample or a randomized experiment.
The slope b and intercept a of the least-squares line estimate the slope β and intercept α of the population (true) regression line. To estimate the population standard deviation of y (σ), use the standard deviation of the residuals: s = √(SSE / (n – 2)), where n is the number of data points. The variable ρ (rho) is the population correlation coefficient. To test the null hypothesis, H0: ρ = hypothesized value, use a linear regression t-test. The most common null hypothesis is H0: ρ = 0, which indicates there is no linear relationship between x and y in the population. The TI-83, 83+, 84, 84+ calculator function LinRegTTest can perform this test (STATS, TESTS, LinRegTTest).
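For readers without a calculator at hand, the quantities behind this test can be sketched in Python. This is an illustration under stated assumptions, not a reproduction of what LinRegTTest does internally: it uses the usual formulas s = √(SSE / (n – 2)) and, for H0: ρ = 0, the test statistic t = r√(n – 2) / √(1 – r²) with n – 2 degrees of freedom. The data are made up and the function name is hypothetical. A p-value would still have to come from a t table or statistical software.

```python
import math


def regression_t_summary(xs, ys):
    """Return s, r, the t statistic for H0: rho = 0, and its degrees of freedom.

    Assumes at least three points and |r| < 1 (otherwise t is undefined).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)

    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

    s = math.sqrt(sse / (n - 2))                      # std. deviation of residuals
    r = sxy / math.sqrt(sxx * syy)                    # sample correlation
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # test statistic
    return {"s": s, "r": r, "t": t, "df": n - 2}


# Made-up sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

summary = regression_t_summary(xs, ys)
print({key: round(value, 4) for key, value in summary.items()})
```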
12.4 Prediction (Optional)
After determining that x and y are strongly correlated and calculating the line of best fit, you can use the least-squares regression line to make predictions about your data.
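A prediction from the fitted line might look like the following sketch. The data are made up, the function names are hypothetical, and the range check is one possible way to enforce the earlier caution against predicting for x values outside the observed data.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + bx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def predict(x, xs, a, b):
    """Predict y-hat at x, refusing x values outside the observed range."""
    if not min(xs) <= x <= max(xs):
        raise ValueError(f"x = {x} is outside the observed range of the data")
    return a + b * x


# Made-up sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
prediction = predict(3.5, xs, a, b)   # inside the range 1 to 5
print(round(prediction, 4))

try:
    predict(10, xs, a, b)             # outside the range: rejected
except ValueError as error:
    print(error)
```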
12.5 Outliers
To determine whether a point is an outlier, do one of the following:
- Input the following equations into the TI 83, 83+, 84, or 84+ calculator:
  y1 = a + bx
  y2 = a + bx + 2s
  y3 = a + bx – 2s
  where s is the standard deviation of the residuals. If any point is above y2 or below y3, then the point is considered to be an outlier.
- Use the residuals and compare their absolute values to 2s, where s is the standard deviation of the residuals. If the absolute value of any residual is greater than or equal to 2s, then the corresponding point is an outlier.
- Note: The calculator function LinRegTTest (STATS, TESTS, LinRegTTest) calculates s.
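The residual-based check can also be sketched in Python. The data below are made up, with one point deliberately placed far from the others, and the function name is hypothetical. The sketch fits the least-squares line, computes s = √(SSE / (n – 2)), and flags any point whose residual has absolute value of at least 2s.

```python
import math


def find_outliers(xs, ys):
    """Return (points flagged as outliers, s) using the 2s residual rule."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar

    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

    # A point is flagged when |residual| >= 2s.
    flagged = [(x, y) for x, y, e in zip(xs, ys, residuals) if abs(e) >= 2 * s]
    return flagged, s


# Made-up data: every point lies on y = x except the one at x = 4.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [1, 2, 3, 10, 5, 6, 7]

outliers, s = find_outliers(xs, ys)
print(outliers, round(s, 4))
```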