12.1 Linear Equations
The most basic type of association is a linear association. This type of relationship can be defined algebraically by the equations used, numerically with actual or predicted data values, or graphically from a plotted curve. (Lines are classified as straight curves.) Algebraically, a linear equation typically takes the form y = mx + b, where m and b are constants, x is the independent variable, and y is the dependent variable. In a statistical context, a linear equation is written in the form y = a + bx, where a and b are the constants. This form is used to help readers distinguish the statistical context from the algebraic context. In the equation y = a + bx, the constant b, called a coefficient, represents the slope. The constant a is called the y-intercept.
The slope of a line is a value that describes the rate of change between the independent and dependent variables. The slope tells us how the dependent variable (y) changes, on average, for every one-unit increase in the independent variable (x). The y-intercept is the value of the dependent variable when the independent variable equals zero.
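As a small illustration of how the slope and y-intercept are read, the sketch below evaluates y = a + bx at a few points. The coefficients are hypothetical values chosen only for this example, and the function name predict is an assumption, not something defined in the chapter.

```python
def predict(a, b, x):
    """Return the value of y = a + b*x for intercept a and slope b."""
    return a + b * x

# Hypothetical coefficients, chosen only for illustration.
a, b = 2.2, 0.6

# When x equals zero, the equation returns the y-intercept a.
y_at_zero = predict(a, b, 0)

# Raising x by one unit changes y by the slope b.
change_per_unit = predict(a, b, 4) - predict(a, b, 3)

print(y_at_zero, change_per_unit)
```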
12.2 Scatter Plots
Scatter plots are particularly helpful graphs when we want to see if there is a linear relationship among data points. They indicate both the direction of the relationship between the x variables and the y variables, and the strength of the relationship. We quantify the strength of the relationship between an independent variable and a dependent variable using linear regression and the correlation coefficient.
12.3 The Regression Equation
A regression line, or a line of best fit, can be drawn on a scatter plot and used to predict outcomes for the x and y variables in a given data set or sample data. There are several ways to find a regression line, but usually the least-squares regression line is used because it minimizes the sum of the squared errors. Residuals, also called “errors,” measure the distance between the actual value of y and the estimated value of y. The line of best fit is the line for which the Sum of Squared Errors (SSE) is at its minimum. Regression lines can be used to predict values within the range of the given data, but should not be used to make predictions for values outside that range.
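The least-squares idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the data are made up for the example, the helper names least_squares and sse are hypothetical, and the slope and intercept use the standard closed-form least-squares formulas rather than any calculator routine.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

def sse(xs, ys, a, b):
    """Sum of squared residuals (actual y minus estimated y)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
best = sse(xs, ys, a, b)
nudged = sse(xs, ys, a, b + 0.1)   # any other line gives a larger SSE

print(a, b, best, nudged)
```

Changing the slope or the intercept away from the fitted values increases the SSE, which is what "least squares" refers to.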
The correlation coefficient r measures the strength of the linear association between x and y. The value of r is always between –1 and +1. When r is positive, x and y tend to increase and decrease together. When r is negative, y tends to decrease as x increases, and y tends to increase as x decreases. The coefficient of determination, r², is equal to the square of the correlation coefficient. When expressed as a percent, r² represents the percent of variation in the dependent variable y that can be explained by variation in the independent variable x using the regression line.
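As a rough sketch, the correlation coefficient and the coefficient of determination can be computed directly from their definitions. The data below are made up for illustration, and the function name correlation is an assumption of this example.

```python
import math

def correlation(xs, ys):
    """Return the sample correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_squared = r ** 2   # coefficient of determination

print(r, r_squared)
```

For these data r is positive, so y tends to increase with x, and r² of about 0.6 means roughly 60% of the variation in y is explained by the regression line.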
12.4 Testing the Significance of the Correlation Coefficient
Linear regression is a procedure for fitting a straight line of the form ŷ = a + bx to data. The conditions for regression are:
- Linear: In the population, there is a linear relationship that models the average value of y for different values of x.
- Independent: The residuals are assumed to be independent.
- Normal: The y values are distributed normally for any value of x.
- Equal variance: The standard deviation of the y values is equal for each x value.
- Random: The data are produced from a well-designed random sample or randomized experiment.
The slope b and intercept a of the least-squares line estimate the slope β and intercept α of the population (true) regression line. To estimate the population standard deviation of y, σ, use the standard deviation of the residuals, s = √(SSE / (n – 2)). The variable ρ (rho) is the population correlation coefficient. To test the null hypothesis H0: ρ = hypothesized value, use a linear regression t-test. The most common null hypothesis is H0: ρ = 0, which indicates there is no linear relationship between x and y in the population. The TI-83, 83+, 84, 84+ calculator function LinRegTTest can perform this test (STATS TESTS LinRegTTest).
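For readers without a calculator, the quantities behind the test of H0: ρ = 0 can be sketched in Python. This is an illustration under assumptions, not a replacement for LinRegTTest: the data are made up, the function name regression_t_statistic is hypothetical, and the statistic uses the standard formula t = r√(n – 2) / √(1 – r²) with n – 2 degrees of freedom. The p-value is not computed here because it requires the Student's t-distribution.

```python
import math

def regression_t_statistic(xs, ys):
    """Return (s, r, t, df) for the linear regression t-test of H0: rho = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)

    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

    s = math.sqrt(sse / (n - 2))          # standard deviation of the residuals
    r = sxy / math.sqrt(sxx * syy)        # sample correlation coefficient
    # Undefined when r is exactly +1 or -1 (division by zero).
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return s, r, t, n - 2

# Made-up sample data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

s, r, t, df = regression_t_statistic(xs, ys)
print(s, r, t, df)
```

The resulting t value would then be compared with a t-distribution with df degrees of freedom to decide whether to reject H0.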
12.5 Prediction
After confirming that the correlation between the variables is strong and calculating the line of best fit, you can use the least-squares regression line to make predictions about your data.
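One way to sketch prediction in code is to evaluate the fitted line only for x values inside the observed range. The coefficients and data below are hypothetical, and the function name predict_in_range and its choice to raise an error for out-of-range values are assumptions of this example.

```python
def predict_in_range(a, b, xs, x_new):
    """Return a + b*x_new, refusing x values outside the observed x range."""
    lo, hi = min(xs), max(xs)
    if not lo <= x_new <= hi:
        raise ValueError(
            f"x = {x_new} is outside the observed range [{lo}, {hi}]"
        )
    return a + b * x_new

# Hypothetical fitted coefficients and observed x values, for illustration only.
a, b = 2.2, 0.6
xs = [1, 2, 3, 4, 5]

y_hat = predict_in_range(a, b, xs, 3)   # x = 3 lies inside the data range
print(y_hat)
```

Raising an error is only one possible design; returning the prediction together with a warning would serve the same purpose.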
12.6 Outliers
To determine if a point is an outlier, do one of the following:
- Input the following equations into the TI-83, 83+, 84, 84+:
y1 = a + bx
y2 = a + bx + 2s
y3 = a + bx – 2s
where s is the standard deviation of the residuals.
If any point is above y2 or below y3, then the point is considered to be an outlier.
- Use the residuals and compare their absolute values to 2s where s is the standard deviation of the residuals. If the absolute value of any residual is greater than or equal to 2s, then the corresponding point is an outlier.
- Note: The calculator function LinRegTTest (STATS TESTS LinRegTTest) calculates s.
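The residual-based rule above can be sketched in Python as follows. The data are made up so that one point clearly departs from the trend, and the function name find_outliers is an assumption of this example. The fit and s are computed with the standard least-squares formulas, with s = √(SSE / (n – 2)).

```python
import math

def find_outliers(xs, ys):
    """Return the (x, y) points whose residual satisfies |residual| >= 2s."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar

    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

    return [(x, y) for x, y, e in zip(xs, ys, residuals) if abs(e) >= 2 * s]

# Made-up data: every point follows y = 2x except the point at x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 4, 6, 8, 20, 12, 14, 16, 18, 20]

outliers = find_outliers(xs, ys)
print(outliers)
```

Checking whether a point lies above y2 or below y3 is the same comparison expressed graphically, since those two lines sit 2s above and below the regression line.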