Learning Outcomes
By the end of this section, you should be able to:
- 4.3.1 Create scatterplots and calculate and interpret correlation coefficients.
- 4.3.2 Perform linear regression and determine the best-fit linear equation.
- 4.3.3 Use Python to calculate correlation coefficients and determine equations of linear regression models.
We briefly introduced correlation analysis at the beginning of this chapter, but now we want to dig in a little deeper. Data scientists are often interested in knowing if there are relationships, or a correlation, between two numeric quantities. For example, a medical researcher might be interested in investigating if there is a relationship between age and blood pressure. A college professor might be interested to know if there is a relationship between time spent on social media by students and corresponding academic grades.
Correlation analysis allows for the determination of a statistical relationship between two numeric quantities, or variables—an independent variable and a dependent variable. The independent variable is the variable that you can change or control, while the dependent variable is what you measure or observe to see how it responds to those changes. Regression analysis takes this a step further by quantifying the relationship between the two variables, and it can be used to predict one quantity based on a second quantity, assuming there is a significant correlation between the two quantities. For example, the value of a car can be predicted based on the age of the car.
Earlier we noted many business-related applications where correlation and regression analysis are used. For instance, regression analysis can be used to establish a mathematical equation that relates a dependent variable (such as sales) to an independent variable (such as advertising expenditure). But it has many applications apart from business as well. An automotive engineer is interested in the correlation between outside temperature and battery life for an electric vehicle, for instance.
Our discussion here will focus on linear regression—analyzing the relationship between one dependent variable and one independent variable, where the relationship can be modeled using a linear equation.
Scatterplots and Correlation
In correlation analysis, we study the relationship between bivariate data, which is data collected on two variables where the data values are paired with one another. Correlation measures the association between two numeric variables. We may be interested in knowing if there is a correlation between bond prices and interest rates or between the age of a car and the value of the car. To investigate the correlation between two numeric quantities, the first step is to collect data for the two numeric quantities of interest and then create a scatterplot that will graph the ordered pairs. The independent, or explanatory, quantity is labeled the x-variable, and the dependent, or response, quantity is labeled the y-variable.
For example, let’s say that a financial analyst wants to know if the price of Nike stock is correlated with the value of the S&P 500 (Standard & Poor’s 500 stock market index). To investigate this, monthly data can be collected for Nike stock prices and the value of the S&P 500 for a period of time, and a scatterplot can be created and examined. A scatterplot, or scatter diagram, is a graphical display intended to show the relationship between two variables. The setup of the scatterplot is that one variable is plotted on the horizontal axis and the other variable is plotted on the vertical axis. Each pair of data values is considered as an (x, y) point, and the various points are plotted on the diagram. A visual inspection of the plot is then made to detect any patterns or trends on the scatter diagram. Table 4.14 shows the Nike stock price and the corresponding S&P 500 value on a monthly basis over a one-year time period.
Date | S&P 500 ($) | Nike Stock Price ($) |
---|---|---|
Month 1 | 2912.43 | 87.18 |
Month 2 | 3044.31 | 98.58 |
Month 3 | 3100.29 | 98.05 |
Month 4 | 3271.12 | 97.61 |
Month 5 | 3500.31 | 111.89 |
Month 6 | 3363.00 | 125.54 |
Month 7 | 3269.96 | 120.08 |
Month 8 | 3621.63 | 134.70 |
Month 9 | 3756.07 | 141.47 |
Month 10 | 3714.24 | 133.59 |
Month 11 | 3811.15 | 134.78 |
Month 12 | 3943.34 | 140.45 |
To assess linear correlation, examine the graphical trend of the data points on the scatterplot to determine if a straight-line pattern exists (see Figure 4.5). If a linear pattern exists, the correlation may indicate either a positive or a negative correlation. A positive correlation indicates that as the independent variable increases, the dependent variable tends to increase as well, or as the independent variable decreases, the dependent variable tends to decrease (the two quantities move in the same direction). A negative correlation indicates that as the independent variable increases, the dependent variable decreases, or as the independent variable decreases, the dependent variable increases (the two quantities move in opposite directions). If there is no relationship or association between the two quantities, where one quantity changing does not affect the other quantity, we conclude that there is no correlation between the two variables.
From the scatterplot in the Nike stock versus S&P 500 example, we note that the trend reflects a positive correlation in that as the value of the S&P 500 increases, the price of Nike stock tends to increase as well.
When inspecting a scatterplot, it may be difficult to assess a correlation based on a visual inspection of the graph alone. A more precise assessment of the correlation between the two quantities can be obtained by calculating the numeric correlation coefficient (referred to using the symbol r).
The correlation coefficient r is a measure of the strength and direction of the correlation between the independent variable x and the dependent variable y.
The formula for r is shown below; however, software is typically used to calculate the correlation coefficient.

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

where n refers to the number of data pairs and the symbol Σx indicates to sum the x-values.
Note that this value of “r” is sometimes referred to as the “Pearson correlation coefficient.”
Table 4.15 provides a step-by-step procedure on how to calculate the correlation coefficient r.
Step | Representation in Symbols |
---|---|
1. Calculate the sum of the x-values. | Σx |
2. Calculate the sum of the y-values. | Σy |
3. Multiply each x-value by the corresponding y-value and calculate the sum of these products. | Σxy |
4. Square each x-value and then calculate the sum of these squared values. | Σx² |
5. Square each y-value and then calculate the sum of these squared values. | Σy² |
6. Determine the value of n, which is the number of data pairs. | n |
7. Use these results to then substitute into the formula for the correlation coefficient. | r |
Note that since r is calculated using sample data, r is considered a sample statistic and is used to measure the strength of the correlation for the two population variables. Recall that sample data indicates data based on a subset of the entire population.
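To make the procedure in Table 4.15 concrete, the following minimal sketch carries out each step with NumPy for the Nike/S&P 500 data in Table 4.14; it should reproduce the value of r obtained from pearsonr() below.
Python Code
# import library
import numpy as np
# x-data is the S&P 500 value, y-data is the Nike stock price (Table 4.14)
x = np.array([2912.43, 3044.31, 3100.29, 3271.12, 3500.31, 3363.00,
              3269.96, 3621.63, 3756.07, 3714.24, 3811.15, 3943.34])
y = np.array([87.18, 98.58, 98.05, 97.61, 111.89, 125.54,
              120.08, 134.70, 141.47, 133.59, 134.78, 140.45])
n = len(x)                # Step 6: number of data pairs
sum_x = x.sum()           # Step 1: sum of the x-values
sum_y = y.sum()           # Step 2: sum of the y-values
sum_xy = (x * y).sum()    # Step 3: sum of the products x*y
sum_x2 = (x ** 2).sum()   # Step 4: sum of the squared x-values
sum_y2 = (y ** 2).sum()   # Step 5: sum of the squared y-values
# Step 7: substitute into the formula for r
r = (n * sum_xy - sum_x * sum_y) / np.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print("Correlation coefficient: ", round(r, 3))   # 0.923, matching pearsonr() below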
As mentioned, given the complexity of this calculation, software is typically used to calculate the correlation coefficient. There are several options for calculating the correlation coefficient in Python. The example shown uses the scipy.stats library in Python, which includes the function pearsonr() for calculating the Pearson correlation coefficient (r).
We will create two numerical arrays in Python using the np.array() function and then use the pearsonr() function to calculate the Pearson correlation coefficient for the data shown in Table 4.14.
The following Python code provides the correlation coefficient for the Nike stock price dataset as 0.923.
Python Code
# import libraries
import numpy as np
from scipy.stats import pearsonr
# establish x and y data, x-data is S&P500 value, y-data is Nike Stock Price
SP500 = np.array([2912.43, 3044.31, 3100.29, 3271.12, 3500.31, 3363.00, 3269.96, 3621.63, 3756.07, 3714.24, 3811.15, 3943.34])
Nike = np.array([87.18, 98.58, 98.05, 97.61, 111.89, 125.54, 120.08, 134.70, 141.47, 133.59, 134.78, 140.45])
#pearson r returns both value of r and corresponding p-value
r, p = pearsonr(SP500, Nike)
#print value of r , rounded to 3 decimal places
print("Correlation coefficient: ", round(r, 3))
The resulting output will look like this:
Correlation coefficient: 0.923
Interpret a Correlation Coefficient
Once the value of r is calculated, this measurement provides two indicators for the correlation:
- the strength of the correlation based on the value of r
- the direction of the correlation based on the sign of r
The value of r gives us this information:
- The value of r is always between −1 and +1: −1 ≤ r ≤ 1.
- The size of the correlation r indicates the strength of the linear relationship between the two variables. Values of r close to −1 or to +1 indicate a stronger linear relationship.
- If r = 0, there is no linear relationship between the two variables (no linear correlation).
- If r = 1, there is perfect positive correlation. If r = −1, there is perfect negative correlation. In both of these cases, all the original data points lie on a straight line.
The sign of r gives us this information:
- A positive value of r means that when x increases, y tends to increase, and when x decreases, y tends to decrease (positive correlation).
- A negative value of r means that when x increases, y tends to decrease, and when x decreases, y tends to increase (negative correlation).
From the statistical results shown in the Python output, the correlation coefficient is 0.923, which indicates that the relationship between Nike stock and the value of the S&P 500 over this time period represents a strong, positive correlation.
Test a Correlation Coefficient for Significance
The correlation coefficient, r, tells us about the strength and direction of the linear relationship between x and y. The sample data are used to compute r, the correlation coefficient for the sample. If we had data for the entire population (that is, all measurements of interest), we could find the population correlation coefficient, which is labeled with the Greek letter ρ (pronounced “rho”). But because we have only sample data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, r, is our estimate of the unknown population correlation coefficient.
An important step in the correlation analysis is to determine if the correlation is significant. By this, we are asking if the correlation is strong enough to allow meaningful predictions for y based on values of x. One method to test the significance of the correlation is to employ a hypothesis test. The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is close to zero or significantly different from zero. We decide this based on the sample correlation coefficient r and the sample size n.
If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is significant.
- Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between the two variables because the correlation coefficient is significantly different from zero.
- What the conclusion means: There is a significant linear relationship between the two variables.

If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is not significant.
A hypothesis test can be performed to test if the correlation is significant. A hypothesis test is a statistical method that uses sample data to test a claim regarding the value of a population parameter. In this case, the hypothesis test will be used to test the claim that the population correlation coefficient is equal to zero.
Use these hypotheses when performing the hypothesis test:
- Null hypothesis: H₀: ρ = 0
- Alternate hypothesis: Hₐ: ρ ≠ 0
The hypotheses can be stated in words as follows:
- Null hypothesis H₀: The population correlation coefficient is not significantly different from zero. There is not a significant linear relationship (correlation) between x and y in the population.
- Alternate hypothesis Hₐ: The population correlation coefficient is significantly different from zero. There is a significant linear relationship (correlation) between x and y in the population.
A quick shorthand way to test correlations is the relationship between the sample size and the correlation. If |r| ≥ 2/√n, then this implies that the correlation between the two variables demonstrates that a linear relationship exists and is statistically significant at approximately the 0.05 level of significance. As the formula indicates, there is an inverse relationship between the sample size and the required correlation for significance of a linear relationship. With only 10 observations, the required correlation for significance is 0.6325; for 30 observations, the required correlation for significance decreases to 0.3651; and at 100 observations, the required level is only 0.2000.
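As a quick illustration of this rule of thumb, the threshold 2/√n can be computed directly for the sample sizes mentioned above.
Python Code
# required |r| for significance at approximately the 0.05 level: |r| >= 2/sqrt(n)
import math
for n in (10, 30, 100):
    threshold = 2 / math.sqrt(n)
    print("n =", n, " required correlation =", round(threshold, 4))
# output: 0.6325, 0.3651, 0.2 -- matching the values quoted above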
Interpreting the Significance of the Sample Correlation Coefficient
- If r is significant and the scatterplot shows a linear trend, the line can be used to predict the value of y for values of x that are within the domain of observed x-values.
- If r is not significant OR if the scatterplot does not show a linear trend, the line should not be used for prediction.
- If r is significant and the scatterplot shows a linear trend, the line may not be appropriate or reliable for prediction outside the domain of observed x-values in the data.
Example 4.22
Problem
Suppose that the chief financial officer (CFO) of a corporation is investigating the correlation between stock prices and unemployment rate over a period of 10 years and finds the correlation coefficient to be . There are 10 data points in the dataset. Should the CFO conclude that the correlation is significant for the relationship between stock prices and unemployment rate based on a level of significance of 0.05?
Solution
When using a level of significance of 0.05, if |r| ≥ 2/√n, then this implies that the correlation between the two variables demonstrates a linear relationship that is statistically significant at approximately the 0.05 level of significance. In this example, we check if |r| is greater than or equal to 2/√n, where n = 10. Since 2/√10 is approximately 0.632, and the r-value found by the CFO is greater than 0.632, the correlation is deemed significant.
Linear Regression
A regression model is typically developed once the correlation coefficient has been calculated and a determination has been made that the correlation is significant. The regression model can then be used to predict one quantity (the dependent variable) based on the other (the independent variable). For example, a business may want to establish a correlation between the amount the company spent on advertising (the independent variable) versus its recorded sales (the dependent variable). If a strong enough correlation is established, then the business manager can predict sales based on the amount spent on advertising for a given period. In this discussion we will focus on linear regression, where a straight line is used to model the relationship between the two variables. Once a straight-line model is developed, this model can then be used to predict the value of the dependent variable for a specific value of the independent variable.
Method of Least Squares and Residuals
To create a linear model that best fits the points on a scatterplot, researchers generally use the least squares method, a mathematical procedure designed to minimize the sum of the squared vertical distances from the data points to the fitted line.
Recall from algebra that the equation of a straight line is given by:

$$y = mx + b$$

where m is the slope of the line and b is the y-intercept of the line.
The slope measures the steepness of the line, and the y-intercept is the point on the y-axis where the graph crosses, or intercepts, the y-axis.
In linear regression analysis, the equation of the straight line is written in a slightly different way using the model

$$\hat{y} = a + bx$$

In this format, b is the slope of the line, and a is the y-intercept of the line. The notation ŷ is called y-hat and is used to indicate a predicted value of the dependent variable y for a certain value of the independent variable x.
If a line extends uphill from left to right, the slope is a positive value; if the line extends downhill from left to right, the slope is a negative value (see Figure 4.6).
When generating the equation of a line in algebra using y = mx + b, two points were required to generate the equation. However, in regression analysis, all the points in the dataset will be utilized to develop the linear regression model.
The first step in any regression analysis is to create the scatterplot. Then proceed to calculate the correlation coefficient and check this value for significance. If we think that the points show a linear relationship, we would like to draw a line on the scatterplot. This line can be calculated through a process called linear regression. However, we only calculate a regression line if one of the variables helps to explain or predict the other variable. If x is the independent variable and y the dependent variable, then we can use a regression line to predict y for a given value of x.
As an example of a regression equation, assume that a correlation exists between the monthly amount spent on advertising and the monthly revenue for a Fortune 500 company. After collecting data for a certain time period, the company determines the regression equation is of the form

$$\hat{y} = 9376.7 + 61.8x$$

where x represents the monthly amount spent on advertising (in thousands of dollars), ŷ represents the monthly revenues for the company (in thousands of dollars), the slope is 61.8, and the y-intercept is 9376.7.
A scatterplot of the data is shown in Figure 4.7.
The company would like to predict the monthly revenue if its executives decide to spend $150,000 on advertising next month. To determine the estimate of monthly revenue, let x = 150 in the regression equation and calculate a corresponding value for ŷ:

$$\hat{y} = 9376.7 + 61.8(150) = 18{,}646.7$$

This predicted value of ŷ indicates that the anticipated revenue would be $18,646,700, given the advertising spend of $150,000.
Notice that from past data, there may have been a month where the company actually did spend $150,000 on advertising, and thus the company may have an actual result for the monthly revenue. This actual, or observed, amount can be compared to the prediction from the linear regression model to calculate what is called a residual.
A residual is the difference between an observed y-value and the predicted y-value obtained from the linear regression equation. As an example, assume that in a previous month, the actual monthly revenue for an advertising spend of $150,000 was $19,200,000, and thus y = 19,200 (in thousands of dollars). The residual for this data point can be calculated as follows:

$$\text{Residual} = y - \hat{y} = 19{,}200 - 18{,}646.7 = 553.3$$
Notice that residuals can be positive, negative, or zero. If the observed y-value exactly matches the predicted y-value, then the residual will be zero. If the observed y-value is greater than the predicted y-value, then the residual will be a positive value. If the observed y-value is less than the predicted y-value, then the residual will be a negative value.
When formulating the best-fit linear regression line to the points on the scatterplot, the mathematical analysis generates a linear equation where the sum of the squared residuals is minimized. This analysis is referred to as the method of least squares. The result is that the analysis generates a linear equation that is the “best fit” to the points on the scatterplot, in the sense that the line minimizes the sum of the squared differences between the predicted and observed values of y; this is generally referred to as the best-fit linear equation.
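As an illustration, the following minimal sketch computes the residual from the advertising example above, along with the sum of squared residuals for a few additional data points; the x- and y-values in the second part are hypothetical and are included only to show the quantity that least squares minimizes.
Python Code
# a residual is the observed y-value minus the predicted y-value from the fitted line
import numpy as np

def predict(x):
    # best-fit linear equation from the advertising example (amounts in $000)
    return 9376.7 + 61.8 * x

x_obs = 150      # $150,000 spent on advertising
y_obs = 19200    # $19,200,000 of actual revenue
residual = y_obs - predict(x_obs)
print("residual =", round(residual, 1))   # 553.3

# the least squares line minimizes the sum of squared residuals over all data points
x_vals = np.array([100, 150, 200])          # hypothetical x-values for illustration
y_vals = np.array([15800, 19200, 21500])    # hypothetical observed y-values
sse = np.sum((y_vals - predict(x_vals)) ** 2)
print("sum of squared residuals =", round(sse, 1))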
Example 4.23
Problem
Suppose that the chief financial officer of a corporation has created a linear model for the relationship between the company stock and interest rates. When interest rates are at 5%, the company stock has a value of $94. Using the linear model, when interest rates are at 5%, the model predicts the value of the company stock to be $99. Calculate the residual for this data point.
Solution
A residual is the difference between an observed y-value and the predicted y-value obtained from the linear regression equation:

$$\text{Residual} = y - \hat{y} = 94 - 99 = -5$$
The goal in regression analysis is to determine the coefficients a and b in the following regression equation:

$$\hat{y} = a + bx$$

Once the data have been collected, the slope (b) and y-intercept (a) can be calculated using the following formulas:

$$b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}$$

$$a = \bar{y} - b\bar{x}$$

where n refers to the number of data pairs and Σx indicates the sum of the x-values.
Notice that the formula for the y-intercept requires the use of the slope result (b), and thus the slope should be calculated first, and the y-intercept should be calculated second.
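As a check on these formulas, the following minimal sketch computes the slope and intercept directly with NumPy for the Nike/S&P 500 data in Table 4.14 and compares them with the output of the linregress() function from scipy.stats.
Python Code
# compute the least squares slope and intercept from the formulas, then verify with linregress
import numpy as np
from scipy.stats import linregress

x = np.array([2912.43, 3044.31, 3100.29, 3271.12, 3500.31, 3363.00,
              3269.96, 3621.63, 3756.07, 3714.24, 3811.15, 3943.34])   # S&P 500
y = np.array([87.18, 98.58, 98.05, 97.61, 111.89, 125.54,
              120.08, 134.70, 141.47, 133.59, 134.78, 140.45])         # Nike stock price

n = len(x)
# slope from the least squares formula
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
# the y-intercept uses the slope, so it is calculated second
a = np.mean(y) - b * np.mean(x)

result = linregress(x, y)
print("slope:     ", round(b, 4), "vs linregress:", round(result.slope, 4))
print("intercept: ", round(a, 4), "vs linregress:", round(result.intercept, 4))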
When making predictions for y, it is always important to plot a scatter diagram first. If the scatterplot indicates that there is a linear relationship between the variables, then it is reasonable to use a best-fit line to make predictions for y, provided x is within the domain of x-values in the sample data, but not necessarily for x-values outside that domain.
Using Technology for Linear Regression
Typically, technology is used to calculate the best-fit linear model as well as to calculate correlation coefficients and create scatterplots. Details of using Python for these calculations are provided in Using Python for Correlation and Linear Regression.
Assumptions for Linear Regression
Testing the significance of the correlation coefficient requires that certain assumptions about the data are satisfied. The premise of this test is that the data are a sample of observed points taken from a larger population. We have not examined the entire population because it is not possible or feasible to do so. We are examining the sample to draw a conclusion about whether the linear relationship that we see between x and y in the sample data provides strong enough evidence that we can conclude that there actually is a linear relationship between x and y in the population.
The regression line equation that we calculate from the sample data gives the best-fit line for our particular sample. We want to use this best-fit line for the sample as an estimate of the best-fit line for the population (Figure 4.8). Examining the scatterplot and testing the significance of the correlation coefficient helps us determine if it is appropriate to do this.
These are the assumptions underlying the test of significance:
- There is a linear relationship in the population that models the average value of y for varying values of x. In other words, the expected value of y for each particular x-value lies on a straight line in the population. (We do not know the equation for the line for the population. Our regression line from the sample is our best estimate of this line in the population.)
- The y-values for any particular x-value are normally distributed about the line. This implies that there are more y-values scattered closer to the line than are scattered farther away. Assumption (1) implies that these normal distributions are centered on the line: the means of these normal distributions of y-values lie on the line.
- The standard deviations of the population y-values about the line are equal for each value of x. In other words, each of these normal distributions of y-values has the same shape and spread about the line.
- The residual errors are mutually independent (no pattern); a residual plot, as sketched after this list, is one informal way to check this.
- The data are produced from a well-designed random sample or randomized experiment.
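One informal way to check the linearity, equal-spread, and independence assumptions is to plot the residuals against the x-values and look for any pattern or changing spread. The following is a minimal sketch using the Nike/S&P 500 data from Table 4.14.
Python Code
# residual plot: residuals should scatter randomly about zero with roughly constant spread
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

x = np.array([2912.43, 3044.31, 3100.29, 3271.12, 3500.31, 3363.00,
              3269.96, 3621.63, 3756.07, 3714.24, 3811.15, 3943.34])   # S&P 500
y = np.array([87.18, 98.58, 98.05, 97.61, 111.89, 125.54,
              120.08, 134.70, 141.47, 133.59, 134.78, 140.45])         # Nike stock price

slope, intercept, r_value, p_value, std_err = linregress(x, y)
residuals = y - (intercept + slope * x)   # observed y minus predicted y

plt.scatter(x, residuals)
plt.axhline(0, color="red")               # reference line at zero
plt.title("Residuals versus S&P 500 Value")
plt.xlabel("S&P 500")
plt.ylabel("Residual")
plt.show()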
Using Python for Correlation and Linear Regression
Once a correlation has been deemed significant, a linear regression model is developed. The goal in the regression analysis is to determine the coefficients a and b in the following regression equation:

$$\hat{y} = a + bx$$
These formulas can be quite cumbersome, especially for a significant number of data pairs, and thus software is often used (such as Excel, Python, or R).
Develop a Linear Regression Model
In the following example, Python will be used for the following analysis based on a given dataset of (x, y) data.
- Create a scatterplot.
- Calculate the correlation coefficient.
- Construct the best-fit linear equation.
- Predict values of the dependent variable for given values of the independent variable.
There are various ways to accomplish these tasks in Python, but we will use several functions available within the scipy and matplotlib libraries, such as:
- pearsonr() for the correlation coefficient
- linregress() for the linear regression model
- plt.scatter() for scatterplot generation
Example 4.24
Problem
A marketing manager is interested in studying the correlation between the amount spent on advertising and revenue at a certain company. Twelve months of data were collected as shown in Table 4.16. (Dollar amounts are in thousands of dollars.)
Use Python to perform the following analysis:
- Generate a scatterplot.
- Calculate the correlation coefficient r.
- Determine if the correlation is significant (use level of significance of 0.05).
- Construct a linear regression model.
- Use the model to predict the revenue for a certain month where the amount spent on advertising is $100.
Month | Advertising Spend ($000) | Revenue ($000) |
---|---|---|
Jan | 49 | 12210 |
Feb | 145 | 17590 |
Mar | 57 | 13215 |
Apr | 153 | 19200 |
May | 92 | 14600 |
Jun | 83 | 14100 |
Jul | 117 | 17100 |
Aug | 142 | 18400 |
Sep | 69 | 14100 |
Oct | 106 | 15500 |
Nov | 109 | 16300 |
Dec | 121 | 17020 |
Solution
- The first step in the analysis is to generate a scatterplot. (Recall that scatterplots were also discussed in Scatterplots and Correlation.)
The Python function plt.scatter() can be used to generate the scatterplot for the data. Note that we consider advertising to be the independent variable (x-data) and revenue to be the dependent variable (y-data) since revenue depends on the amount of advertising.
Here is the Python code to generate the scatterplot:
Python Code
# import the matplotlib library
import matplotlib.pyplot as plt
# define the x-data, which is amount spent on advertising
x = [49, 145, 57, 153, 92, 83, 117, 142, 69, 106, 109, 121]
# define the y-data, which is revenue
y = [12210, 17590, 13215, 19200, 14600, 14100, 17100, 18400, 14100, 15500, 16300, 17020]
# use the scatter function to generate a scatterplot
plt.scatter(x, y)
# Add a title using the title function
plt.title("Revenue versus Advertising for a Company")
# Add labels to the x and y axes by using xlabel and ylabel functions
plt.xlabel("Advertising $000")
plt.ylabel("Revenue $000")
# display the plot
plt.show()
The resulting output will look like this:
- The Python function pearsonr() can be used to generate the correlation coefficient, r.
Here is the Python code to calculate the correlation coefficient:
Python Code
# import libraries
import numpy as np
from scipy.stats import pearsonr
# define the x-data, which is amount spent on advertising
x = [49, 145, 57, 153, 92, 83, 117, 142, 69, 106, 109, 121]
# define the y-data, which is revenue
y = [12210, 17590, 13215, 19200, 14600, 14100, 17100, 18400, 14100, 15500, 16300, 17020]
#pearson r returns both value of r and corresponding p-value
r, p = pearsonr(x, y)
#print value of r, rounded to 3 decimal places
print("Correlation coefficient: ", round(r, 3))
The resulting output will look like this:
Correlation coefficient: 0.981
This value of r indicates a strong, positive correlation between advertising spend and revenue.
- When using a level of significance of 0.05, if |r| ≥ 2/√n, then this implies that the correlation between the two variables demonstrates a linear relationship that is statistically significant at approximately the 0.05 level of significance. In this example, we check if |r| is greater than or equal to 2/√n, where n = 12. Since 2/√12 is approximately 0.577, this indicates that the r-value of 0.981 is greater than 0.577, and thus the correlation is deemed significant.
- The linear regression model will be of the form ŷ = a + bx.
To determine the slope and y-intercept of the regression model, the Python function linregress() can be used. This function takes the x and y data as inputs and provides the slope and intercept as well as other regression-related output.
Here is the Python code to generate the slope and y-intercept of the best-fit line:
Python Code
# import libraries
import numpy as np
from scipy.stats import linregress
# define the x-data, which is amount spent on advertising
x = [49, 145, 57, 153, 92, 83, 117, 142, 69, 106, 109, 121]
# define the y-data, which is revenue
y = [12210, 17590, 13215, 19200, 14600, 14100, 17100, 18400, 14100, 15500, 16300, 17020]
# linregress returns the slope, intercept, r-value, p-value, and standard error
slope, intercept, r_value, p_value, std_err = linregress(x, y)
print("slope =", slope)
print("intercept =", intercept)
The resulting output will look like this:
slope = 61.79775818816664
intercept = 9376.698881009072
Based on this Python output, the regression equation can be written as

$$\hat{y} = 9376.7 + 61.8x$$

where x represents the amount spent on advertising (in thousands of dollars) and ŷ represents the amount of revenue (in thousands of dollars).
It is also useful to plot this best-fit line superimposed on the scatterplot to show the strong correlation between advertising spend and revenue, as follows (see Figure 4.9):
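A minimal sketch of one way to superimpose the best-fit line on the scatterplot is shown below; this is an illustration and may differ from the exact code used to produce Figure 4.9.
Python Code
# plot the best-fit line on top of the scatterplot of the advertising data
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

x = np.array([49, 145, 57, 153, 92, 83, 117, 142, 69, 106, 109, 121])
y = np.array([12210, 17590, 13215, 19200, 14600, 14100, 17100, 18400, 14100, 15500, 16300, 17020])

slope, intercept, r_value, p_value, std_err = linregress(x, y)

plt.scatter(x, y, label="Observed data")
xs = np.sort(x)                                   # sort x so the line is drawn left to right
plt.plot(xs, intercept + slope * xs, color="red", label="Best-fit line")
plt.title("Revenue versus Advertising for a Company")
plt.xlabel("Advertising $000")
plt.ylabel("Revenue $000")
plt.legend()
plt.show()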
- The regression equation is ŷ = 9376.7 + 61.8x.
The given amount of advertising spend of $100 corresponds to a given x-value, so replace x with 100 in this regression equation to generate a predicted revenue:

$$\hat{y} = 9376.7 + 61.8(100) = 15{,}556.7$$
Since both advertising and revenue are expressed in thousands of dollars, the conclusion is that if advertising spend is $100,000, the predicted revenue for the company will be $15,556,700.
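The same prediction can also be made directly in Python using the slope and intercept returned by linregress(), as in the following minimal sketch.
Python Code
# predict revenue (in $000) for an advertising spend of $100 (in $000)
from scipy.stats import linregress

x = [49, 145, 57, 153, 92, 83, 117, 142, 69, 106, 109, 121]
y = [12210, 17590, 13215, 19200, 14600, 14100, 17100, 18400, 14100, 15500, 16300, 17020]

slope, intercept, r_value, p_value, std_err = linregress(x, y)
predicted = intercept + slope * 100
print("predicted revenue =", round(predicted, 1))
# about 15,556 thousand dollars; small differences from the hand calculation above
# come from rounding the slope and intercept to one decimal place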
Interpret and Apply the Slope and y-Intercept
The slope of the line, b, describes how changes in the variables are related. It is important to interpret the slope of the line in the context of the situation represented by the data. You should be able to write a sentence interpreting the slope in plain English.
Interpretation of the Slope
The slope of the best-fit line tells us, on average, how the dependent variable (y) changes for every one-unit increase in the independent variable (x).
In the previous example, the linear regression model for the monthly amount spent on advertising and the monthly revenue for a company for 12 months was generated as follows:

$$\hat{y} = 9376.7 + 61.8x$$
Since the slope was determined to be 61.8, the company can interpret this to mean that for every $1,000 spent on advertising, on average, this will result in an increase in revenues of $61,800.
The intercept of the regression equation is the corresponding y-value when x = 0.
Interpretation of the Intercept
The intercept of the best-fit line tells us the expected mean value of y in the case where the x-variable is equal to zero.
However, in many scenarios it may not make sense for the x-variable to equal zero, and in these cases the intercept does not have any meaning in the context of the problem. In other examples, an x-value of zero is outside the range of the x-data that was collected. In this case, we should not assign any interpretation to the y-intercept.
In the previous example, the range of data collected for the x-variable was from $49,000 to $153,000 spent per month on advertising. Since this interval does not include an x-value of zero, we would not provide any interpretation for the intercept.