Learning Outcomes
By the end of this section, you should be able to:
- 6.3.1 Use bootstrapping to analyze variation in linear regression.
- 6.3.2 Outline assumptions relevant to a multiple linear regression.
- 6.3.3 Perform multiple linear regressions and analyze significance of coefficients.
Regression is a term that applies to many different techniques in data analysis and machine learning. The main idea is to model the relationship between one or more input (or independent) variables and an output (or dependent) variable $y$ by creating an explicit functional equation, $y = f(x_1, x_2, \ldots, x_n)$. Once the equation is found, new inputs can be plugged into it, and the results serve as predictions. The types of regression are based on the form of the equations that are generated: linear, polynomial, logistic, exponential, etc. Most of these types of regression analysis may be done on a single input or using multiple input variables. Some of these topics have already been discussed in this textbook, so in this chapter we delve into more advanced analysis. First, we explore bootstrapping as a way to learn more about the reliability and variability of the parameters of a linear regression. Then we discuss multiple linear and logistic regressions, including how to perform these tasks in Python.
Linear Regression and Bootstrapping
The fundamentals of linear regression were presented in Inferential Statistics and Regression Analysis. While linear regression is a very powerful tool, it may suffer from inaccuracies due to noisy data. How confident can we be in the parameters of a linear model? In this section, we use a resampling technique called bootstrapping, which is aimed at estimating variation, to build a linear regression model whose parameters fall within a specified confidence interval. Recall that we defined bootstrapping in Bootstrapping Methods as a technique in which repeated samples are taken hundreds or thousands of times and the sample mean is calculated for each sample.
Bootstrapping Fundamentals
Recall, for a set of bivariate data $D$—that is, a set of points $(x_i, y_i)$ in the plane—the linear regression or line of best fit is a line with equation $\hat{y} = a + bx$, in which the values of $a$ and $b$ are chosen to minimize the squared vertical distance from each $y_i$ to the predicted value, $\hat{y}_i = a + bx_i$. The values of $a$ and $b$ clearly depend on the points of the dataset $D$, and so these parameters are just our best-guess estimates for the “true” values of $a$ and $b$. Perhaps a different sample from the population would yield slightly different values of $a$ and $b$. We may not have the luxury of resampling from the population, and so we do the next best thing: resample from the set $D$ itself! This process is called bootstrapping because you are attempting to “pull yourself up by your own bootstraps,” as the old saying goes. Here is an outline of the process.
- Step 1: Choose a large sample of points from the dataset $D$, with replacement (so repeated data points are allowed).
- Step 2: Find the linear regression for this set of sample points. Record the $a$ (intercept) and $b$ (slope) values in lists $A$ and $B$, respectively.
- Step 3: Repeat steps 1 and 2 many times.
- Step 4: The sets $A$ and $B$ represent distributions of possible values of the parameters $a$ and $b$ for the true linear fit, $y = a + bx$.
- Step 5: Let $\bar{a} = \mathrm{mean}(A)$ and $\bar{b} = \mathrm{mean}(B)$. These are the parameters we will choose for our (bootstrapped) linear regression, $y = \bar{a} + \bar{b}x$.
- Step 6: Find the standard deviations of $A$ and $B$ as well in order to produce a confidence interval for each parameter.
These steps will next be illustrated with the help of Python.
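As a quick preview, the following minimal sketch runs these steps on a small synthetic dataset; the data, the number of resamples, and the sample size are arbitrary choices for illustration, and the next section applies the same process to a real dataset.
Python Code
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.utils import resample
# Synthetic bivariate data (illustration only): roughly y = 2 + 3x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, 2, 200)
D = np.column_stack([x, y])
A, B = [], []   # bootstrapped intercepts and slopes
for _ in range(1000):                                             # Step 3: repeat many times
    sample = resample(D, n_samples=len(D), replace=True)          # Step 1: resample D with replacement
    model = LinearRegression().fit(sample[:, [0]], sample[:, 1])  # Step 2: fit a line
    A.append(model.intercept_)                                    # ... record the intercept
    B.append(model.coef_[0])                                      # ... and the slope
# Steps 4-6: the means give the bootstrapped parameters; the standard deviations measure spread
print("intercept: mean = %.3f, st.dev. = %.3f" % (np.mean(A), np.std(A, ddof=1)))
print("slope:     mean = %.3f, st.dev. = %.3f" % (np.mean(B), np.std(B, ddof=1)))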
Linear Regression with Bootstrapping in Python
In NCAA Division I basketball, a statistic called BARTHAG provides a measure of many different factors that affect a team’s chances of winning against an average team in the same division. This statistic (along with many others) can be found in the dataset NCAA-2021-stats.csv. Let $D$ be the dataset consisting of pairs (BARTHAG, W), where W stands for the number of wins in 2021. There are 346 data points in this set. Here is the code that produces the scatterplot and regression line.
Python Code
# Import libraries
import pandas as pd ## for dataset management
import matplotlib.pyplot as plt ## for data visualization
from sklearn.linear_model import LinearRegression
# Read data
data = pd.read_csv('NCAA-2021-stats.csv')
data = data[['BARTHAG', 'W']].dropna()
# Find the linear regression model
model = LinearRegression()
x = data['BARTHAG'].values.reshape(-1,1)
y = data['W'].values.reshape(-1,1)
model.fit(x,y)
y_pred = model.predict(x)
# Visualizing the linear regression on the scatterplot
plt.scatter(x,y)
plt.plot(x, y_pred, color='red')
plt.show()
a = model.intercept_[0]
b = model.coef_[0][0]
print("Linear model: y = %1.3f + %2.3f x" % (a, b))
The resulting output will look like this:
Linear model: y = 4.820 + 14.682 x
Then, the set $D$ is resampled 50 times, using 100 data points each time. A linear regression is run each time, producing a set of 50 values of $a$ and 50 values of $b$. All 50 regression lines are shown on the same graph.
Python Code
from sklearn.utils import resample
data.plot.scatter(x='BARTHAG', y='W')
# Build arrays for slopes and intercepts
slopes = []
intercepts = []
# Resampling 100 points 50 times
for i in range(50):
    data1 = resample(data, n_samples=100, replace=True)
    x = data1['BARTHAG'].values.reshape(-1,1)
    y = data1['W'].values.reshape(-1,1)
    model.fit(x,y)
    y_pred = model.predict(x)
    slopes.append(model.coef_[0][0])
    intercepts.append(model.intercept_[0])
    plt.plot(x, y_pred, color='red')
plt.show()
The resulting output will look like this (results for bootstrapping models may slightly vary):
Note how all the lines are very close to one another, but there is some variation. The mean and standard deviation of intercepts and slopes may be found using the statistics library in Python.
Python Code
import statistics as stats # for mean and standard deviation
a_bar = stats.mean(intercepts)
a_std = stats.stdev(intercepts)
b_bar = stats.mean(slopes)
b_std = stats.stdev(slopes)
print("Intercept: Mean = %1.3f, St.Dev. = %1.3f " % (a_bar, a_std) )
print("Slope: Mean = %1.3f, St.Dev. = %1.3f " % (b_bar, b_std) )
The resulting output will look like this:
Intercept: Mean = 4.947, St.Dev. = 0.645
Slope: Mean = 14.489, St.Dev. = 1.267
Recall that, before bootstrapping, we obtained a linear model with intercept 4.82 and slope 14.682. These are very close to the intercept and slope obtained from the bootstrapping method. The advantage of bootstrapping, though, is that we now have measures of spread (standard deviation), and so we can produce a confidence interval for each parameter. Recall that the 95% confidence interval is found at the sample mean plus or minus roughly $2 \cdot SE$, where $SE = s/\sqrt{n}$ is the standard error, $s$ is the standard deviation, and $n$ is the number of resamples (see Bootstrapping Methods). This can be done in Python using the scipy.stats package.
Python Code
import scipy.stats as st
import numpy as np
a_min, a_max = st.t.interval(0.95, df=len(intercepts)-1, loc=np.mean(intercepts), scale=st.sem(intercepts))
b_min, b_max = st.t.interval(0.95, df=len(slopes)-1, loc=np.mean(slopes), scale=st.sem(slopes))
print("95%% confidence interval for intercept: %1.2f < a < %1.2f " % (a_min, a_max) )
print("95%% confidence interval for slope: %1.2f < b < %1.2f " % (b_min, b_max) )
The resulting output will look like this:
95% confidence interval for intercept: 4.76 < a < 5.13
95% confidence interval for slope: 14.13 < b < 14.85
Thus, we expect the true value of the intercept to lie between 4.76 and 5.13 and the true value of the slope to lie between 14.13 and 14.85, each with 95% confidence.
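The same intervals can be approximated by hand from the bootstrap summary statistics. This is a minimal sketch that assumes the a_bar, a_std, b_bar, b_std, and intercepts variables from the earlier code are still in memory; it uses the normal-approximation critical value 1.96 instead of the exact t value, so the endpoints may differ slightly.
Python Code
import math
n = len(intercepts)              # number of bootstrap resamples (50 here)
se_a = a_std / math.sqrt(n)      # standard error of the bootstrapped intercepts
se_b = b_std / math.sqrt(n)      # standard error of the bootstrapped slopes
print("Intercept: %1.2f < a < %1.2f" % (a_bar - 1.96 * se_a, a_bar + 1.96 * se_a))
print("Slope: %1.2f < b < %1.2f" % (b_bar - 1.96 * se_b, b_bar + 1.96 * se_b))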
Multiple Regression Techniques
In a simple linear regression, one variable $x$ serves as the predictor for the output variable $y$. Many problems in data science, however, go well beyond a single predictor. There may be dozens of features in the data that work together to predict the output. In this context, we may use multiple regression techniques, which use more than one input variable. In this section, we focus on two main concepts: multiple linear regression and multiple logistic regression.
Multiple Linear Regression
A model for multiple linear regression involves some number of predictor variables, which we will label $X_1, X_2, \ldots, X_k$. Here, we are using capital letters to distinguish these feature variables from data values like $x_i$ and $y_i$. The multiple regression model takes the form
$$y = b + a_1 X_1 + a_2 X_2 + \cdots + a_k X_k$$
In the equation, $b$ is the intercept (the value of $y$ when all $X_i = 0$), and the values $a_1, a_2, \ldots, a_k$ are the coefficients of the predictors, which function much like the slope in the simple linear regression model. A positive $a_i$ indicates that the feature measured by $X_i$ has a positive correlation with $y$, while a negative $a_i$ indicates a negative correlation. More sophisticated models also introduce an error term, $\epsilon$, to account for unexplained variability in $y$.
The goal of multiple linear regression is to determine (or estimate) values of the parameters $b$ and $a_1, \ldots, a_k$ that minimize the total squared vertical distance between the actual and predicted outputs. There are well-known formulas for this task, and typically the regression is handled by a computer.
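For the curious, those well-known formulas amount to solving a least-squares problem in matrix form. Here is a minimal sketch on made-up data (the numbers are arbitrary), using numpy's least-squares solver rather than writing out the normal equations by hand.
Python Code
import numpy as np
# Made-up data: five observations of two features (X1, X2) and an output y
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([6.0, 7.0, 13.0, 14.0, 19.0])
# Prepend a column of ones so the intercept b is estimated along with a1 and a2
X_design = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b, a1, a2 = params
print("b = %.3f, a1 = %.3f, a2 = %.3f" % (b, a1, a2))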
There are certain assumptions that need to be checked in order to use multiple regression. The first two are about the nature of the input and output data itself. Generally, these data must be continuous.
- The output variable, $y$, should be continuous, not discrete. However, regression could still be used if the output is a discrete variable with a very large, regular range of values (such as the number of people who live in a given country, which could be any positive integer).
- The input variables, $X_1, X_2, \ldots, X_k$, should be continuous, not discrete. However, the same comment as in (1) applies here: if a variable has a very large, regular range of possible values, then regression should work fine.
The next two assumptions ensure that multiple regression would provide the best model. If these assumptions are not met, then a more flexible model would be required.
- There should be a linear relationship between $y$ and each $X_i$, regarding the other variables as constant. This can be checked by running a simple linear regression on the pair $(X_i, y)$ for each $i$ and verifying linearity.
- The data should not have multicollinearity. Multicollinearity means that at least two of the input variables, $X_i$ and $X_j$, are highly correlated with one another, in which case one (or more) of the variables can be dropped. (A quick check using a correlation matrix is sketched below.)
Finally, there are a number of technical assumptions that should be verified, including checking for outliers, making sure that the variance is more or less uniform at all parts of the domain (this is known as homoscedasticity), and analyzing the residuals to determine if they are fairly independent and normally distributed. More advanced analysis might result in the determination that some variables are not significant in the model. These further considerations fall outside the scope of this text, however.
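As an illustration of checking for multicollinearity, the following minimal sketch computes the pairwise correlation matrix for a few candidate features from the NCAA dataset used in this section; treating correlations near +1 or -1 as a warning sign is a common rule of thumb, not a hard cutoff.
Python Code
import pandas as pd
# Load the dataset used throughout this section and pick a few candidate features
ncaa = pd.read_csv('NCAA-2021-stats.csv')
candidates = ncaa[['BARTHAG', 'WAB', 'TOR']].dropna()
# Pairwise correlations; values near +1 or -1 suggest redundant (collinear) features
print(candidates.corr())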
Multiple Linear Regression in Python
In this section, we work through the Python code for multiple linear regression using the dataset NCAA-2021-stats.csv. As in Machine Learning in Regression Analysis, we would like to predict the number of wins (W), but now we will include more than one predictor feature. In the first code block, we will use BARTHAG and WAB (which is known as wins above bubble, a measure of how many more wins this team has in comparison to a typical team that just makes it into the tournament on a similar schedule) to predict W.
Python Code
# Import libraries
import pandas as pd ## for dataset management
import matplotlib.pyplot as plt ## for data visualization
from sklearn.linear_model import LinearRegression
# Read data
data = pd.read_csv('NCAA-2021-stats.csv')
data = data.drop(['SEED'], axis=1).dropna() ## Remove the SEED column
X = data[['BARTHAG', 'WAB']] # Feature list
y = data[['W']] # Response variable
# Create the multiple linear regression model
model = LinearRegression()
model.fit(X, y)
b = model.intercept_[0]
a1, a2 = model.coef_[0]
print("Intercept: ", round(b,3))
print("Coefficients: ", [round(a1,3), round(a2,3)])
print("Score: ", round(model.score(X,y),2))
The resulting output will look like this:
Intercept: 12.868
Coefficients: [4.794, 0.545]
Score: 0.58
According to the output, the multiple regression equation for this model is:
$$W = 12.868 + 4.794 \times \mathrm{BARTHAG} + 0.545 \times \mathrm{WAB}$$
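For instance, the fitted model can produce a prediction for a hypothetical team, either with model.predict or by plugging values into the formula directly. The BARTHAG and WAB values below are made up for illustration, and this assumes the model object from the previous code block is still in memory.
Python Code
# Hypothetical team with BARTHAG = 0.75 and WAB = 2.0
new_team = pd.DataFrame({'BARTHAG': [0.75], 'WAB': [2.0]})
predicted_wins = model.predict(new_team)[0][0]
print("Predicted wins:", round(predicted_wins, 1))
# Same as evaluating 12.868 + 4.794*0.75 + 0.545*2.0 by hand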
We can see that both BARTHAG and WAB have positive coefficients, and so as either variable increases, there is a corresponding increase in W. The R-squared score indicates that about 58% of the total variance in W is explained jointly by BARTHAG and WAB. Let’s take a look at the data and see how well the points conform to the regression plane. First, here is the code to generate the scatterplot of points (BARTHAG, WAB, W) in three dimensions. (Make sure that you have already run the previous code beforehand.)
Python Code
# import libraries for graphing
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
X1 = X[['BARTHAG']]
X2 = X[['WAB']]
# Create a 3D figure
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')
# Scatterplot of actual data points
ax.set_xlabel('BARTHAG')
ax.set_ylabel('WAB', rotation=45)
ax.set_zlabel('W')
ax.set_title('Actual Values')
ax.set_box_aspect(aspect=None, zoom=0.85)
ax.scatter(X1, X2, y, c='b', marker='o')
The resulting output will look like this:
Next, let’s take a look at the predicted values found by plugging BARTHAG and WAB into the multiple regression formula discussed previously in the chapter. We also include the regression plane. The Python code required to create the regression plane is somewhat involved, requiring another package, numpy, to help create a mesh grid of points, so don’t worry too much if you don’t understand that part of the code.
Python Code
import numpy as np
# Create a 3D figure
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')
# Generate prediction values using the model
y_pred = model.predict(X)
# Create a grid of points for the regression plane
X1min = X1.values.min()
X1max = X1.values.max()
X2min = X2.values.min()
X2max = X2.values.max()
X1_space = np.linspace(X1min, X1max, 100)
X2_space = np.linspace(X2min, X2max, 100)
x1, x2 = np.meshgrid(X1_space, X2_space)
# Use the regression equation to find z-values
z = b + a1 * x1 + a2 * x2
# Plot the regression plane
ax.set_xlabel('BARTHAG')
ax.set_ylabel('WAB', rotation=45)
ax.set_zlabel('W')
ax.set_title('Multiple Linear Regression Plane and Predicted Values')
ax.set_box_aspect(aspect=None, zoom=0.85)
ax.scatter(X1, X2, y_pred, c='r', marker='x')
ax.plot_surface(x1, x2, z, alpha=0.3, cmap='plasma')
The resulting output will look like this:
It may not be entirely clear from the graph, but each red “x” is a point on the regression plane. To see how well the regression formula does at modeling the actual data, we will remove the predicted values and put in the actual values together with line segments showing how close those values are to the plane.
Python Code
# Create a 3D figure
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')
# Add line segments from actual data points to predicted points
for i in range(len(X1)):
    ax.plot([X1.values[i,0], X1.values[i,0]],
            [X2.values[i,0], X2.values[i,0]],
            [y.values[i,0], y_pred[i,0]], c='g', linewidth=1)
# Plot the regression plane
ax.set_xlabel('BARTHAG')
ax.set_ylabel('WAB', rotation=45)
ax.set_zlabel('W')
ax.set_title('Multiple Linear Regression')
ax.set_box_aspect(aspect=None, zoom=0.85)
ax.scatter(X1, X2, y, c='b', marker='o')
ax.plot_surface(x1, x2, z, alpha=0.3, cmap='plasma')
The resulting output will look like this:
Now, let’s see if we can improve the model by adding another feature: TOR (turnover percentage).
Python Code
X = data[['BARTHAG', 'WAB', 'TOR']] # Feature list
y = data[['W']] # Response variable
# Create the multiple linear regression model
model = LinearRegression()
model.fit(X, y)
b = model.intercept_[0]
a1, a2, a3 = model.coef_[0]
print("Intercept: ", round(b,3))
print("Coefficients: ", [round(a1,3), round(a2,3), round(a3,3)])
print("Score: ", round(model.score(X,y),2))
The resulting output will look like this:
Intercept: 16.116
Coefficients: [4.502, 0.526, -0.169]
Score: 0.59
The formula now expands to:
$$W = 16.116 + 4.502 \times \mathrm{BARTHAG} + 0.526 \times \mathrm{WAB} - 0.169 \times \mathrm{TOR}$$
The R-squared measure has increased slightly to 0.59, meaning that the new model explains a bit more of the variance. However, this small increase in predictive power comes at the cost of added model complexity. Simpler models tend to do better at avoiding overfitting in practice, so we will stick with the original two feature variables for this example.
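One way to judge whether the extra feature earns its added complexity is to compare the two models on data they were not trained on. This is a minimal sketch, assuming the data frame from the earlier code is still loaded; the 75/25 split and the random seed are arbitrary choices, so your scores may differ.
Python Code
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Split once so both models are scored on the same held-out data
train, test = train_test_split(data, test_size=0.25, random_state=1)
for features in (['BARTHAG', 'WAB'], ['BARTHAG', 'WAB', 'TOR']):
    m = LinearRegression().fit(train[features], train['W'])
    print(features, "held-out R-squared:", round(m.score(test[features], test['W']), 3))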
Multiple Logistic Regression
As the name suggests, multiple logistic regression is an extension of the single-variable logistic regression method to two or more input variables. The model estimates the probability of the dependent variable taking on a particular category, such as yes or no. Thus, the model is a binary classifier with multiple independent inputs. The standard multiple logistic regression model takes the form:
$$\mathrm{logit}(p) = \ln\left(\frac{p}{1-p}\right) = b + a_1 X_1 + a_2 X_2 + \cdots + a_k X_k$$
Here, the logit function is the same as described in section 6.2.1. Multiple logistic regression works just like single logistic regression in the sense that the logit function is used to transform the data, a (multiple) linear regression is performed, and the results are transformed back by a logistic function. Note that the features may be continuous numerical data or discrete data, including Boolean (1 or 0 for True/False or Yes/No).
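To make the transformation concrete, here is a minimal sketch of how a linear combination is converted back into a probability by the logistic (sigmoid) function; the coefficients and input values are made up for illustration.
Python Code
import math
# Hypothetical fitted parameters (intercept b and coefficients a1, a2) and inputs
b, a1, a2 = -6.0, 2.0, 0.5
X1, X2 = 3.2, 1.0                 # e.g., a GPA of 3.2 and In_State = 1
z = b + a1 * X1 + a2 * X2         # the linear combination, logit(p)
p = 1 / (1 + math.exp(-z))        # the logistic function recovers the probability
print("Estimated probability:", round(p, 3))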
Multiple Logistic Regression in Python
Revisit the dataset CollegeCompletionData.csv from The Confusion Matrix. It has two feature columns, “GPA” and “In_State.” Last time, we used only one feature to predict completion. It is just as easy to use multiple features!
Python Code
# Import libraries
import pandas as pd ## for dataset management
import matplotlib.pyplot as plt ## for data visualization
import seaborn as sns ## for heatmap visualization
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# Read data
data = pd.read_csv('CollegeCompletionData.csv').dropna()
X = data[['GPA','In_State']]
y = data['Completion']
# Build the logistic model
model = LogisticRegression()
model.fit(X,y)
# Generate model predictions
y_pred = model.predict(X)
# Generate the confusion matrix
cf = confusion_matrix(y, y_pred)
# Plot the heatmap using seaborn and matplotlib
sns.heatmap(cf, annot=True, fmt='d', cmap='Blues', cbar=True)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.yticks(rotation=0)
plt.title('Confusion Matrix')
plt.show()
The resulting output will look like:
Recall that the rows of the confusion matrix are labeled by actual values, while the columns are labeled by predicted values. Thus, there are 25 true positives (TP) and 23 true negatives (TN), which raises the accuracy, $(TP + TN)$ divided by the total number of observations, compared to the single-feature model. The number of false negatives (FN) dropped from 10 down to 5, while the number of false positives (FP) increased by 1, which is not a bad trade-off.
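The overall accuracy does not have to be read off the heatmap by hand; it can be computed directly from the predictions or from the confusion matrix produced in the previous code block (this assumes cf, y, and y_pred are still in memory).
Python Code
from sklearn.metrics import accuracy_score
# Accuracy = (TP + TN) / total number of observations
print("Total observations:", cf.sum())
print("Accuracy:", round(accuracy_score(y, y_pred), 3))
# Equivalent calculation from the matrix: the diagonal (TP + TN) over the total
print("From the matrix:", round(cf.trace() / cf.sum(), 3))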