
6.3 Machine Learning in Regression Analysis


Learning Outcomes

By the end of this section, you should be able to:

  • 6.3.1 Use bootstrapping to analyze variation in linear regression.
  • 6.3.2 Outline assumptions relevant to a multiple linear regression.
  • 6.3.3 Perform multiple linear regressions and analyze significance of coefficients.

Regression is a term that applies to many different techniques in data analysis and machine learning. The main idea is to model the relationship between one or more input (or independent) variables X and an output (or dependent) variable y by creating an explicit functional equation y = f(X). Once the equation is found, new inputs can be plugged into it, and the results serve as predictions. The types of regression are distinguished by the form of the equation that is generated: linear, polynomial, logistic, exponential, and so on. Most of these regression analyses may be done with a single input or with multiple input variables. Some of these topics have already been discussed in this textbook, so in this chapter we delve into a more advanced analysis. First we explore bootstrapping as a way to learn more about the reliability and variability of the parameters of a linear regression. Then we discuss multiple linear and logistic regressions, including how to perform these tasks in Python.
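
To make the fit-then-predict workflow concrete, here is a minimal sketch that fits a line to a few made-up points and then predicts the output for a new input. The data values below are invented purely for illustration and are not related to the examples later in this section.

Python Code

      # A minimal fit-then-predict sketch on made-up data (values are illustrative only)
      import numpy as np
      from sklearn.linear_model import LinearRegression

      X_train = np.array([[1.0], [2.0], [3.0], [4.0]])   # inputs, one feature per row
      y_train = np.array([2.1, 3.9, 6.2, 8.1])           # observed outputs

      model = LinearRegression()
      model.fit(X_train, y_train)            # learn the equation y = a + b*x

      X_new = np.array([[5.0]])              # a new input
      print(model.predict(X_new))            # predicted output for x = 5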

Linear Regression and Bootstrapping

The fundamentals of linear regression were presented in Inferential Statistics and Regression Analysis. While linear regression is a very powerful tool, it may suffer from inaccuracies due to noisy data. How confident can we be in the parameters of a linear model? In this section we use a resampling technique called bootstrapping, aimed at estimating variation, to build a linear regression model whose parameters come with specified confidence intervals. Recall that we defined bootstrapping in Bootstrapping Methods as a technique in which repeated samples are taken hundreds or thousands of times and then the sample mean is calculated for each sample.

Bootstrapping Fundamentals

Recall, for a set of bivariate data—that is, a set of points X = {(xi, yi)} in the plane—the linear regression or line of best fit is a line with equation y = a + bx, in which the values of a and b are chosen to minimize the sum of squared vertical distances from each yi to the predicted value, ŷi = a + bxi. The values of a and b clearly depend on the points of the dataset X, and so these parameters are just our best-guess estimates for the “true” values of a and b. Perhaps a different sampling of X would yield slightly different values of a and b. We may not have the luxury of resampling from the population, and so we do the next best thing: resample from the set X itself! This process is called bootstrapping because you are attempting to “pull yourself up by your own bootstraps,” as the old saying goes. Here is an outline of the process.

  1. Choose a large number N of samples taken from the dataset X, with replacement (so repetitions of data points are allowed).
  2. Find the linear regression for this set of N samples. Record the a (intercept) and b (slope) values into lists A and B, respectively.
  3. Repeat steps 1 and 2 many times.
  4. The sets A and B represent distributions of possible values of the parameters a and b for the true linear fit, y = a + bx.
  5. Let a = MEAN(A) and b = MEAN(B). These are the parameters we will choose for our (bootstrapped) linear regression, y = a + bx.
  6. Find the standard deviations of A and B as well, in order to produce a confidence interval for each parameter.

These steps will next be illustrated with the help of Python.
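
Before working through the NCAA example in the next subsection, here is a compact sketch that wraps steps 1 through 6 into a single reusable function. The function name, the synthetic data, and the choice of 1,000 repetitions are illustrative assumptions rather than part of the example that follows.

Python Code

      # Sketch: bootstrap the intercept and slope of a simple linear regression.
      # The function name, sample counts, and synthetic data are illustrative.
      import numpy as np

      def bootstrap_linear_fit(x, y, n_samples=100, n_repeats=1000, seed=0):
          rng = np.random.default_rng(seed)
          A, B = [], []                                        # lists of intercepts and slopes
          for _ in range(n_repeats):
              idx = rng.integers(0, len(x), size=n_samples)    # resample with replacement
              b, a = np.polyfit(x[idx], y[idx], 1)             # slope and intercept of best fit
              A.append(a)
              B.append(b)
          return np.mean(A), np.std(A, ddof=1), np.mean(B), np.std(B, ddof=1)

      # Tiny demonstration on synthetic data with true intercept 5 and slope 15
      x = np.linspace(0, 1, 200)
      y = 5 + 15 * x + np.random.default_rng(1).normal(0, 2, size=200)
      print(bootstrap_linear_fit(x, y))   # (mean a, st.dev. a, mean b, st.dev. b)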

Linear Regression with Bootstrapping in Python

In NCAA Division I basketball, a statistic called BARTHAG provides a measure of many different factors that affect a team’s chances of winning against an average team in the same division. This statistic (along with many others) can be found in the dataset NCAA-2021-stats.csv. Let X be the dataset consisting of pairs (BARTHAG, W), where W stands for the number of wins in 2021. There are 346 data points in this set. Here is the code that produces the scatterplot and regression line.

Python Code

      # Import libraries
      import pandas as pd  ## for dataset management
      import matplotlib.pyplot as plt ## for data visualization
      from sklearn.linear_model import LinearRegression
      
      # Read data
      data = pd.read_csv('NCAA-2021-stats.csv')
      data = data[['BARTHAG', 'W']].dropna()
      
      # Find the linear regression model
      model = LinearRegression()
      x = data['BARTHAG'].values.reshape(-1,1)
      y = data['W'].values.reshape(-1,1)
      model.fit(x,y)
      y_pred = model.predict(x)
      
      # Visualizing the linear regression on the scatterplot
      plt.scatter(x,y)
      plt.plot(x, y_pred, color='red')
      plt.show()
      
      a = model.intercept_[0]
      b = model.coef_[0][0]
      print("Linear model: y = %1.3f + %2.3f x" % (a, b))
    

The resulting output will look like this:

A scatterplot with 346 blue data points. The X axis ranges from 0 to 1 at increments of 0.2. The Y axis ranges from 0 to 25 at increments of 5. A red regression line increases from left to right from about 5 to 20.
    Linear model: y = 4.820 + 14.682 x
    

Then, the set is resampled 50 times, using 100 data points each time. A linear regression is run each time, producing a set of 50 values of a and 50 values of b. All 50 regression lines are shown on the same graph.

Python Code

      from sklearn.utils import resample

      data.plot.scatter(x='BARTHAG', y='W')
      
      # Build arrays for slopes and intercepts
      slopes = []
      intercepts = []
      
      # Resampling 100 points 50 times
      for i in range(50):
        data1 = resample(data, n_samples=100, replace=True)
      
        x = data1['BARTHAG'].values.reshape(-1,1)
        y = data1['W'].values.reshape(-1,1)
        model.fit(x,y)
        y_pred = model.predict(x)
      
        slopes.append(model.coef_[0][0])
        intercepts.append(model.intercept_[0])
      
        plt.plot(x, y_pred, color='red')
      
      plt.show()
    

The resulting output will look like this (results for bootstrapping models may slightly vary):

A scatterplot with 346 blue data points. The X axis ranges from 0 to 1 at increments of 0.2. The Y axis ranges from 0 to 25 at increments of 5. Fifty red regression lines increasing from left to right from about 5 to 20 represent the data being resampled 50 times, using 100 data points each time.

Note how all the lines are very close to one another, but there is some variation. The mean and standard deviation of intercepts and slopes may be found using the statistics library in Python.

Python Code

      import statistics as stats # for mean and standard deviation

      a_bar = stats.mean(intercepts)
      a_std = stats.stdev(intercepts)
      b_bar = stats.mean(slopes)
      b_std = stats.stdev(slopes)
      
      print("Intercept: Mean = %1.3f, St.Dev. = %1.3f " % (a_bar, a_std) )
      print("Slope: Mean = %1.3f, St.Dev. = %1.3f " % (b_bar, b_std) )
    

The resulting output will look like this:

Intercept: Mean = 4.947, St.Dev. = 0.645
Slope: Mean = 14.489, St.Dev. = 1.267

Recall that, before bootstrapping, we obtained a linear model with intercept 4.820 and slope 14.682. These values are very close to the intercept and slope obtained from the bootstrapping method. The advantage of bootstrapping, though, is that we now have measures of spread (standard deviations), and so we can produce a confidence interval for each parameter. Recall that the 95% confidence interval extends about 1.96σ/√N on either side of the mean, where σ is the standard deviation and N is the number of resampled estimates (see Bootstrapping Methods). This can be done in Python using the scipy.stats package.

Python Code

      import scipy.stats as st
      import numpy as np
      
      a_min, a_max = st.t.interval(0.95, df=len(intercepts)-1, loc=np.mean(intercepts), scale=st.sem(intercepts))
      b_min, b_max = st.t.interval(0.95, df=len(slopes)-1, loc=np.mean(slopes), scale=st.sem(slopes))
      
      print("95%% confidence interval for intercept: %1.2f < a < %1.2f " % (a_min, a_max) )
      print("95%% confidence interval for slope: %1.2f < b < %1.2f " % (b_min, b_max) )
    

The resulting output will look like this:

95% confidence interval for intercept: 4.76 < a < 5.13
95% confidence interval for slope: 14.13 < b < 14.85

Thus, we are 95% confident that the true value of the intercept lies between 4.76 and 5.13, and that the true value of the slope lies between 14.13 and 14.85.
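
As a quick sanity check, the same intervals can be recomputed directly from the mean plus or minus 1.96σ/√N formula; the scipy call above uses a t critical value instead of 1.96, so the endpoints may differ very slightly. The sketch below assumes the intercepts and slopes lists from the bootstrapping code are still in memory.

Python Code

      # Manual check of the 95% confidence intervals using mean ± 1.96*sigma/sqrt(N)
      # (assumes the `intercepts` and `slopes` lists from the bootstrapping code above)
      import math
      import statistics as stats

      N = len(intercepts)
      half_a = 1.96 * stats.stdev(intercepts) / math.sqrt(N)
      half_b = 1.96 * stats.stdev(slopes) / math.sqrt(N)

      print("Intercept: %1.2f +/- %1.2f" % (stats.mean(intercepts), half_a))
      print("Slope: %1.2f +/- %1.2f" % (stats.mean(slopes), half_b))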

Multiple Regression Techniques

In a simple linear regression, one variable (x) serves as the predictor for the output variable (y). Many problems in data science, however, go well beyond a single predictor. There may be dozens of features in the data that work together to predict the output. In this context, we may use multiple regression techniques, which use more than one input variable. In this section, we focus on two main concepts: multiple linear regression and multiple logistic regression.

Multiple Linear Regression

A model for multiple linear regression involves some number of predictor variables, which we will label X1, X2, ..., Xr. Here, we are using capital letters to distinguish these feature variables from data values like xi. The multiple regression model takes the form

y = a + b1X1 + b2X2 + ... + brXr

In the equation, a is the intercept (the value of y when all Xk = 0), and the values bk are the coefficients of the predictors, which function much like the slope b in the simple linear regression model. A positive bk indicates that the feature measured by Xk is positively associated with y, while a negative bk indicates a negative association. More sophisticated models also introduce an error term, E, to account for unexplained variability in y.

The goal of multiple linear regression is to determine (or estimate) values of the parameters a and bk (for k = 1, 2, ..., r) that minimize the sum of squared vertical distances between the actual and predicted outputs. There are well-known formulas for this task, and typically the regression is handled by a computer.
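
Although the fitting is normally left to a library, it can be instructive to see the least-squares problem solved directly. The sketch below uses NumPy’s least-squares solver on made-up data with two predictors; the data values are illustrative only.

Python Code

      # Solve the least-squares problem for y = a + b1*X1 + b2*X2 directly
      # (the data values here are made up for illustration)
      import numpy as np

      rng = np.random.default_rng(0)
      X1 = rng.uniform(0, 1, 50)
      X2 = rng.uniform(-5, 5, 50)
      y = 3 + 2.0 * X1 - 0.5 * X2 + rng.normal(0, 0.3, 50)

      # A column of ones lets the first coefficient play the role of the intercept a
      M = np.column_stack([np.ones_like(X1), X1, X2])
      coeffs, *_ = np.linalg.lstsq(M, y, rcond=None)
      a, b1, b2 = coeffs
      print(a, b1, b2)   # estimates should come out close to 3, 2.0, and -0.5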

There are certain assumptions that need to be checked in order to use multiple regression. The first two are about the nature of the input and output data itself. Generally, these data must be continuous.

  1. The output y should be continuous, not discrete. However, regression could still be used if the output is a discrete variable with a very large, regular range of values (such as the number of people who live in a given country, which could be any positive integer).
  2. The input variables Xk should be continuous, not discrete. However, the same comment as in (1) applies here: if a variable has a very large, regular range of possible values, then regression should work fine.

The next two assumptions ensure that multiple regression would provide the best model. If these assumptions are not met, then a more flexible model would be required.

  3. There should be a linear relationship between Xk and y for each k, regarding the other variables as constant. This can be checked by running simple linear regressions on the pair (Xk, y) for each k and verifying linearity.
  4. The data should not have multicollinearity. Multicollinearity means that at least two of the input variables Xj and Xk are highly correlated with one another, in which case one (or more) of the variables can be dropped.

Finally, there are a number of technical assumptions that should be verified, including checking for outliers, making sure that the variance is more or less uniform at all parts of the domain (this is known as homoscedasticity), and analyzing the residuals to determine if they are fairly independent and normally distributed. More advanced analysis might result in the determination that some variables are not significant in the model. These further considerations fall outside the scope of this text, however.
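
Some of these checks are straightforward to automate. The sketch below, which assumes a pandas DataFrame named data containing the feature columns and the response W (as in the NCAA example in the next subsection), screens for multicollinearity with a correlation matrix and plots residuals against predicted values to look for outliers and non-uniform variance.

Python Code

      # Rough checks for multicollinearity and residual behavior
      # (assumes a DataFrame `data` with the feature columns and the response 'W')
      import matplotlib.pyplot as plt
      from sklearn.linear_model import LinearRegression

      features = ['BARTHAG', 'WAB']   # candidate predictors (assumed column names)

      # Pairwise correlations between predictors: values near +1 or -1
      # suggest multicollinearity, so one of the variables could be dropped
      print(data[features].corr())

      # Fit the model and compute residuals (actual minus predicted)
      X = data[features]
      y = data['W']
      model = LinearRegression().fit(X, y)
      residuals = y - model.predict(X)

      # Residuals vs. predicted values: look for an even band around zero
      # (homoscedasticity) and for isolated extreme points (possible outliers)
      plt.scatter(model.predict(X), residuals)
      plt.axhline(0, color='red')
      plt.xlabel('Predicted W')
      plt.ylabel('Residual')
      plt.show()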

Multiple Linear Regression in Python

In this section, we work through the Python code for multiple linear regression using the dataset NCAA-2021-stats.csv. As in the bootstrapping example earlier in this section, we would like to predict the number of wins (W), but now we will include more than one predictor feature. In the first code block, we will use BARTHAG and WAB (known as wins above bubble, a measure of how many more wins a team has compared to a typical team that just makes it into the tournament on a similar schedule) to predict W.

Python Code

      # Import libraries
      import pandas as pd  ## for dataset management
      import matplotlib.pyplot as plt ## for data visualization
      from sklearn.linear_model import LinearRegression
      
      # Read data
      data = pd.read_csv('NCAA-2021-stats.csv')
      data = data.drop(['SEED'], axis=1).dropna() ## Remove the SEED column
      
      X = data[['BARTHAG', 'WAB']]  # Feature list
      y = data[['W']]  # Response variable
      
      # Create the multiple linear regression model
      model = LinearRegression()
      model.fit(X, y)
      
      b = model.intercept_[0]
      a1, a2 = model.coef_[0]
      
      print("Intercept: ", round(b,3))
      print("Coefficients: ", [round(a1,3), round(a2,3)])
      print("Score: ", round(model.score(X,y),2))
    

The resulting output will look like this:

Intercept:  12.868
Coefficients:  [4.794, 0.545]
Score:  0.58

According to the output, the multiple regression equation for this model is:

W = 12.868 + 4.794 × (BARTHAG) + 0.545 × (WAB)

We can see that both BARTHAG and WAB have positive coefficients, and so as either variable increases, there is a corresponding increase in W. The R-squared score indicates that about 58% of the total variance in W is explained jointly by BARTHAG and WAB. Let’s take a look at the data and see how well the points conform to the regression plane. First, here is the code to generate the scatterplot of points (BARTHAG, WAB, W) in three dimensions. (Make sure that you have already run the previous code.)

Python Code

      # import libraries for graphing
      import matplotlib.pyplot as plt
      from mpl_toolkits.mplot3d import Axes3D
      
      X1 = X[['BARTHAG']]
      X2 = X[['WAB']]
      
      # Create a 3D figure
      fig = plt.figure(figsize=(10, 6))
      ax = fig.add_subplot(111, projection='3d')
      
      # Scatterplot of actual data points
      ax.set_xlabel('BARTHAG')
      ax.set_ylabel('WAB', rotation=45)
      ax.set_zlabel('W')
      ax.set_title('Actual Values')
      ax.set_box_aspect(aspect=None, zoom=0.85)
      ax.scatter(X1, X2, y, c='b', marker='o')
    

The resulting output will look like this:

Three-dimensional scatterplot showing a cluster of blue dots representing data points. Axes are labeled “BARTHAG“ (0 to 1), “WAB” (-15 to 10), and “W” (0 to 25).

Next, let’s take a look at the predicted values found by plugging BARTHAG and WAB into the multiple regression formula discussed previously in the chapter. We also include the regression plane. The Python code required to create the regression plane is somewhat involved, requiring another package, numpy, to help create a mesh grid of (x, y) points, so don’t worry too much if you don’t understand that part of the code.

Python Code

      import numpy as np

      # Create a 3D figure
      fig = plt.figure(figsize=(10, 6))
      ax = fig.add_subplot(111, projection='3d')
      
      # Generate prediction values using the model
      y_pred = model.predict(X)
      
      # Create a grid of points for the regression plane
      X1min = X1.values.min()
      X1max = X1.values.max()
      X2min = X2.values.min()
      X2max = X2.values.max()
      X1_space = np.linspace(X1min, X1max, 100)
      X2_space = np.linspace(X2min, X2max, 100)
      x1, x2 = np.meshgrid(X1_space, X2_space)
      
      # Use the regression equation to find z-values
      z = b + a1 * x1 + a2 * x2
      
      # Plot the regression plane
      ax.set_xlabel('BARTHAG')
      ax.set_ylabel('WAB', rotation=45)
      ax.set_zlabel('W')
      ax.set_title('Multiple Linear Regression Plane and Predicted Values')
      ax.set_box_aspect(aspect=None, zoom=0.85)
      ax.scatter(X1, X2, y_pred, c='r', marker='x')
      ax.plot_surface(x1, x2, z, alpha=0.3, cmap='plasma')
    

The resulting output will look like this:

Three-dimensional scatterplot showing multiple linear regression model fit. Red x's represent data points, and a tilted plane represents the regression model. Axes are labeled “BARTHAG” (0 to 1), “WAB” (-15 to 10), and “W” (2.5 to 22.5).

It may not be entirely clear from the graph, but each red “x” is a point on the regression plane. To see how well the regression formula does at modeling the actual data, we will remove the predicted values and put in the actual values together with line segments showing how close those values are to the plane.

Python Code

      # Create a 3D figure
      fig = plt.figure(figsize=(10, 6))
      ax = fig.add_subplot(111, projection='3d')
      
      # Add line segments from actual data points to predicted points
      for i in range(len(X1)):
          ax.plot([X1.values[i,0], X1.values[i,0]], 
           [X2.values[i,0], X2.values[i,0]], 
            [y.values[i,0], y_pred[i,0]], c='g', linewidth=1)
      
      # Plot the regression plane
      ax.set_xlabel('BARTHAG')
      ax.set_ylabel('WAB', rotation=45)
      ax.set_zlabel('W')
      ax.set_title('Multiple Linear Regression')
      ax.set_box_aspect(aspect=None, zoom=0.85)
      ax.scatter(X1, X2, y, c='b', marker='o')
      ax.plot_surface(x1, x2, z, alpha=0.3, cmap='plasma')
    

The resulting output will look like this:

Three-dimensional scatterplot showing multiple linear regression model fit. Purple dots represent data points, and a tilted plane represents the regression model. Axes are labeled “BARTHAG” (0 to 1), “WAB” (-15 to 10), and “W” (0 to 25).

Now, let’s see if we can improve the model by adding more features. As an example, let’s add TOR (turnover percentage).

Python Code

      X = data[['BARTHAG', 'WAB', 'TOR']]  # Feature list
      y = data[['W']]  # Response variable
      
      # Create the multiple linear regression model
      model = LinearRegression()
      model.fit(X, y)
      
      b = model.intercept_[0]
      a1, a2, a3 = model.coef_[0]
      
      print("Intercept: ", round(b,3))
      print("Coefficients: ", [round(a1,3), round(a2,3), round(a3,3)])
      print("Score: ", round(model.score(X,y),2))
    

The resulting output will look like this:

Intercept:  16.116
Coefficients:  [4.502, 0.526, -0.169]
Score:  0.59

The formula now expands to:

W = 16.116 + 4.502 × (BARTHAG) + 0.526 × (WAB) − 0.169 × (TOR)

The R-squared measure has increased slightly to 0.59, meaning that the new model explains a bit more of the variance. However, the small gain in prediction accuracy comes at the cost of added model complexity. Simpler models tend to do better at avoiding overfitting in practice, so we will stick with the original two feature variables for this example.
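
One quick way to probe whether an extra feature genuinely improves prediction, rather than just fitting noise, is to compare the models on data held out from the fitting step. The sketch below assumes the data DataFrame from the code above; the 75/25 split and the random seed are arbitrary choices.

Python Code

      # Compare the two-feature and three-feature models on held-out data
      # (assumes `data` from the code above; split size and seed are arbitrary)
      from sklearn.model_selection import train_test_split
      from sklearn.linear_model import LinearRegression

      y = data['W']

      for features in (['BARTHAG', 'WAB'], ['BARTHAG', 'WAB', 'TOR']):
          X = data[features]
          X_train, X_test, y_train, y_test = train_test_split(
              X, y, test_size=0.25, random_state=0)
          model = LinearRegression().fit(X_train, y_train)
          print(features, "test R-squared:", round(model.score(X_test, y_test), 2))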

Multiple Logistic Regression

As the name suggests, multiple logistic regression is an extension of the single-variable logistic regression method to two or more input variables. The model estimates the probability of the dependent variable (y) taking on a particular category, such as yes or no. Thus, the model is a binary classifier with multiple independent inputs. The standard multiple logistic regression model takes the form:

logit(y) = a + b1X1 + b2X2 + ... + brXr

Here, the logit function is the same as described in section 6.2.1. Multiple logistic regression works just like single logistic regression in the sense that the logit function is used to transform the data, a (multiple) linear regression is performed, and the results are transformed back by a logistic function. Note that the features may be continuous numerical data or discrete data, including Boolean (1 or 0 for True/False or Yes/No).

Multiple Logistic Regression in Python

Revisit the dataset CollegeCompletionData.csv from The Confusion Matrix. It has two feature columns, “GPA” and “In_State.” Last time, we used only one feature to predict completion. It is just as easy to use multiple features!

Python Code

      # Import libraries
      import pandas as pd  ## for dataset management
      import matplotlib.pyplot as plt ## for data visualization
      import seaborn as sns ## for heatmap visualization
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import confusion_matrix
      
      # Read data
      data = pd.read_csv('CollegeCompletionData.csv').dropna()
      X = data[['GPA','In_State']]
      y = data['Completion']
      
      # Build the logistic model
      model = LogisticRegression()
      model.fit(X,y)
      
      # Generate model predictions
      y_pred = model.predict(X)
      
      # Generate the confusion matrix
      cf = confusion_matrix(y, y_pred)
      
      # Plot the heatmap using seaborn and matplotlib
      sns.heatmap(cf, annot=True, fmt='d', cmap='Blues', cbar=True)
      plt.xlabel('Predicted')
      plt.ylabel('True')
      plt.yticks(rotation=0)
      plt.title('Confusion Matrix')
      plt.show()
    

The resulting output will look like:

A heatmap of the confusion matrix for the two-feature model. The two rows represent the true classes and the two columns the predicted classes; the diagonal cells count correctly classified instances, and the off-diagonal cells count misclassified instances. Clockwise from the top left, the cells read 25, 5, 23, and 9, and a color scale on the right maps cell values from light blue (5) to dark blue (25).

Recall that the rows of the confusion matrix are labeled by actual values while the columns are labeled by predicted values. Thus, there are 25 true positives (TP) and 23 true negatives (TN), increasing the accuracy to 48/62 ≈ 77%. The number of false negatives (FN) dropped from 10 down to 5, but the number of false positives (FP) increased by 1, which is not a bad trade-off.
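
To connect the fitted classifier back to the logit equation given earlier, we can read the intercept and coefficients off the model and push a hypothetical student’s values through the logistic function. The GPA of 3.2 and in-state value of 1 below are made-up inputs, and the sketch assumes the model fitted in the code above is still in memory and that completion is coded as 1.

Python Code

      # Recover the logit equation from the fitted model and convert it to a
      # probability for a hypothetical student (GPA 3.2, in-state), assuming
      # the LogisticRegression `model` above and that completion is coded as 1
      import numpy as np
      import pandas as pd

      a = model.intercept_[0]
      b1, b2 = model.coef_[0]
      print("logit(y) = %1.3f + %1.3f*GPA + %1.3f*In_State" % (a, b1, b2))

      gpa, in_state = 3.2, 1                    # made-up example inputs
      logit = a + b1 * gpa + b2 * in_state
      prob = 1 / (1 + np.exp(-logit))           # logistic function
      print("Estimated completion probability:", round(prob, 3))

      # The same probability computed by sklearn (columns follow model.classes_)
      new_student = pd.DataFrame([[gpa, in_state]], columns=['GPA', 'In_State'])
      print(model.predict_proba(new_student))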
