
6.3 Machine Learning in Regression Analysis


Learning Outcomes

By the end of this section, you should be able to:

  • 6.3.1 Use bootstrapping to analyze variation in linear regression.
  • 6.3.2 Outline assumptions relevant to a multiple linear regression.
  • 6.3.3 Perform multiple linear regressions and analyze significance of coefficients.

Regression is a term that applies to many different techniques in data analysis and machine learning. The main idea is to model the relationship between one or more input (or independent) variables X and an output (or dependent) variable y by creating an explicit functional equation y = f(X). Once the equation is found, new inputs can be plugged into it, and the results serve as predictions. The types of regression are distinguished by the form of the equation that is generated: linear, polynomial, logistic, exponential, and so on. Most of these regression analyses may be done with a single input or with multiple input variables. Some of these topics have already been discussed in this textbook, so in this chapter we delve into a more advanced analysis. First we explore bootstrapping as a way to learn more about the reliability and variability of the parameters of a linear regression. Then we discuss multiple linear and logistic regressions, including how to perform these tasks in Python.
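
To make the fit-then-predict workflow concrete, here is a minimal sketch that fits a line to a few made-up points and then predicts the output for a new input. The data values below are invented purely for illustration and are not related to the examples later in this section.

Python Code

      # A minimal fit-then-predict sketch on made-up data (values are illustrative only)
      import numpy as np
      from sklearn.linear_model import LinearRegression

      X_train = np.array([[1.0], [2.0], [3.0], [4.0]])   # inputs, one feature per row
      y_train = np.array([2.1, 3.9, 6.2, 8.1])           # observed outputs

      model = LinearRegression()
      model.fit(X_train, y_train)            # learn the equation y = a + b*x

      X_new = np.array([[5.0]])              # a new input
      print(model.predict(X_new))            # predicted output for x = 5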

Linear Regression and Bootstrapping

The fundamentals of linear regression were presented in Inferential Statistics and Regression Analysis. While linear regression is a very powerful tool, it may suffer from inaccuracies due to noisy data. How confident can we be in the parameters of a linear model? In this section we use a resampling technique called bootstrapping, aimed at estimating variation, to build a linear regression model whose parameters come with specified confidence intervals. Recall that we defined bootstrapping in Bootstrapping Methods as a technique in which repeated samples are taken hundreds or thousands of times and then the sample mean is calculated for each sample.

Bootstrapping Fundamentals

Recall, for a set of bivariate data—that is, a set of points X = {(xi, yi)} in the plane—the linear regression or line of best fit is a line with equation y = a + bx, in which the values of a and b are chosen to minimize the sum of squared vertical distances from each yi to the predicted value, ŷi = a + bxi. The values of a and b clearly depend on the points of the dataset X, and so these parameters are just our best-guess estimates for the “true” values of a and b. Perhaps a different sampling of X would yield slightly different values of a and b. We may not have the luxury of resampling from the population, and so we do the next best thing: resample from the set X itself! This process is called bootstrapping because you are attempting to “pull yourself up by your own bootstraps,” as the old saying goes. Here is an outline of the process.

  1. Choose a large number N of samples taken from the dataset X, with replacement (so repetitions of data points are allowed).
  2. Find the linear regression for this set of N samples. Record the a (intercept) and b (slope) values into lists A and B, respectively.
  3. Repeat steps 1 and 2 many times.
  4. The sets A and B represent distributions of possible values of the parameters a and b for the true linear fit, y = a + bx.
  5. Let a = MEAN(A) and b = MEAN(B). These are the parameters we will choose for our (bootstrapped) linear regression, y = a + bx.
  6. Find the standard deviations of A and B as well, in order to produce a confidence interval for each parameter.

These steps will next be illustrated with the help of Python.
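
Before working through the NCAA example in the next subsection, here is a compact sketch that wraps steps 1 through 6 into a single reusable function. The function name, the synthetic data, and the choice of 1,000 repetitions are illustrative assumptions rather than part of the example that follows.

Python Code

      # Sketch: bootstrap the intercept and slope of a simple linear regression.
      # The function name, sample counts, and synthetic data are illustrative.
      import numpy as np

      def bootstrap_linear_fit(x, y, n_samples=100, n_repeats=1000, seed=0):
          rng = np.random.default_rng(seed)
          A, B = [], []                                        # lists of intercepts and slopes
          for _ in range(n_repeats):
              idx = rng.integers(0, len(x), size=n_samples)    # resample with replacement
              b, a = np.polyfit(x[idx], y[idx], 1)             # slope and intercept of best fit
              A.append(a)
              B.append(b)
          return np.mean(A), np.std(A, ddof=1), np.mean(B), np.std(B, ddof=1)

      # Tiny demonstration on synthetic data with true intercept 5 and slope 15
      x = np.linspace(0, 1, 200)
      y = 5 + 15 * x + np.random.default_rng(1).normal(0, 2, size=200)
      print(bootstrap_linear_fit(x, y))   # (mean a, st.dev. a, mean b, st.dev. b)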

Linear Regression with Bootstrapping in Python

In NCAA Division I basketball, a statistic called BARTHAG provides a measure of many different factors that affect a team’s chances of winning against an average team in the same division. This statistic (along with many others) can be found in the dataset NCAA-2021-stats.csv. Let X be the dataset consisting of pairs (BARTHAG, W), where W stands for the number of wins in 2021. There are 346 data points in this set. Here is the code that produces the scatterplot and regression line.

Python Code

      # Import libraries
      import pandas as pd  ## for dataset management
      import matplotlib.pyplot as plt ## for data visualization
      from sklearn.linear_model import LinearRegression
      
      # Read data
      data = pd.read_csv('NCAA-2021-stats.csv')
      data = data[['BARTHAG', 'W']].dropna()
      
      # Find the linear regression model
      model = LinearRegression()
      x = data['BARTHAG'].values.reshape(-1,1)
      y = data['W'].values.reshape(-1,1)
      model.fit(x,y)
      y_pred = model.predict(x)
      
      # Visualizing the linear regression on the scatterplot
      plt.scatter(x,y)
      plt.plot(x, y_pred, color='red')
      plt.show()
      
      a = model.intercept_[0]
      b = model.coef_[0][0]
      print("Linear model: y = %1.3f + %2.3f x" % (a, b))
    

The resulting output will look like this:

A scatterplot with 346 blue data points. The X axis ranges from 0 to 1 at increments of 0.2. The Y axis ranges from 0 to 25 at increments of 5. A red regression line increases from left to right from about 5 to 20.
    Linear model: y = 4.820 + 14.682 x
    

Then, the set is resampled 50 times, using 100 data points each time. A linear regression is run each time, producing a set of 50 values of a and 50 values of b. All 50 regression lines are shown on the same graph.

Python Code

      from sklearn.utils import resample

      data.plot.scatter(x='BARTHAG', y='W')
      
      # Build arrays for slopes and intercepts
      slopes = []
      intercepts = []
      
      # Resampling 100 points 50 times
      for i in range(50):
        data1 = resample(data, n_samples=100, replace=True)
      
        x = data1['BARTHAG'].values.reshape(-1,1)
        y = data1['W'].values.reshape(-1,1)
        model.fit(x,y)
        y_pred = model.predict(x)
      
        slopes.append(model.coef_[0][0])
        intercepts.append(model.intercept_[0])
      
        plt.plot(x, y_pred, color='red')
      
      plt.show()
    

The resulting output will look like this (results for bootstrapping models may slightly vary):

A scatterplot with 346 blue data points. The X axis ranges from 0 to 1 at increments of 0.2. The Y axis ranges from 0 to 25 at increments of 5. Fifty red regression lines increasing from left to right from about 5 to 20 represent the data being resampled 50 times, using 100 data points each time.

Note how all the lines are very close to one another, but there is some variation. The mean and standard deviation of intercepts and slopes may be found using the statistics library in Python.

Python Code

      import statistics as stats # for mean and standard deviation

      a_bar = stats.mean(intercepts)
      a_std = stats.stdev(intercepts)
      b_bar = stats.mean(slopes)
      b_std = stats.stdev(slopes)
      
      print("Intercept: Mean = %1.3f, St.Dev. = %1.3f " % (a_bar, a_std) )
      print("Slope: Mean = %1.3f, St.Dev. = %1.3f " % (b_bar, b_std) )
    

The resulting output will look like this:

Intercept: Mean = 4.947, St.Dev. = 0.645
Slope: Mean = 14.489, St.Dev. = 1.267

Recall that, before bootstrapping, we obtained a linear model with intercept 4.820 and slope 14.682. These values are very close to the intercept and slope obtained from the bootstrapping method. The advantage of bootstrapping, though, is that we now have measures of spread (standard deviations), and so we can produce a confidence interval for each parameter. Recall that the 95% confidence interval extends about 1.96σ/√N on either side of the mean, where σ is the standard deviation and N is the number of resampled estimates (see Bootstrapping Methods). This can be done in Python using the scipy.stats package.

Python Code

      import scipy.stats as st
      import numpy as np
      
      a_min, a_max = st.t.interval(0.95, df=len(intercepts)-1, loc=np.mean(intercepts), scale=st.sem(intercepts))
      b_min, b_max = st.t.interval(0.95, df=len(slopes)-1, loc=np.mean(slopes), scale=st.sem(slopes))
      
      print("95%% confidence interval for intercept: %1.2f < a < %1.2f " % (a_min, a_max) )
      print("95%% confidence interval for slope: %1.2f < b < %1.2f " % (b_min, b_max) )
    

The resulting output will look like this:

95% confidence interval for intercept: 4.76 < a < 5.13
95% confidence interval for slope: 14.13 < b < 14.85

Thus, we are 95% confident that the true value of the intercept lies between 4.76 and 5.13, and that the true value of the slope lies between 14.13 and 14.85.
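
As a quick sanity check, the same intervals can be recomputed directly from the mean plus or minus 1.96σ/√N formula; the scipy call above uses a t critical value instead of 1.96, so the endpoints may differ very slightly. The sketch below assumes the intercepts and slopes lists from the bootstrapping code are still in memory.

Python Code

      # Manual check of the 95% confidence intervals using mean ± 1.96*sigma/sqrt(N)
      # (assumes the `intercepts` and `slopes` lists from the bootstrapping code above)
      import math
      import statistics as stats

      N = len(intercepts)
      half_a = 1.96 * stats.stdev(intercepts) / math.sqrt(N)
      half_b = 1.96 * stats.stdev(slopes) / math.sqrt(N)

      print("Intercept: %1.2f +/- %1.2f" % (stats.mean(intercepts), half_a))
      print("Slope: %1.2f +/- %1.2f" % (stats.mean(slopes), half_b))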

Multiple Regression Techniques

In a simple linear regression, one variable (x) serves as the predictor for the output variable (y). Many problems in data science, however, go well beyond a single predictor. There may be dozens of features in the data that work together to predict the output. In this context, we may use multiple regression techniques, which use more than one input variable. In this section, we focus on two main concepts: multiple linear regression and multiple logistic regression.

Multiple Linear Regression

A model for multiple linear regression involves some number of predictor variables, which we will label X1, X2, ..., Xr. Here, we are using capital letters to distinguish these feature variables from data values like xi. The multiple regression model takes the form

y = a + b1X1 + b2X2 + ... + brXr

In the equation, a is the intercept (the value of y when all Xk = 0), and the values bk are the coefficients of the predictors, which function much like the slope b in the simple linear regression model. A positive bk indicates that the feature measured by Xk is positively associated with y, while a negative bk indicates a negative association. More sophisticated models also introduce an error term, E, to account for unexplained variability in y.

The goal of multiple linear regression is to determine (or estimate) values of the parameters a and bk (for k = 1, 2, ..., r) that minimize the sum of squared vertical distances between the actual and predicted outputs. There are well-known formulas for this task, and typically the regression is handled by a computer.
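
Although the fitting is normally left to a library, it can be instructive to see the least-squares problem solved directly. The sketch below uses NumPy’s least-squares solver on made-up data with two predictors; the data values are illustrative only.

Python Code

      # Solve the least-squares problem for y = a + b1*X1 + b2*X2 directly
      # (the data values here are made up for illustration)
      import numpy as np

      rng = np.random.default_rng(0)
      X1 = rng.uniform(0, 1, 50)
      X2 = rng.uniform(-5, 5, 50)
      y = 3 + 2.0 * X1 - 0.5 * X2 + rng.normal(0, 0.3, 50)

      # A column of ones lets the first coefficient play the role of the intercept a
      M = np.column_stack([np.ones_like(X1), X1, X2])
      coeffs, *_ = np.linalg.lstsq(M, y, rcond=None)
      a, b1, b2 = coeffs
      print(a, b1, b2)   # estimates should come out close to 3, 2.0, and -0.5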

There are certain assumptions that need to be checked in order to use multiple regression. The first two are about the nature of the input and output data itself. Generally, these data must be continuous.

  1. The output y should be continuous, not discrete. However, regression could still be used if the output is a discrete variable with a very large, regular range of values (such as the number of people who live in a given country, which could be any positive integer).
  2. The input variables Xk should be continuous, not discrete. However, the same comment as in (1) applies here: if a variable has a very large, regular range of possible values, then regression should work fine.

The next two assumptions ensure that multiple regression would provide the best model. If these assumptions are not met, then a more flexible model would be required.

  3. There should be a linear relationship between Xk and y for each k, regarding the other variables as constant. This can be checked by running simple linear regressions on the pair (Xk, y) for each k and verifying linearity.
  4. The data should not have multicollinearity. Multicollinearity means that at least two of the input variables Xj and Xk are highly correlated with one another, in which case one (or more) of the variables can be dropped.

Finally, there are a number of technical assumptions that should be verified, including checking for outliers, making sure that the variance is more or less uniform at all parts of the domain (this is known as homoscedasticity), and analyzing the residuals to determine if they are fairly independent and normally distributed. More advanced analysis might result in the determination that some variables are not significant in the model. These further considerations fall outside the scope of this text, however.
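
Some of these checks are straightforward to automate. The sketch below, which assumes a pandas DataFrame named data containing the feature columns and the response W (as in the NCAA example in the next subsection), screens for multicollinearity with a correlation matrix and plots residuals against predicted values to look for outliers and non-uniform variance.

Python Code

      # Rough checks for multicollinearity and residual behavior
      # (assumes a DataFrame `data` with the feature columns and the response 'W')
      import matplotlib.pyplot as plt
      from sklearn.linear_model import LinearRegression

      features = ['BARTHAG', 'WAB']   # candidate predictors (assumed column names)

      # Pairwise correlations between predictors: values near +1 or -1
      # suggest multicollinearity, so one of the variables could be dropped
      print(data[features].corr())

      # Fit the model and compute residuals (actual minus predicted)
      X = data[features]
      y = data['W']
      model = LinearRegression().fit(X, y)
      residuals = y - model.predict(X)

      # Residuals vs. predicted values: look for an even band around zero
      # (homoscedasticity) and for isolated extreme points (possible outliers)
      plt.scatter(model.predict(X), residuals)
      plt.axhline(0, color='red')
      plt.xlabel('Predicted W')
      plt.ylabel('Residual')
      plt.show()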

Multiple Linear Regression in Python

In this section, we work through the Python code for multiple linear regression using the dataset NCAA-2021-stats.csv. As in the bootstrapping example earlier in this section, we would like to predict the number of wins (W), but now we will include more than one predictor feature. In the first code block, we will use BARTHAG and WAB (known as wins above bubble, a measure of how many more wins a team has compared to a typical team that just makes it into the tournament on a similar schedule) to predict W.

Python Code

      # Import libraries
      import pandas as pd  ## for dataset management
      import matplotlib.pyplot as plt ## for data visualization
      from sklearn.linear_model import LinearRegression
      
      # Read data
      data = pd.read_csv('NCAA-2021-stats.csv')
      data = data.drop(['SEED'], axis=1).dropna() ## Remove the SEED column
      
      X = data[['BARTHAG', 'WAB']]  # Feature list
      y = data[['W']]  # Response variable
      
      # Create the multiple linear regression model
      model = LinearRegression()
      model.fit(X, y)
      
      b = model.intercept_[0]
      a1, a2 = model.coef_[0]
      
      print("Intercept: ", round(b,3))
      print("Coefficients: ", [round(a1,3), round(a2,3)])
      print("Score: ", round(model.score(X,y),2))
    

The resulting output will look like this:

Intercept:  12.868
Coefficients:  [4.794, 0.545]
Score:  0.58

According to the output, the multiple regression equation for this model is:

W = 12.868 + 4.794 × (BARTHAG) + 0.545 × (WAB)

We can see that both BARTHAG and WAB have positive coefficients, and so as either variable increases, there is a corresponding increase in W. The R-squared score indicates that about 58% of the total variance in W is explained jointly by BARTHAG and WAB. Let’s take a look at the data and see how well the points conform to the regression plane. First, here is the code to generate the scatterplot of points (BARTHAG, WAB, W) in three dimensions. (Make sure that you have already run the previous code.)

Python Code

      # import libraries for graphing
      import matplotlib.pyplot as plt
      from mpl_toolkits.mplot3d import Axes3D
      
      X1 = X[['BARTHAG']]
      X2 = X[['WAB']]
      
      # Create a 3D figure
      fig = plt.figure(figsize=(10, 6))
      ax = fig.add_subplot(111, projection='3d')
      
      # Scatterplot of actual data points
      ax.set_xlabel('BARTHAG')
      ax.set_ylabel('WAB', rotation=45)
      ax.set_zlabel('W')
      ax.set_title('Actual Values')
      ax.set_box_aspect(aspect=None, zoom=0.85)
      ax.scatter(X1, X2, y, c='b', marker='o')
    

The resulting output will look like this:

Three-dimensional scatterplot showing a cluster of blue dots representing data points. Axes are labeled “BARTHAG“ (0 to 1), “WAB” (-15 to 10), and “W” (0 to 25).

Next, let’s take a look at the predicted values found by plugging BARTHAG and WAB into the multiple regression formula discussed previously in the chapter. We also include the regression plane. The Python code required to create the regression plane is somewhat involved, requiring another package, numpy, to help create a mesh grid of (x, y) points, so don’t worry too much if you don’t understand that part of the code.

Python Code

      import numpy as np

      # Create a 3D figure
      fig = plt.figure(figsize=(10, 6))
      ax = fig.add_subplot(111, projection='3d')
      
      # Generate prediction values using the model
      y_pred = model.predict(X)
      
      # Create a grid of points for the regression plane
      X1min = X1.values.min()
      X1max = X1.values.max()
      X2min = X2.values.min()
      X2max = X2.values.max()
      X1_space = np.linspace(X1min, X1max, 100)
      X2_space = np.linspace(X2min, X2max, 100)
      x1, x2 = np.meshgrid(X1_space, X2_space)
      
      # Use the regression equation to find z-values
      z = b + a1 * x1 + a2 * x2
      
      # Plot the regression plane
      ax.set_xlabel('BARTHAG')
      ax.set_ylabel('WAB', rotation=45)
      ax.set_zlabel('W')
      ax.set_title('Multiple Linear Regression Plane and Predicted Values')
      ax.set_box_aspect(aspect=None, zoom=0.85)
      ax.scatter(X1, X2, y_pred, c='r', marker='x')
      ax.plot_surface(x1, x2, z, alpha=0.3, cmap='plasma')
    

The resulting output will look like this:

Three-dimensional scatterplot showing multiple linear regression model fit. Red x's represent data points, and a tilted plane represents the regression model. Axes are labeled “BARTHAG” (0 to 1), “WAB” (-15 to 10), and “W” (2.5 to 22.5).

It may not be entirely clear from the graph, but each red “x” is a point on the regression plane. To see how well the regression formula does at modeling the actual data, we will remove the predicted values and put in the actual values together with line segments showing how close those values are to the plane.

Python Code

      # Create a 3D figure
      fig = plt.figure(figsize=(10, 6))
      ax = fig.add_subplot(111, projection='3d')
      
      # Add line segments from actual data points to predicted points
      for i in range(len(X1)):
          ax.plot([X1.values[i,0], X1.values[i,0]], 
           [X2.values[i,0], X2.values[i,0]], 
            [y.values[i,0], y_pred[i,0]], c='g', linewidth=1)
      
      # Plot the regression plane
      ax.set_xlabel('BARTHAG')
      ax.set_ylabel('WAB', rotation=45)
      ax.set_zlabel('W')
      ax.set_title('Multiple Linear Regression')
      ax.set_box_aspect(aspect=None, zoom=0.85)
      ax.scatter(X1, X2, y, c='b', marker='o')
      ax.plot_surface(x1, x2, z, alpha=0.3, cmap='plasma')
    

The resulting output will look like this:

Three-dimensional scatterplot showing multiple linear regression model fit. Purple dots represent data points, and a tilted plane represents the regression model. Axes are labeled “BARTHAG” (0 to 1), “WAB” (-15 to 10), and “W” (0 to 25).

Now, let’s see if we can improve the model by adding more features. As an example, let’s add TOR (turnover percentage).

Python Code

      X = data[['BARTHAG', 'WAB', 'TOR']]  # Feature list
      y = data[['W']]  # Response variable
      
      # Create the multiple linear regression model
      model = LinearRegression()
      model.fit(X, y)
      
      b = model.intercept_[0]
      a1, a2, a3 = model.coef_[0]
      
      print("Intercept: ", round(b,3))
      print("Coefficients: ", [round(a1,3), round(a2,3), round(a3,3)])
      print("Score: ", round(model.score(X,y),2))
    

The resulting output will look like this:

Intercept:  16.116
Coefficients:  [4.502, 0.526, -0.169]
Score:  0.59

The formula now expands to:

W = 16.116 + 4.502 × (BARTHAG) + 0.526 × (WAB) − 0.169 × (TOR)

The R-squared measure has increased slightly to 0.59, meaning that the new model explains a bit more of the variance. However, the small gain in prediction accuracy comes at the cost of added model complexity. Simpler models tend to do better at avoiding overfitting in practice, so we will stick with the original two feature variables for this example.
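
One quick way to probe whether an extra feature genuinely improves prediction, rather than just fitting noise, is to compare the models on data held out from the fitting step. The sketch below assumes the data DataFrame from the code above; the 75/25 split and the random seed are arbitrary choices.

Python Code

      # Compare the two-feature and three-feature models on held-out data
      # (assumes `data` from the code above; split size and seed are arbitrary)
      from sklearn.model_selection import train_test_split
      from sklearn.linear_model import LinearRegression

      y = data['W']

      for features in (['BARTHAG', 'WAB'], ['BARTHAG', 'WAB', 'TOR']):
          X = data[features]
          X_train, X_test, y_train, y_test = train_test_split(
              X, y, test_size=0.25, random_state=0)
          model = LinearRegression().fit(X_train, y_train)
          print(features, "test R-squared:", round(model.score(X_test, y_test), 2))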

Multiple Logistic Regression

As the name suggests, multiple logistic regression is an extension of the single-variable logistic regression method to two or more input variables. The model estimates the probability of the dependent variable (y) taking on a particular category, such as yes or no. Thus, the model is a binary classifier with multiple independent inputs. The standard multiple logistic regression model takes the form:

logit(y) = a + b1X1 + b2X2 + ... + brXr

Here, the logit function is the same as described in section 6.2.1. Multiple logistic regression works just like single logistic regression in the sense that the logit function is used to transform the data, a (multiple) linear regression is performed, and the results are transformed back by a logistic function. Note that the features may be continuous numerical data or discrete data, including Boolean (1 or 0 for True/False or Yes/No).

Multiple Logistic Regression in Python

Revisit the dataset CollegeCompletionData.csv from The Confusion Matrix. It has two feature columns, “GPA” and “In_State.” Last time, we used only one feature to predict completion. It is just as easy to use multiple features!

Python Code

      # Import libraries
      import pandas as pd  ## for dataset management
      import matplotlib.pyplot as plt ## for data visualization
      import seaborn as sns ## for heatmap visualization
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import confusion_matrix
      
      # Read data
      data = pd.read_csv('CollegeCompletionData.csv').dropna()
      X = data[['GPA','In_State']]
      y = data['Completion']
      
      # Build the logistic model
      model = LogisticRegression()
      model.fit(X,y)
      
      # Generate model predictions
      y_pred = model.predict(X)
      
      # Generate the confusion matrix
      cf = confusion_matrix(y, y_pred)
      
      # Plot the heatmap using seaborn and matplotlib
      sns.heatmap(cf, annot=True, fmt='d', cmap='Blues', cbar=True)
      plt.xlabel('Predicted')
      plt.ylabel('True')
      plt.yticks(rotation=0)
      plt.title('Confusion Matrix')
      plt.show()
    

The resulting output will look like:

A heatmap of the confusion matrix for the two-feature model. The two rows represent the true classes and the two columns the predicted classes; the diagonal cells count correctly classified instances, and the off-diagonal cells count misclassified instances. Clockwise from the top left, the cells read 25, 5, 23, and 9, and a color scale on the right maps cell values from light blue (5) to dark blue (25).

Recall that the rows of the confusion matrix are labeled by actual values while the columns are labeled by predicted values. Thus, there are 25 true positives (TP) and 23 true negatives (TN), increasing the accuracy to 48/62 ≈ 77%. The number of false negatives (FN) dropped from 10 down to 5, but the number of false positives (FP) increased by 1, which is not a bad trade-off.
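
To connect the fitted classifier back to the logit equation given earlier, we can read the intercept and coefficients off the model and push a hypothetical student’s values through the logistic function. The GPA of 3.2 and in-state value of 1 below are made-up inputs, and the sketch assumes the model fitted in the code above is still in memory and that completion is coded as 1.

Python Code

      # Recover the logit equation from the fitted model and convert it to a
      # probability for a hypothetical student (GPA 3.2, in-state), assuming
      # the LogisticRegression `model` above and that completion is coded as 1
      import numpy as np
      import pandas as pd

      a = model.intercept_[0]
      b1, b2 = model.coef_[0]
      print("logit(y) = %1.3f + %1.3f*GPA + %1.3f*In_State" % (a, b1, b2))

      gpa, in_state = 3.2, 1                    # made-up example inputs
      logit = a + b1 * gpa + b2 * in_state
      prob = 1 / (1 + np.exp(-logit))           # logistic function
      print("Estimated completion probability:", round(prob, 3))

      # The same probability computed by sklearn (columns follow model.classes_)
      new_student = pd.DataFrame([[gpa, in_state]], columns=['GPA', 'In_State'])
      print(model.predict_proba(new_student))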
