Dr. Shaun V. Ault; Dr. Soohyun Nam Liao; Larry Musolino

Learning Outcomes

By the end of this section, you should be able to:

6.1.1 Summarize the differences between supervised and unsupervised learning in machine learning.
6.1.2 Describe the roles of training and testing datasets.
6.1.3 Use common measures of accuracy and error to determine the fitness of a model.
6.1.4 Explain the concepts of overfitting and underfitting and how these problems can adversely affect the model.

A machine learning (ML) model is a mathematical and computational model that attempts to find a relationship between input variables and output (response) variables of a dataset. The way that the model “learns” is by adjusting internal parameters until the model meets a certain level of accuracy (defined as a measure of how correct the model is when used to predict. We will have more to say on accuracy later). What sets apart a machine learning model from a typical mathematical model is that the user does not have full control over all the parameters used by the model. Moreover, there may be hundreds, thousands, or even millions of parameters that are used and adjusted within a given ML model—way more than a human could reasonably track.

The part of that dataset that is used for initial learning is the training set (or data). There is often a testing set (or data) as well, which (as the name implies) can be used to determine whether the model is accurate enough. As a general rule of thumb, about 60–80% of available data will be used for training and the remaining data used for testing. After a machine learning model is trained, it can be used to make predictions as new input data is fed in and the responses produced. There may be multiple rounds of training and testing before the ML model is ready to make predictions or decisions.

The machine learning life cycle typically follows these steps:

Problem formulation/identification: Clearly state the problem you want the model to solve.
Data collection and preparation: Collect relevant data and clean it for analysis.
Feature selection/engineering: Choose important aspects of the data for the model.
Model/algorithm selection: Select a suitable machine learning algorithm for the task.
Model/algorithm training: Feed in the training data to teach the model patterns in the data.
Model validation: Test how well the model predicts outcomes.
Model implementation: Put the model to work in real-world scenarios.
Performance monitoring and enhancement: Keep an eye on the model’s accuracy and update as needed.
Continuous improvement and refinement: Repeat and refine steps based on feedback and changes.

Throughout the process, be sure to keep ethical considerations in mind to ensure fairness, privacy, and transparency. For example, were the data collected with the proper permissions? Does the training data accurately reflect the same diversity of data that the algorithm will ultimately be used on? Currently there are serious questions as to the ethical implications of using facial recognition algorithms in law enforcement, as training images may have been gathered from people without their explicit consent, and the models created often reflect biases that can result in discriminatory outcomes, disproportionately affecting certain demographic groups, such as people of color or women.

Bias—error introduced by overall simplistic or overly rigid models that do not capture important features of the data—can be hard to identify! One example that is famous in data science circles (though it may be apocryphal) is the “Russian tank problem” (Emspak, 2016). As the story goes, the United States military trained an early ML model to distinguish between Russian and US tanks. The training set consisted of photos of many different kinds of tanks used by both countries. Although the ML model produced very accurate results on the training and test sets, it performed terribly out in the field. Apparently, most of the pictures of Russian tanks were low quality and blurry, while those of the US tanks were high quality. They had inadvertently trained their ML model to distinguish between low- and high-quality photos regardless which county’s tanks were depicted in them!

Supervised vs. Unsupervised Learning

Data may be labeled or unlabeled. Labeled data has specific labels, categories, or classes associated with each data point. For example, a dataset containing the height, weight, cholesterol levels, and various other health-related measures for patients at a clinic, along with an indication as to whether each is susceptible to heart disease or not, would be labeled data. The label is the value (yes or no) of the variable “susceptible to heart disease.” Unlabeled data does not have a specific label, category, or class associated with its data points. For example, in product marketing, we might collect data representing customer attributes such as age, income, and purchase history, and we may want to identify distinct groups of customers with similar characteristics without knowing ahead of time what those groups (labels) will be.

Supervised learning uses labeled data in its training and testing sets. Parameters are adjusted in the model based on the correctness of the results as compared with the known labels. Unsupervised learning trains on unlabeled data. Often, both methods will be used in the same project. Unsupervised learning might be employed as an initial step to find patterns in the dataset, allowing the data to be labeled in some way. Then supervised learning techniques may be applied to the labeled data as a training step. The model created in this way may then be used to determine the best labels for future (unlabeled) data.

Supervised Learning

Supervised learning is analogous to a student learning from an instructor. At first, the student may answer many questions incorrectly, but the instructor is there to correct the student each time. When the student gives (mostly) correct responses, there is evidence that the student has sufficiently learned the material and would be able to answer similar questions correctly out in the real world where it really matters.

The goal of supervised learning is to produce a model that gives a mapping from the set of inputs or features to a set of output values or labels (see Figure 6.2). The model is function $y = f (X)$ , where $X$ is the input array and $y$ is the output value or label. The way that it does this is through training and testing. When training, the algorithm creates a correspondence between input features and the output value or label, often going through many iterations of trial and error until a desired level of accuracy is achieved. The model is then tested on additional data, and if the accuracy remains high enough, then the model may be deployed for use in real-world applications. Otherwise, the model may need to be adjusted substantially or abandoned altogether.

A flowchart illustrating the machine learning model development process. The flowchart has four steps: Step 1: Gather Data: This is the starting point, where data is collected for training the model. Step 2: Train: The collected data is used to train the model. Step 3: Test: The trained model is evaluated on a separate dataset to assess its accuracy. Step 4: Deploy Model: If the model's accuracy is satisfactory, it is deployed for use. The flowchart also includes a feedback loop, indicating that the model may need to be retrained and retested if its performance on new data is not satisfactory.

Figure 6.2 The Supervised Learning Cycle. The supervised learning cycle consists of gathering data, training on some part of the data, testing on another part of the data, and deployment to solve problems about new data.

Examples of supervised learning models include linear regression (discussed in Correlation and Linear Regression Analysis), logistic regression, naïve Bayes classification, decision trees, random forests, and many kinds of neural networks. Linear regression may not seem like a machine learning algorithm because there are no correction steps. A formula simply produces the line of best fit. However, the regression formula itself represents an optimization that reduces the error in the values predicted by the regression line versus the actual data. If the linear regression model is not accurate enough, additional data could be collected to improve the accuracy of the model, and once the regression line is found, it may then be used to predict values that were not in the initial dataset. Linear regression has been developed in previous chapters, and you will see logistic regression, Bayes methods, decision trees, and random forests later in this chapter.

Unsupervised Learning

Unsupervised learning is like building a puzzle without having the benefit of a picture to compare with. It is not impossible to build a puzzle without the picture, but it may be more difficult and time-consuming.

The main advantage of unsupervised learning is that the training data does not need to be labeled. All the data is considered input, and the output would typically be information about the inherent structure of the data. Common tasks are to determine clusters, patterns, and shapes within the dataset. For example, if the current locations in latitude and longitude of every person in the world were somehow known and collected into a dataset, then an algorithm may discover locations of high densities of people—hence learn of the existence of cities!

Unsupervised learning algorithms include k-means clustering, DBScan, and some kinds of neural networks. More generally, most clustering algorithms and dimension-reduction methods such as principal component analysis (PCA) and topological data analysis (TDA) fall into the category of unsupervised learning. You will encounter the k-means and DBScan algorithms in this chapter. The more advanced topics of PCA and TDA are outside the scope of this text.

Variations and Hybrid Models

Although we will only cover supervised and unsupervised models, it is important to note that there are some additional models of machine learning.

When some of the data in a dataset has labels and some does not, a semi-supervised learning algorithm may be appropriate. The labeled data can be run through a supervised learning algorithm to generate a predictive model that can then be used to label the unlabeled data, called pseudo-data. From this point, the dataset can be regarded as completely labeled data, which can then be analyzed using supervised learning.

Reinforcement learning introduces rewards for accuracy while also penalizing incorrectness. This model most closely follows the analogy of a student-teacher relationship. Many algorithms in artificial intelligence use reinforcement learning. While reinforcement generally improves accuracy, there can be issues in locking the model into a rigid state. For example, have you ever noticed that when you search online for a type of product, you begin to see advertisements for those kinds of products everywhere—even after you’ve purchased it? It’s as if the algorithm has decided that the only product you ever wish to purchase is what you have just searched for!

Training and Testing a Model

How do we know how good a model is? The key is to compare its output against known output. In supervised learning models, there is always a metric that measures efficacy of the model. During the training phase, this metric helps to determine how to adjust the model to make it more accurate. During the testing phase, the metric will be used to evaluate the model and determine if it is good enough to use on other datasets to make predictions. In this section, we will explore the supervised learning cycle in more detail.

Model Building

Suppose we are given a dataset with inputs or features $X$ and labels or values $y$ . Our goal is to produce a function that maps $y = f (X)$ . In practice, we do not expect that our model can capture the function $f$ with 100% accuracy. Instead, we should find a model $\hat{f}$ such that predicted values $\hat{y} = \hat{f} (X)$ are close enough to $y$ . In most cases, the metric we use to determine the fitness of our model $\hat{f}$ is some measure of error between $\hat{y}$ and $y$ (e.g., see the calculation of residuals of a linear regression in Correlation and Linear Regression Analysis). The way that error is measured is specific to the type of model that we are building and the nature of the outputs. When the outputs are labels, we may use a count or percentage of misclassifications. When the outputs are numerical, we may find the total distances (absolute, squared, etc.) between predicted and actual output values.

First, the data are separated into a training set and testing set. Usually about 60–80% of the data is used for training with the remaining amount used for testing. While training, parameters may be adjusted to reduce error. The result of the training phase will be the model $\hat{f}$ . Then, the model is tested on the testing set and error computed between predicted values and known outputs. If the model meets some predefined threshold of accuracy (low enough error), then the model can be used with some confidence to make predictions about new data.

Measures of Accuracy

As mentioned previously, it is important to measure how good our machine learning models are. The choice of a specific measure of accuracy depends on the nature of the problem. Here are some common accuracy measures for different machine learning tasks:

Classification
- Accuracy is the ratio of correct predictions to the total number of predictions.
- Precision $(p)$ is the ratio of true positive predictions $(T P)$ to the total number of positive (true positive plus false positive) predictions $(T P + F P)$ . This is useful when the response variable is true/false and we want to minimize false positives. $p = \frac{T P}{T P + F P}$
- Recall $(r)$ is the ratio of true positive predictions $(T P)$ to the total number of actual positives (true positive plus false negative $(T P + F N)$ predictions. This is useful when the response variable is true/false and we want to minimize false negatives. $r = \frac{T P}{T P + F N}$
- F1 Score is a combination of precision $(p)$ and recall $(r)$ . $F 1 = \frac{2 (p) (r)}{p + r} = \frac{2 T P}{2 T P + F P + F N}$
Regression (comparison of the predicted values, y^iy^i, and the actual values, yiyi)
- Mean absolute error (MAE): $\frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |$
- Mean absolute percentage error (MAPE): $\frac{1}{n} \sum_{i = 1}^{n} | \frac{y_{i} - {\hat{y}}_{i}}{y_{i}} |$
- Mean squared error (MSE): $\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}$
- Root mean squared error (RMSE): $\sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}}$
- R-squared is the proportion of the variance in the response variable that is predictable from the input variable(s).

Note that MAE and RMSE, also discussed in Forecast Evaluation Methods, have the same units as the data itself. MSE and RMSE can be very sensitive to outliers because the differences are squared. MAPE (also introduced in Forecast Evaluation Methods) is often represented in percent form, which you can easily find by multiplying the raw MAPE score by 100%. The R-squared measure, which is discussed in Inferential Statistics and Regression Analysis in the context of linear regression, can be computed by statistical software packages, though we will not make significant use of it in this chapter. In all cases, the smaller the value of the error term, the more accurate the modeling would be.

Due to the tedious calculations involved in these measures, statistical software packages are typically used to compute them in practice. But it is important for every data scientist to understand how the formulas work so that results can be evaluated and interpreted correctly. Next, we will work out a small example by hand so you can have experience with the formulas.

Example 6.1

Problem

A test for COVID-19 was applied to 1,000 patients. Two hundred thirty-five tested positive for COVID-19, while the remaining 765 tested negative. Of those that tested positive, 198 turned out to carry the COVID-19 virus, and 37 did not have the virus. Of those that tested negative, 63 patients did in fact have the virus, while the remaining 702 did not. Compute the accuracy, precision, recall, and F1 scores for this experiment.

Solution

First, the total number of correct predictions were: 198 true positive predictions + 702 true negative predictions, for a total of $T P + T N = 900$ . So the accuracy is $\frac{900}{1,000} = 0.9 (90 %)$ .

For precision, we use $T P = 198$ and $F P$ (false positives) $= 37$ , and we have: $p = \frac{T P}{T P + F P} = \frac{198}{198 + 37} = 0.843$ (84.3%).

For recall, we use $T P = 198$ and $F N$ (false negatives) $= 63$ , and we have: $r = \frac{T P}{T P + F N} = \frac{198}{198 + 63} = 0.759$ (75.9%).

Finally, $F 1 = \frac{2 (p) (r)}{p + r} = \frac{2 (0.843) (0.759)}{0.843 + 0.759} = 0.799$ (79.9%).

Example 6.2

Problem

The score on an exam is positively correlated with the time spent studying for the exam. You have collected the following data (pairs of study time in hours and exam scores):

(9.8, 83.0), (8.2, 87.7), (5.2, 61.6), (8.0, 77.8), (2.1, 42.2), (6.8, 62.1), (2.3, 30.9), (9.5, 94.4), (6.6, 76.2), (9.5, 93.1)

Use 80% of the dataset to create a linear model and discuss the accuracy of the model on the remaining testing set.

Solution

Since there are 10 data points, we use 8 of them for training, as 8 is 80% of 10.

Training data: (9.8, 83.0), (8.2, 87.7), (5.2, 61.6), (8.0, 77.8), (2.1, 42.2), (6.8, 62.1), (2.3, 30.9), (9.5, 94.4)

The linear regression model that best fits the training data is $y = 21.7 + 7.1 x$ . (See Correlation and Linear Regression Analysis for details on linear regression.) We will compute MAE, MAPE, MSE, and RMSE for the remaining two data points (test set) as in Table 6.1.

x	Prediction	Actual Value	Difference (Error)
2.3	$21.7 + 7.1 (2.3) = 38.03$	30.9	-7.13
9.5	$21.7 + 7.1 (9.5) = 89.15$	94.4	5.25

Table 6.1 Test Set

$MAE = \frac{1}{2} (| - 7.13 | + | 5.25 |) = 6.19$
$MAPE = \frac{1}{2} (| \frac{- 7.13}{30.9} | + | \frac{5.25}{94.4} |) = 0.143$ , or 14.3%
MSE for cubic model: $\frac{1}{2} ({(- 7.13)}^{2} + {5.25}^{2}) = 39.2$
RMSE for cubic model: $\sqrt{39.2} = 6.26$

The MAE and RMSE both show that the model’s predictions are off by an average of about 6.2 points from actual values. The MAPE suggests that the model has an error of about 14.3% on average.

Overfitting and Underfitting

Can 100% accuracy be attained from a machine learning algorithm? In all but the most trivial exercises, the answer is a resounding no. Real-world data is noisy, often incomplete, and rarely amenable to a simple model. (Recall from Collecting and Preparing Data that we defined noisy data as data that contains errors, inconsistencies, or random fluctuations that can negatively impact the accuracy and reliability of data analysis and interpretation.) Two of the most common issues with machine learning algorithms are overfitting and underfitting.

Overfitting happens when the model is adjusted to fit the training data too closely, resulting in a complex model that is too specific to the training data. When such a model is given data outside of the training set, it may perform worse, a phenomenon known as high variance.

Underfitting occurs when the model is not robust enough to capture essential features of the data. This could be because of too small a dataset to train on, choosing a machine learning algorithm that is too rigid, or a combination of both. A model that suffers from underfitting may be said to have high bias.

Often, a model that has low variance will have high bias, and a model that has low bias will have high variance. This is known as the bias-variance trade-off.

Overfitting

It might at first seem desirable to produce a model that makes no mistakes on the training data; however, such “perfect” models are usually terrible at making predictions about data not in the training set. It is best to explain this idea by example.

Suppose we want to create a model for the relationship between x and y based on the given dataset,

S = (x, y) = (2, 10), (4, 9), (6, 6), (8, 7), (10, 4), (12, 5)

How would we go about creating the model? We might first split S into a training set, $S_{train} = {(2, 10), (4, 9), (6, 6), (8, 7)}$ , and testing set, $S_{test} = {(10, 4), (12, 5)}$ . It can be shown that the cubic equation, $\hat{y} = \frac{1}{8} x^{3} - \frac{7}{4} x^{2} + \frac{13}{2} x + 3$ , fits the training data perfectly. However, there is no reason to expect that the relationship between x and y is truly cubic. Indeed, the predictions of the cubic model on $S_{test}$ are terrible! See Table 6.2.

x	Prediction, Cubic Model	Actual Value, y
10	18	4
12	45	5

Table 6.2 Cubic Model Predictions

The linear model (linear regression) that represents a best fit for the first four data points is $\hat{y} = - 0. 6 x + 11$ . Figure 6.3 shows the data points along with the two models.

A line graph of cubic and linear models for a set of points. The Y axis ranges from 0 to 20. The X axis ranges from 0 to 14. A blue linear line declines from left to right from about 11 to 2.5. An orange line forming organic cubic curve, intersects blue linear line at four data points (2,10), (4, 9), (6, 6), (8, 7) .Two additional data points (10, 4) and (12, 5) exists on the linear line.

Figure 6.3 Data Points along a Cubic Model and a Linear Model. The four data points in

S_{train}

can be fitted onto a cubic curve (0% error), but it does poorly on additional data from the testing data,

S_{test}

. The cubic curve overfits the training data. The regression line does not fit the training data exactly; however, it does a better job predicting additional data.

To get a good sense of how much better the linear model is compared to the cubic model in this example, we will compute the MAPE and RMSE for each model, as shown in Table 6.3.

$x_{i}$	Actual Value, $y_{i}$	Prediction, $y_{i}$ , Cubic Model	$\| y_{i} - {\hat{y}}_{i} \|$ , Cubic Model	Prediction, ${\hat{y}}_{i}$ , Linear Model	$\| y_{i} - {\hat{y}}_{i} \|$ , Linear Model
2	10	10	0	9.8	0.2
4	9	9	0	8.6	0.4
6	6	6	0	7.4	1.4
8	7	7	0	6.2	0.8
10	4	18	14	5	1
12	5	45	40	3.8	1.2

Table 6.3 MAPE and RMSE Calculated for Each Model

MAPE for cubic model: $\frac{1}{6} (0 / 10 + 0 / 9 + 0 / 6 + 0 / 7 + 14 / 4 + 40 / 5) = 1.917$ , or $191.7 %$
RMSE for cubic model: $\sqrt{\frac{1}{6} (0^{2} + 0^{2} + 0^{2} + 0^{2} + 14^{2} + 40^{2})} = 17.3$
MAPE for linear model: $\frac{1}{6} (0.2 / 10 + 0.4 / 9 + 1.4 / 6 + 0.8 / 7 + 1 / 4 + 1.2 / 5) = 0.15$ , or $15 %$
RMSE for linear model: $\sqrt{\frac{1}{6} ({0.2}^{2} + {0.4}^{2} + {1.4}^{2} + {0.8}^{2} + 1^{2} + {1.2}^{2})} = 0.93$

The MAPE and RMSE values for the linear model are much lower than their respective values for the cubic model even though the latter predicts four values with 100% accuracy.

Underfitting

A model that is underfit may perform well on some input data and poorly on other data. This is because an underfit model fails to detect some important feature(s) of the dataset.

Example 6.3

Problem

A student recorded their time spent on each homework set in their statistics course. Each HW set was roughly similar in length, but the student noticed that they spent less and less time completing them as the semester went along. Find the linear regression model for the data in Table 6.4 and discuss its effectiveness in making further predictions.

HW Set	1	2	3	4	5	6
Time (hr)	9.8	5.3	3.4	2.6	1.9	1.7

Table 6.4 Homework Set Completion Times

Solution

There is a definitive downward trend in the data. A simple linear regression model for this data is:

\hat{y} = - 1.47143 x + 9.26667

However, the line of best fit does not seem to fit the data all that well. There is no way that a linear function can bend the way the data points seem to do in this example (see Figure 6.4).

A line graph labeled “time spend on HW problem sets.” The X axis is labeled “Problem Set” and ranges from 0 to 8. The Y axis is labeled “time spent (hours)” and ranges from -2 to 12. The graph contains a scatterplot of six data points with the linear model that best fits the data—a straight blue line that declines from 9 to -1. A blue line is fitted to the data, and the point (7, -1) is highlighted in orange.

Figure 6.4 Linear Model That Best Fits the Data in Example 6.3. This graph contains a scatterplot of the six data points together with the linear model that best fits the data. This is a clear case of underfitting.

Another issue with using the linear model for the data in Table 6.4 is that eventually the values of $\hat{y}$ become negative! According to this model, it will take about $- 1$ hours to complete the 7th homework set, which is absurd. Perhaps a decreasing exponential model, shown in Figure 6.5, would work better here. Proper model selection is the most effective way of dealing with underfitting.

A line graph labeled “time spend on HW problem sets (exponential model).” The X axis is labeled “Problem Set” and ranges from 0 to 8. The Y axis is labeled “time spent (hours)” and ranges from -2 to 12. The graph contains a scatterplot of six data points with the linear model that best fits the data—a curved blue line that declines from 11 to 1. The point (7, -1) is highlighted in orange

Figure 6.5 Exponential Model That Best Fits the Data in Example 6.3. This graph contains a scatterplot of the six data points together with the exponential model that best fits the data. Under this model, the prediction of

\hat{y} = 1

when

x = 7

seems much more reasonable.

$x_{i}$	Actual Value, $y_{i}$	Prediction, $y_{i}$ , Cubic Model	$\| y_{i} - {\hat{y}}_{i} \|$ , Cubic Model	Prediction, ${\hat{y}}_{i}$ , Linear Model	$\| y_{i} - {\hat{y}}_{i} \|$ , Linear Model
2	10	10	0	9.8	0.2
4	9	9	0	8.6	0.4
6	6	6	0	7.4	1.4
8	7	7	0	6.2	0.8
10	4	18	14	5	1
12	5	45	40	3.8	1.2

$x_{i}$	Actual Value, $y_{i}$	Prediction, $y_{i}$ , Cubic Model	$\| y_{i} - {\hat{y}}_{i} \|$ , Cubic Model	Prediction, ${\hat{y}}_{i}$ , Linear Model	$\| y_{i} - {\hat{y}}_{i} \|$ , Linear Model
2	10	10	0	9.8	0.2
4	9	9	0	8.6	0.4
6	6	6	0	7.4	1.4
8	7	7	0	6.2	0.8
10	4	18	14	5	1
12	5	45	40	3.8	1.2

6.1 What Is Machine Learning?

Learning Outcomes

Supervised vs. Unsupervised Learning

Supervised Learning

Unsupervised Learning

Variations and Hybrid Models

Training and Testing a Model

Model Building

Measures of Accuracy

Problem

Solution

Problem

Solution

Overfitting and Underfitting

Overfitting

Underfitting

Problem

Solution

$x_{i}$	Actual Value, $y_{i}$	Prediction, $y_{i}$ , Cubic Model	$\| y_{i} - {\hat{y}}_{i} \|$ , Cubic Model	Prediction, ${\hat{y}}_{i}$ , Linear Model	$\| y_{i} - {\hat{y}}_{i} \|$ , Linear Model
2	10	10	0	9.8	0.2
4	9	9	0	8.6	0.4
6	6	6	0	7.4	1.4
8	7	7	0	6.2	0.8
10	4	18	14	5	1
12	5	45	40	3.8	1.2