Learning Outcomes
By the end of this section, you should be able to:
- 6.5.1 Discuss the concept of a random forest as a bootstrapping method for decision trees.
- 6.5.2 Create a random forest model and use it to classify data.
- 6.5.3 Define conditional probability and explain prior probabilities for training datasets.
- 6.5.4 Produce a (multinomial) naïve Bayes classifier and use it to classify data.
- 6.5.5 Discuss the concept of a Gaussian naïve Bayes classifier.
- 6.5.6 Describe some methods for working with big data efficiently and effectively.
So far, we have encountered a number of different machine learning algorithms suited to particular tasks. Naturally, the realm of machine learning extends well beyond this small survey of methods. This section describes several more machine learning techniques, though in less detail. Methods such as random forests, naïve Bayes classifiers, and a refinement of the latter called Gaussian naïve Bayes, offer distinctive approaches to solving complex problems. Additionally, this section provides insights into strategies for handling vast datasets, presenting a comprehensive survey of methods tailored for the unique challenges posed by Big Data.
Random Forests
In Decision Trees, we constructed decision trees and discussed some methods, such as pruning, that would improve the reliability of predictions of the decision tree. For each classification problem, one tree is created that serves as the model for all subsequent purposes. But no matter how much care is taken in creating and pruning a decision tree, at the end of the day, it is a single model that may have some bias built into its very structure due to variations in the training set. If we used a different training set, then our decision tree may have come out rather differently.
In the real world, obtaining a sufficient amount of data to train a machine learning model may be difficult, time-consuming, and expensive. It may not be practical to go out and find additional training sets just to create and evaluate a variety of decision trees. A random forest, a classification algorithm that uses multiple decision trees, serves as a way to get around this problem. Similar to what we did for linear regressions in Machine Learning in Regression Analysis, the technique of bootstrap aggregating is employed. In this context, bootstrap aggregating (bagging) involves resampling from the same training dataset multiple times, creating a new decision tree each time. The individual decision trees that make up a random forest model are called weak learners. To increase the diversity of the weak learners, there is random feature selection, meaning that each tree is built using a randomly selected subset of the feature variables. For example, if we are trying to predict the presence of heart disease in a patient using age, weight, and cholesterol levels, then a random forest model will consist of various decision trees, some of which may use only age and weight, while others use age and cholesterol level, and still others may use weight alone. When classifying new data, the final prediction of the random forest is determined by “majority vote” among the individual decision trees. What’s more, the fact that individual decision trees use only a subset of the features means that the importance of each feature can be inferred from the accuracy of the decision trees that utilize that feature.
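To make the idea concrete before working through a full example, the following minimal sketch shows the basic workflow in scikit-learn. The patient data here are synthetic and purely illustrative; note that scikit-learn combines the trees by averaging their predicted class probabilities, which behaves like the majority vote described above.
Python Code
# A minimal sketch of a random forest classifier on synthetic patient data
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical features: age (years), weight (kg), cholesterol (mg/dL)
X = rng.normal(loc=[55, 80, 200], scale=[10, 15, 30], size=(200, 3))
# Hypothetical labels: 1 = heart disease present, 0 = absent
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 20, 200) > 160).astype(int)

# Each tree is trained on a bootstrap sample of the rows and considers
# a random subset of the features at each split
rf = RandomForestClassifier(n_estimators=100, max_features=2)
rf.fit(X, y)

# The forest's prediction combines the votes of all 100 trees
new_patient = [[60, 85, 240]]
print(rf.predict(new_patient))        # predicted class (0 or 1)
print(rf.predict_proba(new_patient))  # averaged class probabilities across trees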
Because of the complexity of this method, random forests will be illustrated in more detail by way of the following worked example in Python.
Suppose you are building a model that predicts daily temperatures based on such factors as the temperature yesterday and the day before, as well as forecasts from multiple sources, using the dataset temps.csv. Since there are a lot of input features, it will not be possible to visualize the data. Moreover, some features are likely not to contribute much to the predictions. How do we sort everything out? A random forest model may be just the right tool. The Python library sklearn has a module called sklearn.ensemble that is used to create the random forest model.
Python Code
# Import libraries
import pandas as pd ## for dataset management
from sklearn.model_selection import train_test_split
import sklearn.ensemble as ens
# Read input file
features = pd.read_csv('temps.csv').dropna()
# Use 'actual' as the response variable
labels = features['actual']
# Convert text data into numerical values
# This is called "one-hot" encoding
features = pd.get_dummies(features)
# The other columns are used as features
features = features.drop('actual', axis=1)
feature_list = list(features.columns)
# Split data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.25)
# Create random forest model
rf = ens.RandomForestRegressor(n_estimators=1000)
rf.fit(train_features, train_labels)
The resulting output will look like this:
RandomForestRegressor(n_estimators=1000)
The preceding Python code reads in the dataset temps.csv, identifying the column “actual” (actual temperatures each day) as the variable to be predicted. All the other columns are assumed to be features. Since the “week” column contains categorical (text) data, it needs to be converted to numerical data before any model can be set up. The method of one-hot encoding does the trick here. In one-hot encoding, each category is mapped onto a vector containing a single 1 corresponding to that category while all other categories are set to 0. In our example, each day of the week maps to a seven-dimensional vector as follows:
- Monday: [1, 0, 0, 0, 0, 0, 0]
- Tuesday: [0, 1, 0, 0, 0, 0, 0]
- Wednesday: [0, 0, 1, 0, 0, 0, 0]
- Thursday: [0, 0, 0, 1, 0, 0, 0]
- Friday: [0, 0, 0, 0, 1, 0, 0]
- Saturday: [0, 0, 0, 0, 0, 1, 0]
- Sunday: [0, 0, 0, 0, 0, 0, 1]
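The same encoding can be reproduced directly with pandas. Here is a tiny, self-contained illustration (the values are made up) using pd.get_dummies:
Python Code
import pandas as pd
# A toy illustration of one-hot encoding a categorical column
days = pd.DataFrame({'week': ['Mon', 'Tues', 'Wed', 'Mon']})
encoded = pd.get_dummies(days)
print(encoded)
# Produces one indicator column per category (week_Mon, week_Tues, week_Wed),
# with 1/True marking each row's category and 0/False elsewhere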
The dataset is split into training (75%) and testing (25%) sets, and the random forest model is trained on the training set, as seen in this Python code block:
Python Code
import numpy as np # to compute the mean absolute error
# Get predictions and compute the mean absolute error on the test set
predictions = rf.predict(test_features)
errors = abs(predictions - test_labels)
round(np.mean(errors),2)
The resulting output will look like this:
3.91
With a mean absolute error of only 3.91, predicted temperatures are off by about 4°F on average. Given that weather data are notoriously difficult to predict because of their high variance, the random forest seems to do a good job on the test set. Now let’s find out which features were most important in making predictions; this information is stored in the attribute rf.feature_importances_. In the following code block, most of the Python commands are used solely for sorting and formatting the results and can be safely skimmed.
Python Code
# Find the importance of each feature
importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance,2)) for feature, importance in zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
The resulting output will look like this:
Variable: temp_1 Importance: 0.62
Variable: average Importance: 0.19
Variable: forecast_acc Importance: 0.06
Variable: forecast_noaa Importance: 0.04
Variable: day Importance: 0.02
Variable: temp_2 Importance: 0.02
Variable: forecast_under Importance: 0.02
Variable: friend Importance: 0.02
Variable: month Importance: 0.01
Variable: year Importance: 0.0
Variable: week_Fri Importance: 0.0
Variable: week_Mon Importance: 0.0
Variable: week_Sat Importance: 0.0
Variable: week_Sun Importance: 0.0
Variable: week_Thurs Importance: 0.0
Variable: week_Tues Importance: 0.0
Variable: week_Wed Importance: 0.0
As we can see, the single best predictor of daily temperatures is the temperature on the previous day (“temp_1”), followed by the average temperature on this day in previous years (“average”). The forecasts from NOAA and ACC showed only minor importance, while all other feature variables were relatively useless in predicting current temperatures. It should come as no surprise that the day of the week (Monday, Tuesday, etc.) played no significant part in predicting temperatures.
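As a follow-up, one might retrain the forest using only the two most important features to see how much predictive accuracy is retained. The following is a minimal sketch, reusing the variables defined above:
Python Code
# Retrain using only the two most important features as a rough check
top_features = ['temp_1', 'average']
rf_small = ens.RandomForestRegressor(n_estimators=1000)
rf_small.fit(train_features[top_features], train_labels)
# Mean absolute error of the reduced model on the test set
small_errors = abs(rf_small.predict(test_features[top_features]) - test_labels)
round(np.mean(small_errors),2)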
Exploring Further
Using Random Forests for Facial Recognition
As noted in this chapter’s introduction, facial recognition is an important and widely used application of machine learning—and an example of a classification task. Facial recognition involves categorizing or labeling images of faces based on the identities of individuals depicted in those images, which is a multiclass classification task. (In its simplest form, when the task is to determine whether a given image contains the face of a particular individual or not, this is considered a binary classification task.) When the data consist of many images of faces, each containing hundreds or thousands of features, using random forests would be appropriate. Check out this Colab Notebook, written by Michael Beyeler, a neuroscientist at the University of California, Santa Barbara, which steps through the process of setting up and using a random forest model for facial recognition in Python.
Multinomial Naïve Bayes Classifiers
One method for classifying data is to look for commonly occurring features that distinguish one data point from another. For example, imagine you have a vast collection of news articles and you want to automatically categorize them into topics such as sports, politics, business, entertainment, and technology. You may have noticed certain patterns: sports articles tend to mention team names and give numerical scores, while articles about politics mention the words “Democrat” and “Republican” fairly often. Business articles might use words such as “outlook” and “downturn” much more often than entertainment or technology articles would. Multinomial naïve Bayes can be applied to classify these articles based on the frequency and distribution of words typically found in each kind of article.
The multinomial naïve Bayes classification algorithm is based on Bayes’ Theorem, making use of prior and conditional probabilities to predict the class or label of new data. (See Normal Continuous Probability Distributions for more details.) The naïve Bayes classifier algorithm is a powerful tool for working out probabilities of events based on prior knowledge. Recall that the notation $P(A \mid B)$ stands for the conditional probability that event $A$ occurs given that event $B$ is known to occur (or has already occurred or is presumed to occur). In the simplest version, Bayes’ Theorem takes the form

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Here, we regard $A$ as an event that we would like to predict and $B$ as information that has been given to us (perhaps as a feature of a dataset). The conditional probability $P(B \mid A)$ represents the known likelihood of seeing event $B$ in the case that $A$ occurs, which is typically found using the training data. The value of $P(A)$ would be computed or estimated from training data as well and is known as the prior probability.
Example 6.11
Problem
Three companies supply batteries for a particular model of electric vehicle. Acme Electronics supplies 25% of the batteries, and 2.3% of them are defective. Batteries ‘R’ Us supplies 36% of the batteries, and 1.7% of them are defective. Current Trends supplies the remaining 39% of the batteries, and 2.1% of them are defective. A defective battery is delivered without information about its origin. What is the probability that it came from Acme Electronics?
Solution
First, we find the probability that a delivered battery is defective. Let $P(A_1) = 0.25$, $P(A_2) = 0.36$, and $P(A_3) = 0.39$ stand for the proportions of batteries from each of the companies, Acme Electronics, Batteries ‘R’ Us, and Current Trends, respectively. Then $P(A_1) + P(A_2) + P(A_3) = 1$. We also have the probabilities of receiving a defective battery (event $D$) from each company: $P(D \mid A_1) = 0.023$, $P(D \mid A_2) = 0.017$, and $P(D \mid A_3) = 0.021$. Now, the probability of receiving a defective battery from any of the companies is the sum

$$P(D) = P(D \mid A_1)P(A_1) + P(D \mid A_2)P(A_2) + P(D \mid A_3)P(A_3) = (0.023)(0.25) + (0.017)(0.36) + (0.021)(0.39) = 0.02006$$

Using Bayes’ Theorem, the probability that the defective battery came from Acme is:

$$P(A_1 \mid D) = \frac{P(D \mid A_1)\,P(A_1)}{P(D)} = \frac{(0.023)(0.25)}{0.02006} \approx 0.287$$
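The same computation is easy to check with a few lines of Python:
Python Code
# Quick check of Example 6.11 using Bayes' Theorem
priors = {'Acme': 0.25, 'BatteriesRUs': 0.36, 'CurrentTrends': 0.39}
defect_rates = {'Acme': 0.023, 'BatteriesRUs': 0.017, 'CurrentTrends': 0.021}
# Total probability of receiving a defective battery
p_defective = sum(priors[c] * defect_rates[c] for c in priors)
# Posterior probability that a defective battery came from Acme
p_acme = priors['Acme'] * defect_rates['Acme'] / p_defective
print(round(p_defective, 5), round(p_acme, 3))   # 0.02006 0.287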
In practice, there are often a multitude of features, $B_1, B_2, \ldots, B_n$, where each feature could be the event that a particular word occurs in a message, and we would like to classify the message into some type or another based on those events. In other words, we would like to know something about $P(A \mid B_1, B_2, \ldots, B_n)$, the probability of $A$ occurring if we know that events $B_1, B_2, \ldots, B_n$ occurred. In fact, we are just interested in distinguishing event $A$ from its complement $\bar{A}$ rather than making precise computations of probabilities, and so we will do simple comparisons of the quantities $P(B_1 \mid A) \cdots P(B_n \mid A)\,P(A)$ for the competing classes to find out this information. The process is best illustrated by example.
Example 6.12
Problem
A survey of 1,000 news articles was conducted. It is known that 600 were authentic articles and 400 were fake. In the real articles, 432 contained the word today, 87 contained the word disaster, and 303 contained the word police. In the fake articles, 124 contained the word today, 320 contained the word disaster, and 230 contained the word police. Predict whether a new article is real or fake if it (a) contains the word “disaster” or (b) contains the words “today” and “police.”
Solution
First, find the prior probabilities that a news article is real or fake based on the proportion of such articles in the training set: $P(\text{Real}) = \frac{600}{1000} = 0.6$ and $P(\text{Fake}) = \frac{400}{1000} = 0.4$. Next, using the word counts found in the training data, find the conditional probabilities:

$$P(\text{today} \mid \text{Real}) = \frac{432}{600} = 0.72, \quad P(\text{disaster} \mid \text{Real}) = \frac{87}{600} = 0.145, \quad P(\text{police} \mid \text{Real}) = \frac{303}{600} = 0.505$$

$$P(\text{today} \mid \text{Fake}) = \frac{124}{400} = 0.31, \quad P(\text{disaster} \mid \text{Fake}) = \frac{320}{400} = 0.8, \quad P(\text{police} \mid \text{Fake}) = \frac{230}{400} = 0.575$$
Now we will only use the numerator of Bayes’ formula, which is proportional to the exact probability. This will produce scores that we can compare. For part (a):
Real score: $P(\text{disaster} \mid \text{Real}) \cdot P(\text{Real}) = (0.145)(0.6) = 0.087$
Fake score: $P(\text{disaster} \mid \text{Fake}) \cdot P(\text{Fake}) = (0.8)(0.4) = 0.32$
Since the score for Fake is greater than the score for Real, we would classify the article as fake.
For part (b), the process is analogous. Note that probabilities are multiplied together when there are multiple features present.
Real score: $P(\text{today} \mid \text{Real}) \cdot P(\text{police} \mid \text{Real}) \cdot P(\text{Real}) = (0.72)(0.505)(0.6) \approx 0.218$
Fake score: $P(\text{today} \mid \text{Fake}) \cdot P(\text{police} \mid \text{Fake}) \cdot P(\text{Fake}) = (0.31)(0.575)(0.4) \approx 0.071$
Here the score for Real is greater than that of Fake, so we conclude the article is real.
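The same comparison can be carried out programmatically. The snippet below simply recomputes the scores from the counts given in the example:
Python Code
# Recompute the naive Bayes scores from Example 6.12
p_real, p_fake = 600/1000, 400/1000
# Conditional word probabilities estimated from the training counts
p_word_real = {'today': 432/600, 'disaster': 87/600, 'police': 303/600}
p_word_fake = {'today': 124/400, 'disaster': 320/400, 'police': 230/400}

def scores(words):
    # Return (real_score, fake_score) for an article containing the given words
    real_score, fake_score = p_real, p_fake
    for w in words:
        real_score *= p_word_real[w]
        fake_score *= p_word_fake[w]
    return real_score, fake_score

print(scores(['disaster']))          # part (a): the fake score is larger
print(scores(['today', 'police']))   # part (b): the real score is larger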
Note: Naïve Bayes classification assumes that all features of the data are independent. While this condition may not necessarily be true in real-world datasets, naïve Bayes may still perform fairly well in real-world situations. Moreover, naïve Bayes is a very simple and efficient algorithm, and so it is often the first method used in a classification problem before more sophisticated models are employed.
Gaussian Naïve Bayes for Continuous Probabilities
Gaussian naïve Bayes is a variant of the naïve Bayes classification algorithm that is specifically designed for data with continuous features. Unlike the standard (multinomial) naïve Bayes, which is commonly used to classify text data with discrete features, Gaussian naïve Bayes is suitable for datasets with features that follow a Gaussian (normal) distribution (You may recall the discussion of the normal distribution in Discrete and Continuous Probability Distributions). It's particularly useful for problems involving real-valued, continuous data in which the feature variables are relatively independent.
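Concretely, the algorithm estimates the mean $\mu_C$ and variance $\sigma_C^2$ of each feature within each class $C$ from the training data and then uses the normal density in place of the word-frequency probabilities of the multinomial version. The likelihood of observing a feature value $x$ in class $C$ is

$$P(x \mid C) = \frac{1}{\sqrt{2\pi\sigma_C^2}}\,\exp\!\left(-\frac{(x - \mu_C)^2}{2\sigma_C^2}\right)$$

and these likelihoods are multiplied together and combined with the prior $P(C)$ exactly as before.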
In this example, we will work with a large dataset, cirrhosis.csv, that has many features, both categorical and numerical. This dataset (along with many more!) is available for free download from Kaggle.com. We will only use four of the numerical columns as features. The response variable is “Status,” which may take the values D for death, C for censored (meaning that the patient did not die during the observation period), and CL for censored due to liver transplantation.
Python Code
# Import libraries
import pandas as pd ## for dataset management
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load the dataset
data = pd.read_csv('cirrhosis.csv').dropna()
# Choose feature columns and target labels
X = data[['Bilirubin','Cholesterol','Albumin','Copper']]
y = data['Status']
# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Initialize and train the Gaussian naive Bayes classifier
gnb_classifier = GaussianNB()
gnb_classifier.fit(X_train, y_train)
# Predict labels for the test set
y_pred = gnb_classifier.predict(X_test)
# Evaluate the classifier's performance
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred, labels=['D','C','CL'])
report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Confusion matrix with label order D, C, CL:\n", confusion)
The resulting output will look like this:
Accuracy: 0.71
Confusion matrix with label order D, C, CL:
[[10 14 2]
[ 0 30 0]
[ 0 0 0]]
The accuracy score of 0.71 means that correct labels were assigned 71% of the time. In the confusion matrix, rows correspond to actual labels and columns to predicted labels, both in the order D, C, CL. We can see that 10 deaths (D) and 30 non-deaths (C) were correctly predicted. However, 16 actual deaths were predicted as non-deaths (14 as C and 2 as CL). Since predicting non-death when in fact the patient will die is the more serious type of error, some caution would be needed before relying on this model for predicting cirrhosis fatalities.
Working with Big Data
In the rapidly evolving landscape of machine learning, the concept of big data looms large. As we saw in Handling Large Datasets, big data refers to extremely large and complex datasets that are beyond the capacity of traditional data processing and analysis tools to efficiently handle, manage, and analyze. Big data is characterized by the “three Vs”:
- Volume: Big data involves massive amounts of data, often ranging from terabytes ($10^{12}$ bytes) to petabytes ($10^{15}$ bytes) and beyond. Where do such vast amounts of data originate? One example is the Common Crawl dataset, which contains petabytes of data consisting of raw web page data, metadata, and other information gathered from the internet. It is likely that there will be datasets containing exabytes ($10^{18}$ bytes) of information in the near future.
- Velocity: In some important real-world applications, data is generated and collected at an incredibly high speed. This continuous flow of data may be streamed in real time or stored in huge databases for later analysis. One example is in the area of very high-definition video, which can be captured, stored, and analyzed at a rate of gigabits per hour.
- Variety: Big data also encompasses diverse types of data, including structured data (e.g., databases), unstructured data (e.g., text, images, videos), and anything in between. This variety adds considerable complexity to data management and analysis. Consider all the various sources of data and how different each may be in transmission and storage. From social media interactions and e-commerce transactions to sensor data from Internet of Things (IoT) devices, data streams in at a pace, scale, and variety never before seen. Moreover, few will be able to predict the brand-new ways that data will be used and gathered in the future!
Data Cleaning and Mining
The most time-consuming aspect of data science has traditionally been data cleaning, a process we discussed in Data Cleaning and Preprocessing. Datasets are often messy, with missing values, errors, and typos. These issues are vastly compounded when considering big data. While it may take a person an hour or so to look through a dataset with a few hundred entries to spot and fix errors or deal with incomplete data, such work becomes impossible to do by hand when there are millions or billions of entries. The task of data cleaning at scale must be handled by technology; however, traditional data cleaning tools may be ill-equipped to handle big data because of its sheer volume. Fortunately, there are tools that can process large amounts of data in parallel.
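As a small-scale illustration of the idea, pandas can stream a large CSV file in chunks so that each piece is cleaned without loading the entire file into memory; dedicated big data frameworks apply the same principle in parallel across many machines. (The filename below is hypothetical.)
Python Code
import pandas as pd
# Clean a large CSV file in manageable chunks (the filename is hypothetical)
cleaned_chunks = []
for chunk in pd.read_csv('huge_dataset.csv', chunksize=100_000):
    chunk = chunk.dropna()            # drop incomplete rows
    chunk = chunk.drop_duplicates()   # remove exact duplicate rows
    cleaned_chunks.append(chunk)
cleaned = pd.concat(cleaned_chunks, ignore_index=True)
cleaned.to_csv('huge_dataset_cleaned.csv', index=False)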
Data Mining
Data mining is the process of discovering patterns, trends, and insights from large datasets. It involves using various techniques from machine learning, statistics, and database systems to uncover valuable knowledge hidden within data. Data mining aims to transform raw data into actionable information that can be used for decision-making, prediction, and problem-solving and is often the next step after data cleaning. Common data mining tasks include clustering, classification, regression, and anomaly detection. For unlabeled data, unsupervised machine learning algorithms may be used to discover clusters in smaller samples of the data, which can then be assigned labels that would be used for more far-ranging classification or regression analysis.
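As a minimal sketch of this cluster-then-label idea, scikit-learn’s KMeans could be applied to a small sample of the data; the sample below is synthetic and purely illustrative.
Python Code
# Discover clusters in a small sample of unlabeled data, then treat the
# cluster assignments as provisional labels (synthetic data for illustration)
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Pretend this is a small sample drawn from a much larger unlabeled dataset
sample = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

kmeans = KMeans(n_clusters=2, n_init=10)
provisional_labels = kmeans.fit_predict(sample)
# These provisional labels could now be used to train a classifier
print(np.bincount(provisional_labels))   # size of each discovered cluster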
Exploring Further
Tools for Data Mining
Tools for large-scale data mining include Apache Spark, which uses a distributed file system (HDFS) to store and process data across clusters and also relies on in-memory processing, and Apache Flink, which enables real-time data analytics. See this Macrometa article for a comparison of the two programs.
Methods for Working with Big Data
Machine learning in the era of big data calls for scalability and adaptability. Traditional analysis methods, like linear regression, logistic regression, decision trees, and Bayesian classifiers, remain valuable tools. However, they often benefit from enhancements and new strategies tailored for big data environments:
- Parallel processing: Techniques that distribute computation across multiple nodes or cores, like parallelized decision trees and ensemble methods, can significantly speed up model training.
- Feature selection: Not all features of a huge dataset may be used at once. Advanced feature selection algorithms are necessary to isolate only the most significant features useful for predictions.
- Progressive learning: Some big data scenarios require models that can learn incrementally from data streams. Machine learning algorithms will need to adapt to changing data in real time.
- Deep learning: Deep neural networks, with their capacity to process vast volumes of data and learn intricate patterns, have become increasingly important in big data applications. We will delve into this topic in Deep Learning and AI Basics.
- Dimensionality reduction: Techniques like principal component analysis (PCA) help reduce the dimensionality of data, making it more manageable while retaining essential information. (Note: Dimension reduction falls outside the scope of this text.)
When the volume of data is large, the simplest and most efficient algorithms are typically chosen, since speed matters more than squeezing out extra accuracy. For example, spam filtering for email often utilizes multinomial naïve Bayes classification. In the age of instant digital communication, your email inbox can quickly become flooded with hundreds of meaningless or even malicious messages (i.e., spam) in a day, so it is very important to have a method of classifying and removing the spam. Naïve Bayes algorithms scale easily to big data, while other methods such as decision trees and random forests require significant overhead and do not scale up as easily.
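As a rough sketch of how such a spam filter could be set up with scikit-learn, word counts produced by CountVectorizer can be fed to a multinomial naïve Bayes classifier. The tiny message list below is invented purely for illustration.
Python Code
# A minimal sketch of naive Bayes spam filtering (example messages are made up)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "Win a free prize now, click here",
    "Lowest price on meds, limited offer",
    "Meeting moved to 3 pm tomorrow",
    "Can you review the attached report?",
]
labels = ["spam", "spam", "ham", "ham"]

# Convert each message into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Train the multinomial naive Bayes classifier on the word counts
clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new message
new_message = ["Click here for a free offer"]
print(clf.predict(vectorizer.transform(new_message)))   # likely ['spam']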
Datasets
Note: The primary datasets referenced in the chapter code may also be downloaded here.