Learning Outcomes
By the end of this section, you should be able to:
- 6.4.1 Measure information using entropy.
- 6.4.2 Understand how entropy can be used to classify data.
- 6.4.3 Build a top-down decision tree classifier.
- 6.4.4 Use decision trees to classify data and make predictions.
How do you make decisions? Often, we choose one product over another based on a combination of features. For instance, imagine you are considering purchasing a new smartphone. You might compare different models based on criteria such as camera quality, battery life, processing speed, and price. If there are many options available, you might organize your decision process into a flow chart or tree.
As a simple example, consider the game of 20 Questions. One person thinks up a person, place, or thing, and the others ask questions in turn to try to narrow down the space of possible answers until there is enough information to identify the object. Suppose that there are only five objects to choose from (a very simple game of 20 Questions indeed!): Tree, Dog, Human, Rock, and Diamond. The diagram in Figure 6.18, which may be called a decision tree, shows the questions and responses that would identify all of these items.
In machine learning, a decision tree is an algorithm used for both classification and regression tasks, offering a visual and intuitive approach to solving complex problems using treelike structures to keep track of decisions based on the features of the dataset. Decision trees combine simplicity and flexibility in data analysis. They are simple structures to understand and create, inspired by the way our minds make choices. Decision trees break down decisions into a series of straightforward choices, much like following a trail of breadcrumbs through a dense forest. Their flexibility lies mainly in their nonlinear nature. Instead of trying to fit data to a linear model or some other rigid structure, the decision tree classifies data using a series of choices that depend on other choices, which may in turn depend on other choices, etc.
In this section, we will introduce information theory and entropy—a measure of information that is useful in constructing and using decision trees, illustrating their remarkable power while also drawing attention to potential pitfalls.
Information and Entropy
Information theory provides a framework for measuring and managing the uniqueness of data, or the degree of surprise or uncertainty associated with an event or message. For example, if I want to send a message consisting entirely of the letter “A” repeated 1,000 times, I do not have to send 1,000 “A”s. In fact, the phrase “1000A” does the trick in only 5 characters (assuming the recipient of the message interprets it correctly). A related concept, entropy, is a measure of the amount of information in a message, structure, or system, as we will explore in the next section.
Information
Consider an event that has a certain probability $p$. The higher $p$ is, the more likely the event is, and hence there is not much information (or uncertainty) contained in the occurrence of the event. For example, the event that my car starts when I get into it in the morning has a fairly low amount of information because my car is reliable, and I expect it to start. If, on the other hand, it does not start one morning, then that low-probability event would be surprising and would contain much more information.
Information is typically measured in bits, where one bit represents the amount of information needed to distinguish between two equally likely events or messages. For example, when you flip a fair coin, you gain one bit of information because you go from being uncertain about the outcome (heads or tails) to knowing the outcome (heads, for instance). To measure the number of bits of a whole number, you would use the base-2 logarithm ($\log_2$), so it may not come as a surprise to learn that the base-2 logarithm plays a huge role in measuring information.
The information of an event with probability $p$ is $I(p) = -\log_2 p$. Recall that probabilities are always between 0 and 1, and if $0 < p \le 1$, then by properties of logarithms, we would have $\log_2 p \le 0$. This explains the negative sign that appears in the formula, as it makes sense to measure information with a positive number.
Note that the information contained in an event that is certain to occur is $I(1) = -\log_2 1 = 0$, while the information contained in knowing that a fair coin was flipped and heads came up ($p = \frac{1}{2}$) would be $I\left(\frac{1}{2}\right) = -\log_2 \frac{1}{2} = 1$, or one bit of information.
Let’s say the probability that my car will start on any given day is $p = \frac{255}{256} \approx 0.996$, and so the probability that it will not start is $q = \frac{1}{256} \approx 0.004$. The information of the event that it starts is $-\log_2\left(\frac{255}{256}\right) \approx 0.0056$ bits, whereas the information of the event that it does not start is $-\log_2\left(\frac{1}{256}\right) = 8$ bits. Essentially, there are 8 bits of information contained in the knowledge that my car does not start because this would only have happened about once in every 256 days.
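To check these numbers, the information formula is easy to evaluate directly. The following is a minimal sketch using Python's built-in math module, assuming the probabilities 255/256 and 1/256 from the example above:

# Information (in bits) of each outcome in the car example,
# assuming p = 255/256 (car starts) and q = 1/256 (car does not start).
import math

p_starts = 255 / 256
p_fails = 1 / 256

info_starts = -math.log2(p_starts)   # information if the car starts
info_fails = -math.log2(p_fails)     # information if it does not start

print(f"Car starts:         {info_starts:.4f} bits")   # about 0.0056 bits
print(f"Car does not start: {info_fails:.4f} bits")    # exactly 8 bits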
Entropy
Entropy is a measure that quantifies the average amount of information, or uncertainty, in a random variable or a probability distribution. It is closely related to information. Think of entropy as a measure of the disorder, randomness, or unpredictability in a system.
For a discrete random variable $X$ with values $x_1, x_2, \ldots, x_n$ and associated probabilities $p_1, p_2, \ldots, p_n$, the entropy of $X$ is:

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
Example 6.7
Problem
Compute the entropy of the system described in the previous example about the probability of a car starting on any given day.
Solution
Recall, the probabilities are $p = \frac{255}{256} \approx 0.996$ for “car starts” and $q = \frac{1}{256} \approx 0.004$ for “car does not start.” The entropy is therefore:

$$H = -\frac{255}{256}\log_2\frac{255}{256} - \frac{1}{256}\log_2\frac{1}{256} \approx 0.00562 + 0.03125 = 0.03687$$
The low value of entropy, 0.03687, indicates that the outcome of the event is highly predictable. This makes sense since the probability of the car starting is $\frac{255}{256}$, or about 99.6%, which implies that we could predict that the car starts and be correct 99.6% of the time.
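The same kind of quick check works for entropy. Here is a small sketch of an entropy function in Python, applied to the car example and, for comparison, to a fair coin toss:

# Entropy of a discrete distribution: H = -sum(p_i * log2(p_i)).
import math

def entropy(probabilities):
    """Return the entropy, in bits, of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(f"Car example: {entropy([255/256, 1/256]):.5f} bits")   # about 0.03687
print(f"Fair coin:   {entropy([0.5, 0.5]):.1f} bit")          # exactly 1.0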
The idea of entropy is crucial in creating decision trees, as we shall see in the next section.
Building a Decision Tree
Suppose that you wish to classify data into some number of categories based on the values of its features (inputs). We will assume that you have plenty of labeled data to train on. For the purposes of illustration, let’s say that your data consists of a set S of points in the plane with labels Red and Blue. The format of each point is (X, Y, Label). Here is your training set:
(18.4, 3.3, Red) (10.1, 6.3, Blue) (9.1, 14.7, Blue) (16.3, 5, Red)
(12.8, 1.9, Blue) (6.9, 2.8, Blue) (3.2, 11, Blue) (17.7, 1.3, Red)
(8.6, 18.3, Blue) (16.9, 5, Red) (1.9, 13, Red) (12.6, 6.7, Blue)
(7.9, 10.6, Red) (14.6, 14.5, Blue) (9.4, 12.8, Blue) (7.6, 10.2, Red)
(19.6, 16.4, Blue) (19.5, 4, Red) (5.2, 12.2, Red)
A colored scatterplot of the dataset S is shown in Figure 6.19.
Notice in Figure 6.19 that the red points fall into roughly two clusters that are bordered by blue points. In order to capture this feature, you will need a more flexible tool than linear or logistic regression. We will set up a decision tree!
As discussed earlier, a decision tree is a classification algorithm that builds a hierarchical structure where each internal node represents a decision based on a feature of the data and each leaf node represents a final decision, label, or prediction. The potential number of decision trees for any given dataset is vast, and so part of the process is to create the tree in such a way that each choice provides the greatest information gain.
Entropy is often used as a measure to determine the best feature to split the data at each node. The feature that maximizes information gain (i.e., minimizes the weighted entropy of the resulting subsets) is chosen as the splitting criterion. Each node represents a choice that splits the data into subsets until the entropy is minimized (or below a predefined threshold) in each subset. Any subset that consists of data with only a single label or category is called pure, and no further splitting would be necessary. Such a subset becomes a leaf node. However, not all leaf nodes must be pure. Sometimes a leaf node just has a majority of a given label. This may be done to avoid overfitting the model, as we shall see in Using a Decision Tree to Predict.
Note that information gain is not the only measure that could be used to choose optimal splits. There are other measures, such as the Gini index, which is used in the CART (Classification and Regression Tree) algorithm. Although we will not cover these alternatives, it is important to realize that they exist.
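For reference, the Gini index of a node whose classes occur with proportions $p_1, p_2, \ldots, p_k$ is

$$\text{Gini} = 1 - \sum_{i=1}^{k} p_i^2,$$

which, like entropy, equals 0 for a pure node and is largest when the classes are evenly mixed.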
So, how do we construct the nodes of the tree? Start with the root node consisting of the entire dataset. In our example, there are 9 red points and 10 blue points. Consider this as a set in which there is uncertainty about whether a point is Red or Blue. The probabilities of each case are $\frac{9}{19} \approx 0.474$ and $\frac{10}{19} \approx 0.526$. The total entropy of the root node is thus:

$$H(S) = -\frac{9}{19}\log_2\frac{9}{19} - \frac{10}{19}\log_2\frac{10}{19} \approx 0.998$$
This is very close to the entropy of a coin toss (1 bit). We need to separate points based on some feature of the dataset. We will use an inequality such as $X \le c$ or $Y \le c$, where $c$ is an appropriately chosen constant. The choice is based on maximizing information gain, which is defined below.
Because the features are numeric, we are going to use number comparisons (threshold inequalities) to make decisions. (A tree whose predicted output is itself numeric is called a regression tree; here the output is a category, Red or Blue, so we are building a classification tree.) Let’s start with an arbitrary choice, say $X \le 7.75$. (This value is halfway between two adjacent values of $X$, 7.6 and 7.9. This provides a bit of a buffer between points, which often produces better predictions on noisy data.) The inequality splits the data into two subsets: Left $L$ = {3 Red, 2 Blue} and Right $R$ = {6 Red, 8 Blue}. Note that $L$ has 5 points, so its weight with respect to its parent node is $w_L = \frac{5}{19}$. Similarly, the weight of $R$ is $w_R = \frac{14}{19}$. The information gain, or IG, of a split is defined by:

$$IG = H(\text{parent}) - \left( w_L H(L) + w_R H(R) \right)$$
Let’s first find the entropy measures of $L$ and $R$:

$$H(L) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.971$$

$$H(R) = -\frac{6}{14}\log_2\frac{6}{14} - \frac{8}{14}\log_2\frac{8}{14} \approx 0.985$$
Thus, the information gain of this split is $IG = 0.998 - \left(\frac{5}{19}(0.971) + \frac{14}{19}(0.985)\right) \approx 0.017$. The information gain is positive, which is a good thing; a poorly chosen split would have an information gain at or near zero. But how good is this split? Intuitively, we could have done better by splitting in a way that groups more of the red or more of the blue points into one of the two subsets. However, if we want the absolute best split, then we would need to try every possibility and pick the one that has the greatest IG. This is not difficult for a computer to do, as there are only finitely many points, hence only finitely many ways to split into subsets. For comparison, we will do one more split, using the other feature, $Y$.
Split on $Y \le 13.75$ (which is halfway between two adjacent $Y$ values, 13 and 14.5). Now $L$ = {9 Red, 6 Blue} and $R$ = {0 Red, 4 Blue}. The entropy of $L$ is $H(L) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} \approx 0.971$. Note that $R$ is a pure node, which has entropy 0. Entropy equal to 0 implies that the outcome is known; that is, there is no uncertainty. The information gain of this split is therefore $IG = 0.998 - \left(\frac{15}{19}(0.971) + \frac{4}{19}(0)\right) \approx 0.231$.
Since the value of IG is greater for this split compared to the previous one, it is better to split on $Y \le 13.75$. It turns out that this is indeed the split with the greatest information gain, so we will use it to start building our decision tree. Each new node has the potential to become a new parent node, or if it is a pure node, then we do not branch from it.
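The calculations above are easy to automate. The following sketch enters the training set S by hand and uses two small helper functions, entropy and information_gain (written for this illustration only), to reproduce the entropy of the root node and the information gain of the two candidate splits:

# Compare the information gain of the splits X <= 7.75 and Y <= 13.75
# on the Red/Blue training set S.
import math

S = [(18.4, 3.3, 'Red'), (10.1, 6.3, 'Blue'), (9.1, 14.7, 'Blue'), (16.3, 5, 'Red'),
     (12.8, 1.9, 'Blue'), (6.9, 2.8, 'Blue'), (3.2, 11, 'Blue'), (17.7, 1.3, 'Red'),
     (8.6, 18.3, 'Blue'), (16.9, 5, 'Red'), (1.9, 13, 'Red'), (12.6, 6.7, 'Blue'),
     (7.9, 10.6, 'Red'), (14.6, 14.5, 'Blue'), (9.4, 12.8, 'Blue'), (7.6, 10.2, 'Red'),
     (19.6, 16.4, 'Blue'), (19.5, 4, 'Red'), (5.2, 12.2, 'Red')]

def entropy(points):
    """Entropy (in bits) of the label distribution of a list of points."""
    labels = [label for (_, _, label) in points]
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(points, feature, threshold):
    """IG of splitting points on feature index (0 for X, 1 for Y) at threshold."""
    left = [p for p in points if p[feature] <= threshold]
    right = [p for p in points if p[feature] > threshold]
    n = len(points)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(points) - weighted

print(f"Root entropy:      {entropy(S):.3f}")                      # about 0.998
print(f"IG for X <= 7.75:  {information_gain(S, 0, 7.75):.3f}")    # about 0.017
print(f"IG for Y <= 13.75: {information_gain(S, 1, 13.75):.3f}")   # about 0.231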
Figure 6.20 shows a flowchart version of the completed decision tree, which completely classifies all of the data points so that all leaf nodes are pure. You can also see the way that each inequality divides up the plane in Figure 6.21.
In the preceding example, the maximum number of comparisons, or decisions, that could be made on any input data would be the length of the longest branch of the tree, which is 7 nodes. The number of levels, or the length of the longest branch in a decision tree, is known as its depth, so this tree has a depth of 7. However, the last few levels of the tree may in fact be overfitting the training data. In the next section, we will find out how to test and use decision trees and explore ways to reduce overfitting (variance) by pruning and other methods.
Using a Decision Tree to Predict
Once you have created a decision tree model, it can be used to classify new data. Simply follow the flowchart to find out which leaf node your data point falls into.
Example 6.8
Problem
Using the decision tree model M, classify the following points as Red or Blue: (9.9, 2.5), (15.1, 12.6), (3.5, 10.2). (Hint: Follow the flowchart shown in Figure 6.20.)
Solution
(9.9, 2.5): Starting at the root node, branch left (since $Y = 2.5 \le 13.75$). Following the inequalities in Figure 6.20, branch left again, and then branch left once again. This puts us into a leaf node marked Blue, so (9.9, 2.5) should be labeled Blue.
(15.1, 12.6): Starting at the root node, branch left (since $Y = 12.6 \le 13.75$). Then branch right. This is a Red leaf node, and so the point (15.1, 12.6) gets the label Red.
(3.5, 10.2): Starting at the root node, branch left (since $Y = 10.2 \le 13.75$). Branch left again, then branch right, then left, and then left again. Finally, take the right branch, which gives the label Blue to the point (3.5, 10.2).
Training and Testing
The decision tree model M we constructed in the previous section classifies the original dataset S with 100% accuracy. The reason is quite simple; we forced it to! Starting from the root node containing all the points, the data was split into subsets and smaller subsets until every leaf node contained a pure subset (all one label). However, there is no guarantee that M accurately represents the underlying structure of the data. Think of our dataset S as a random sample of points taken from a larger population of points. There could have been noise, outliers, and other factors present in our sample that cause M to give spurious predictions. In order to evaluate the predictive power of a decision tree, we should try it out on a testing set of additional labeled data. We measure the accuracy of the model using:

$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$
Example 6.9
Problem
Suppose that the three points from Example 6.8 have known labels, as indicated in Table 6.8. How accurate is our decision tree model based on these test points?
X | Y | Label | Predicted Label |
---|---|---|---|
9.9 | 2.5 | Blue | Blue |
15.1 | 12.6 | Red | Red |
3.5 | 10.2 | Red | Blue |
Solution
According to this table, M was correct 2 out of 3 times, so the accuracy is $\frac{2}{3} \approx 66.7\%$.
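In code, this accuracy calculation takes only a few lines. The sketch below uses the true and predicted labels from Table 6.8:

# Accuracy = number of correct predictions / total number of predictions,
# using the true and predicted labels from Table 6.8.
true_labels = ['Blue', 'Red', 'Red']
predicted_labels = ['Blue', 'Red', 'Blue']

correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)
print(f"Accuracy: {accuracy:.3f}")   # 0.667, or about 66.7%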
Pruning, Depth-Limiting, and Other Methods for Reducing Variance
Decision trees often suffer from overfitting (high variance), which adversely affects their accuracy. Fortunately, there are a number of methods that can be used to reduce overfitting.
A decision tree that has a lot of internal nodes is prone to overfitting because it may be making decisions based on only a very small amount of data or very small difference in the data’s features. “Splitting hairs” is a common term for emphasizing small differences that may not actually matter much. We generally do not want our models to split hairs. Pruning is the process of reducing the size of a decision tree by removing branches that split the data too finely.
For example, in creating the model M, the subset of S defined by the branch sequence left, left, right in Figure 6.20 consists of the points (7.6, 10.2, Red), (7.9, 10.6, Red), (3.2, 11, Blue), (5.2, 12.2, Red), (9.4, 12.8, Blue), and (1.9, 13, Red). As you can see in Figure 6.20, it takes three additional inequalities to separate these four red and two blue points into regions of the same color. At the end of the process, each of the two blue points and one of the red points became the sole member of its own subset.
Splitting a set into singletons (one-element subsets) is generally not a great idea in machine learning, as it puts too much predictive weight on a single observation. One method of pruning is to remove branches that split the data into sets L and R in which one of L or R is a singleton (or falls below a predefined proportion of the original set). The node then becomes a leaf consisting of data having multiple labels. The label of this new leaf is determined by majority vote. That is, the label that occurs most often becomes the predicted label of any data that finds its way to this node. This method is commonly called error-based (reduced-error) pruning, as it seeks to remove branches whose removal does not significantly increase the classification error rate.
Example 6.10
Problem
Use pruning to reduce the complexity of the decision tree model M described in this section. Then reclassify the data points given in Example 6.8.
Solution
From the node whose pathway from the root is left, left, right, the model M branched on the inequality $X \le 8.65$ (the midpoint between 7.9 and 9.4, the two largest X-values in this subset). This splits the set into L = {(7.6, 10.2, Red), (7.9, 10.6, Red), (3.2, 11, Blue), (5.2, 12.2, Red), (1.9, 13, Red)} and R = {(9.4, 12.8, Blue)}. Because the split would produce a singleton, we simply make this node a leaf and prune away all branches and nodes underneath it. Now the node is a leaf with four red and two blue data points, and so this leaf is labeled Red. The flowchart and corresponding regions of the plane are shown in Figure 6.22 and Figure 6.23. Note that the pruned tree has a depth of 4, which is quite a bit smaller than the original tree’s depth of 7.
Now let’s reclassify the points (9.9, 2.5), (15.1, 12.6), and (3.5, 10.2) as in Table 6.9.
(9.9, 2.5): Starting at the root node, branch left. Branch left again, and then branch left once again, which is a Blue leaf.
(15.1, 12.6): Starting at the root node, branch left. Then branch right, which is a Red leaf.
(3.5, 10.2): Starting at the root node, branch left. Branch left again. Then branch right, which is now a Red leaf.
X | Y | Label | Predicted Label |
---|---|---|---|
9.9 | 2.5 | Blue | Blue |
15.1 | 12.6 | Red | Red |
3.5 | 10.2 | Red | Red |
Now we find the model to be 100% accurate on the testing data. (Of course, a model trained on real-world data with many more data points would be very unlikely to attain 100% accuracy, but this is possible on very small datasets.) Note that the model’s accuracy on the original training set S is no longer 100%. On S, the pruned tree misclassifies two points, (3.2, 11) and (9.4, 12.8), as Red when they are in fact Blue, and so the accuracy on the training set is now $\frac{17}{19} \approx 89.5\%$. That’s still pretty good!
Other methods exist for pruning a decision tree. There are two main types: pre-pruning and post-pruning. Pre-pruning refers to setting parameters or criteria for the complexity of the decision tree at the outset. New branches would be added only if the criteria are met. Post-pruning happens after the entire tree is created without restriction. In post-pruning, branches and nodes are removed according to user-defined rules, turning the previously internal nodes into leaf nodes. Error-based pruning may be implemented as either a pre- or post-pruning algorithm.
- Depth-limiting pruning, as the name suggests, means that a decision tree cannot grow beyond a predefined depth (or number of levels). This pre-pruning technique helps to reduce the number of small splits, but care must be taken so as not to set the depth so low that the model fails to distinguish important details.
- Leaf-limiting pruning puts a maximum cap on the total number of leaves. Like depth-limiting, leaf-limiting is used to restrict the overall complexity of the decision tree model but could lead to underfitting and reduce the predictive power of the model when the data itself possesses a great deal of complexity that needs to be captured.
- Minimum description length (MDL) pruning is a post-pruning method that seeks to find the least complex form of a decision tree that meets an acceptable measure of accuracy. In general, there is always a trade-off between simplicity and accuracy in any machine learning algorithm. What MDL does is to err on the side of simplicity. A full description of the MDL algorithm falls outside the scope of this text.
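In practice, libraries expose many of these controls as parameters. As a rough sketch (not a complete treatment), scikit-learn's DecisionTreeClassifier supports pre-pruning through parameters such as max_depth, max_leaf_nodes, and min_samples_leaf, and post-pruning through ccp_alpha, which applies cost-complexity pruning; the parameter values below are illustrative choices only:

# Pre- and post-pruning controls in scikit-learn (illustrative values).
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: limit the depth, cap the number of leaves, and require
# each leaf to contain at least two training points (no singleton leaves).
pre_pruned = DecisionTreeClassifier(max_depth=4,
                                    max_leaf_nodes=8,
                                    min_samples_leaf=2)

# Post-pruning: grow the tree, then prune it back using cost-complexity
# pruning; larger ccp_alpha values produce simpler trees.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)

# Either classifier is then fitted with .fit(...) on the training features and labels.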
Decision Trees in Python
The Python package sklearn includes powerful algorithms for creating and working with decision trees. In the following code, the decision tree classifier is limited to a maximum depth of 3, and the Gini index (the library's default criterion) is used rather than information gain. The data is contained in the dataset RedBlue.csv.
Python Code
# Import libraries
import pandas as pd ## for dataset management
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt
# Read data file
data = pd.read_csv('RedBlue.csv')
inputs = data[['X', 'Y']]
output = data['Label']
# Train Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth=3)
clf = clf.fit(inputs,output)
# Plot tree with customization
plt.figure(figsize=(12, 8)) # Set the size of the plot
tree.plot_tree(clf,
               rounded=True,   # Round the edges of the box
               fontsize=10)    # Adjust font size for clarity
plt.show() # Display the plot
The resulting output will look like this: