Principles of Data Science

6.4 Decision Trees


Learning Outcomes

By the end of this section, you should be able to:

  • 6.4.1 Measure information using entropy.
  • 6.4.2 Understand how entropy can be used to classify data.
  • 6.4.3 Build a top-down decision tree classifier.
  • 6.4.4 Use decision trees to classify data and make predictions.

How do you make decisions? Often, we choose one product over another based on a combination of features. For instance, imagine you are considering purchasing a new smartphone. You might compare different models based on criteria such as camera quality, battery life, processing speed, and price. If there are many options available, you might organize your decision process into a flow chart or tree.

As a simple example, consider the game of 20 Questions. One person thinks up a person, place, or thing, and the others ask questions in turn to try to narrow down the space of possible answers until there is enough information to identify the object. Suppose that there are only five objects to choose from (a very simple game of 20 Questions indeed!): Tree, Dog, Human, Rock, and Diamond. The diagram in Figure 6.18, which may be called a decision tree, shows the questions and responses that would identify all of these items.

A decision tree. The top box says “is it living?” and branches down to “yes” or “no.” The “yes” box branches to “Is it an animal?” with “no” leading to “tree” and “yes” leading to “does it walk on two legs?” From there, “yes” leads to “human” and “no” leads to “dog.” The original “no” branch leads to “is it worth a lot of money?” with “yes” leading to “diamond” and “no” leading to “rock.”
Figure 6.18 A Simple Decision Tree

In machine learning, a decision tree is an algorithm used for both classification and regression tasks, offering a visual and intuitive approach to solving complex problems using treelike structures to keep track of decisions based on the features of the dataset. Decision trees combine simplicity and flexibility in data analysis. They are simple structures to understand and create, inspired by the way our minds make choices. Decision trees break down decisions into a series of straightforward choices, much like following a trail of breadcrumbs through a dense forest. Their flexibility lies mainly in their nonlinear nature. Instead of trying to fit data to a linear model or some other rigid structure, the decision tree classifies data using a series of choices that depend on other choices, which may in turn depend on other choices, etc.

In this section, we will introduce information theory and entropy—a measure of information that is useful in constructing and using decision trees, illustrating their remarkable power while also drawing attention to potential pitfalls.

Information and Entropy

Information theory provides a framework for measuring and managing the uniqueness of data, or the degree of surprise or uncertainty associated with an event or message. For example, if I want to send a message consisting entirely of the letter “A” repeated 1,000 times, I do not have to send 1,000 “A”s. In fact, the phrase “1,000A” does the trick in only 5 characters (assuming the recipient of the message interprets it correctly). A related concept, entropy, is a measure of the amount of information in a message, structure, or system, as we will explore in the next section.

Information

Consider an event that has a certain probability $p$. The higher $p$ is, the more likely the event is, and hence there is not much information (or uncertainty) contained in the occurrence of the event. For example, the event that my car starts when I get into it in the morning has a fairly low amount of information because my car is reliable, and I expect it to start. If, on the other hand, it does not start one morning, then that low-probability event would be surprising and would contain much more information.

Information is typically measured in bits, where one bit represents the amount of information needed to distinguish between two equally likely events or messages. For example, when you flip a fair coin, you gain one bit of information because you go from being uncertain about the outcome (heads or tails) to knowing the outcome (heads, for instance). To measure the number of bits needed to represent a whole number, you would use the base-2 logarithm ($\log_2 x$), so it may not come as a surprise to learn that the base-2 logarithm plays a huge role in measuring information.

The information of an event with probability $p$ is $-\log_2 p$. Recall that probabilities $p$ are always between 0 and 1, and if $0 < p < 1$, then by properties of logarithms, we would have $\log_2 p < 0$. This explains the negative sign that appears in the formula, as it makes sense to measure information with a positive number.

Note that the information contained in an event that is certain to occur ($p = 1$) is $-\log_2 1 = 0$, while the information contained in knowing that a coin was flipped and heads came up ($p = 1/2$) would be $-\log_2\left(\frac{1}{2}\right) = -(-1) = 1$, or one bit of information.

Let’s say the probability that my car will start on any given day is $p = 255/256$, and so the probability that it will not start is $1 - p = 1/256$. The information of the event that it starts is $-\log_2\left(\frac{255}{256}\right) \approx 0.0056$, whereas the information of the event that it does not start is $-\log_2\left(\frac{1}{256}\right) = 8$. Essentially, there are 8 bits of information contained in the knowledge that my car does not start because this would only have happened about once in every $2^8 = 256$ days.
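As a quick check of these numbers, here is a minimal Python sketch (our own illustration, not from the text; the function name is hypothetical) that computes the information content $-\log_2 p$ of each event:

      import math

      def information(p):
          # Information content, in bits, of an event with probability p
          return -math.log2(p)

      print(information(255/256))   # car starts: about 0.0056 bits
      print(information(1/256))     # car does not start: exactly 8 bits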

Entropy

Entropy is a measure that quantifies the average amount of information, or uncertainty, in a random variable or a probability distribution. It is closely related to information. Think of entropy as a measure of the disorder, randomness, or unpredictability in a system.

For a discrete random variable $X$ with $n$ values $x_i$ and associated probabilities $p_i$, the entropy of $X$ is:

$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$
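As a small illustration (a sketch of ours, not part of the text's examples), this formula is easy to compute directly; a fair coin, with $p_1 = p_2 = \frac{1}{2}$, has an entropy of exactly one bit:

      import math

      def entropy(probabilities):
          # Entropy, in bits, of a discrete distribution given as a list of probabilities
          return -sum(p * math.log2(p) for p in probabilities if p > 0)

      print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit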

Example 6.7

Problem

Compute the entropy of the system described in the previous example about the probability of a car starting on any given day.

The idea of entropy is crucial in creating decision trees, as we shall see in the next section.

Building a Decision Tree

Suppose that you wish to classify data into some number of categories based on the values of its features (inputs). We will assume that you have plenty of labeled data to train on. For the purposes of illustration, let’s say that your data consists of a set $S$ of points in the plane with labels Red and Blue. The format of each point is (X, Y, Label). Here is your training set.

(18.4, 3.3, Red) (10.1, 6.3, Blue) (9.1, 14.7, Blue) (16.3, 5, Red)

(12.8, 1.9, Blue) (6.9, 2.8, Blue) (3.2, 11, Blue) (17.7, 1.3, Red)

(8.6, 18.3, Blue) (16.9, 5, Red) (1.9, 13, Red) (12.6, 6.7, Blue)

(7.9, 10.6, Red) (14.6, 14.5, Blue) (9.4, 12.8, Blue) (7.6, 10.2, Red)

(19.6, 16.4, Blue) (19.5, 4, Red) (5.2, 12.2, Red)

A colored scatterplot of the dataset S is shown in Figure 6.19.

A scatterplot with 19 data points—9 points are red and 10 are blue. The X and Y axes both range from 0 to 20.
Figure 6.19 Labeled Scatterplot of Dataset $S$. Note the pattern of red and blue that would be impossible to predict from a typical linear or logistic regression.

Notice in Figure 6.19 that the red points fall into roughly two clusters that are bordered by blue points. In order to capture this pattern, you will need a more flexible tool than linear or logistic regression. We will set up a decision tree!

As discussed earlier, a decision tree is a classification algorithm that builds a hierarchical structure where each internal node represents a decision based on a feature of the data and each leaf node represents a final decision, label, or prediction. The potential number of decision trees for any given dataset is vast, so part of the process is to create the tree in such a way that each choice provides the greatest information gain.

Entropy is often used as a measure to determine the best feature to split the data on at each node. The feature (and split point) that maximizes information gain (i.e., minimizes the weighted entropy of the resulting subsets) is chosen as the splitting criterion. Each node represents a choice that splits the data into subsets until the entropy is minimized (or below a predefined threshold) in each subset. Any subset that consists of data with only a single label or category is called pure, and no further splitting is necessary. Such a subset becomes a leaf node. However, not all leaf nodes must be pure; sometimes a leaf node merely has a majority of a given label. This may be done to avoid overfitting the model, as we shall see in Using a Decision Tree to Predict.

Note that information gain is not the only measure that could be used to choose optimal splits. There are other measures, such as the Gini index, which is used in the CART (Classification and Regression Tree) algorithm. Although we will not cover these alternatives, it is important to realize that they exist.

So, how do we construct the nodes of the tree? Start with the root node consisting of the entire dataset. In our example, there are 9 red points and 10 blue points. Think of the label of a point drawn from this set as uncertain: it is either Red ($r$) or Blue ($b$). The probabilities of each case are $P(r) = \frac{9}{19} \approx 0.47$ and $P(b) = \frac{10}{19} \approx 0.53$. The total entropy of the root node is thus:

$H(\text{root}) = -[0.47\log_2(0.47) + 0.53\log_2(0.53)] \approx 0.9974$

This is very close to the entropy of a coin toss. We need to separate points based on some feature of the dataset. We will use an inequality such as $X \le a$ or $Y \le a$, where $a$ is an appropriately chosen constant. The choice is based on maximizing information gain, which is defined below.

Because the features are numeric, the decisions in our tree take the form of number comparisons (thresholds); the tree itself is still a classification tree, since the labels are categories, whereas the term regression tree refers to a tree that predicts a numeric output. Let’s start with an arbitrary choice, say $X \le 7.75$. (This threshold is halfway between two adjacent values of $X$, 7.6 and 7.9. This provides a bit of a buffer between points, which often produces better predictions on noisy data.) The inequality splits the data into two subsets: Left $= L =$ {3 Red, 2 Blue} and Right $= R =$ {6 Red, 8 Blue}. Note that $L$ has 5 points, so its weight with respect to its parent node is $w_L = \frac{5}{19}$. Similarly, the weight of $R$ is $w_R = \frac{14}{19}$. The information gain, or $IG$, of a split is defined by:

$IG = H(\text{parent}) - w_L H(L) - w_R H(R)$

Let’s first find the entropy measures of $L$ and $R$.

$H(L) = -\left[\frac{3}{5}\log_2\left(\frac{3}{5}\right) + \frac{2}{5}\log_2\left(\frac{2}{5}\right)\right] \approx 0.9710$
$H(R) = -\left[\frac{6}{14}\log_2\left(\frac{6}{14}\right) + \frac{8}{14}\log_2\left(\frac{8}{14}\right)\right] \approx 0.9852$

Thus, the information gain of this split is $IG = 0.9974 - \frac{5}{19}(0.9710) - \frac{14}{19}(0.9852) \approx 0.016$. The information gain is positive but small. (In fact, information gain can never be negative: the weighted average entropy of the subsets can never exceed the entropy of the parent, so the worst a split can do is provide no gain at all.) But how good is this split? Intuitively, we could have done better by splitting in a way that groups more of the red points or more of the blue points into one of the two subsets. However, if we want the absolute best split, then we would need to try every possibility and pick the one that has the greatest $IG$. This is not difficult for a computer to do, as there are only finitely many points, hence only finitely many ways to split into subsets. For comparison, we will do one more split, using the other feature ($Y$).

Split on $Y \le 13.75$ (which is halfway between two adjacent $Y$ values, 13 and 14.5). Now $L =$ {9 Red, 6 Blue} and $R =$ {0 Red, 4 Blue}. The entropy of $L$ is $-\left[\frac{9}{15}\log_2\left(\frac{9}{15}\right) + \frac{6}{15}\log_2\left(\frac{6}{15}\right)\right] \approx 0.9710$. Note that $R$ is a pure node, which has entropy 0. Entropy equal to 0 implies that the outcome is known; that is, there is no uncertainty.

$IG = 0.9974 - \frac{15}{19}(0.9710) - \frac{4}{19}(0) \approx 0.231$

Since the value of $IG$ is greater for this split compared to the previous one, it is better to split on $Y \le 13.75$. It turns out that this is indeed the split with the greatest information gain, so we will use it to start building our decision tree. Each new node has the potential to become a new parent node, or if it is a pure node, then we do not branch from it.
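To verify these calculations, the following minimal Python sketch (our own illustration, with hypothetical helper names) computes the information gain of a threshold split on the training set:

      import math

      # The labeled training set from the text: (X, Y, Label)
      points = [
          (18.4, 3.3, "Red"),  (10.1, 6.3, "Blue"),  (9.1, 14.7, "Blue"), (16.3, 5, "Red"),
          (12.8, 1.9, "Blue"), (6.9, 2.8, "Blue"),   (3.2, 11, "Blue"),   (17.7, 1.3, "Red"),
          (8.6, 18.3, "Blue"), (16.9, 5, "Red"),     (1.9, 13, "Red"),    (12.6, 6.7, "Blue"),
          (7.9, 10.6, "Red"),  (14.6, 14.5, "Blue"), (9.4, 12.8, "Blue"), (7.6, 10.2, "Red"),
          (19.6, 16.4, "Blue"),(19.5, 4, "Red"),     (5.2, 12.2, "Red"),
      ]

      def entropy(labels):
          # Entropy, in bits, of a list of class labels
          n = len(labels)
          return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                      for c in set(labels))

      def information_gain(data, feature, threshold):
          # IG of splitting on row[feature] <= threshold (feature 0 is X, 1 is Y)
          left   = [row[2] for row in data if row[feature] <= threshold]
          right  = [row[2] for row in data if row[feature] >  threshold]
          parent = [row[2] for row in data]
          w_left, w_right = len(left) / len(data), len(right) / len(data)
          return entropy(parent) - w_left * entropy(left) - w_right * entropy(right)

      print(information_gain(points, 0, 7.75))    # split on X <= 7.75:  about 0.016
      print(information_gain(points, 1, 13.75))   # split on Y <= 13.75: about 0.231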

Figure 6.20 shows a flowchart version of the completed decision tree, which completely classifies all of the data points so that all leaf nodes are pure. You can also see the way that each inequality divides up the plane in Figure 6.21.

A flowchart of a completed decision tree. The top box says “y is less than or equal to 13.75” and branches left to “x is less than or equal to 14.55” or “blue.” The left box branches left to “y is less than or equal to 8.45” and right to “red.” Boxes continue to branch until the outcome is either “red” or “blue.”
Figure 6.20 Flowchart for a Decision Tree. The root node is at the top. Each internal node asks a question about the data and routes accordingly. Leaf nodes indicate the label for the data.

A scatterplot with 19 data points—9 points are red and 10 are blue. The X and Y axes both range from 0 to 20. The data set is categorized into seven red and blue regions separated by light purple lines that are defined by inequalities.
Figure 6.21 Regions of the Plane Corresponding to Pure Nodes. The dataset is completely categorized into red and blue regions separated by lines that are defined by inequalities.

In the preceding example, the maximum number of comparisons, or decisions, that could be made on any input data would be the length of the longest branch of the tree, which is 7 nodes. The number of levels, or the length of the longest branch in a decision tree, is known as its depth, so this tree has a depth of 7. However, the last few levels of the tree may in fact be overfitting the training data. In the next section we will find out how to test and use decision trees and explore ways to reduce their variance (overfitting) through pruning and other methods.

Using a Decision Tree to Predict

Once you have created a decision tree model, it can be used to classify new data. Simply follow the flowchart to find out which leaf node your data point falls into.

Example 6.8

Problem

Using the decision tree model M, classify the following points as Red or Blue: (9.9, 2.5), (15.1, 12.6), (3.5, 10.2). (Hint: Follow the flowchart shown in Figure 6.20.)

Training and Testing

The decision tree model M we constructed in the previous section classifies the original dataset S with 100% accuracy. The reason is quite simple; we forced it to! Starting from the root node containing all the points, the data was split into subsets and smaller subsets until every leaf node contained a pure subset (all one label). However, there is no guarantee that M accurately represents the underlying structure of the data. Think of our dataset S as a random sample of points taken from a larger population of points. There could have been noise, outliers, and other factors present in our sample that cause M to give spurious predictions. In order to evaluate the predictive power of a decision tree, we should try it out on a testing set of additional labeled data. We measure the accuracy of the model using:

$\text{Accuracy} = \dfrac{\text{Number of correct predictions}}{\text{Number of predictions}}$
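For instance, this can be computed with sklearn's accuracy_score; the labels below are made up purely for illustration:

      from sklearn.metrics import accuracy_score

      # Hypothetical true and predicted labels, for illustration only
      true_labels      = ["Red", "Blue", "Blue", "Red"]
      predicted_labels = ["Red", "Blue", "Red",  "Red"]

      print(accuracy_score(true_labels, predicted_labels))   # 3 correct out of 4 = 0.75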

Example 6.9

Problem

Suppose that the three points from Example 6.8 have known labels, as indicated in Table 6.8. How accurate is our decision tree model based on these test points?

X Y Label Predicted Label
9.9 2.5 Blue Blue
15.1 12.6 Red Red
3.5 10.2 Red Blue
Table 6.8 Test Points

Pruning, Depth-Limiting, and Other Methods for Reducing Variance

Decision trees often suffer from overfitting (high variance), which adversely affects their accuracy. Fortunately, there are a number of methods that can be used to reduce overfitting.

A decision tree that has a lot of internal nodes is prone to overfitting because it may be making decisions based on only a very small amount of data or very small difference in the data’s features. “Splitting hairs” is a common term for emphasizing small differences that may not actually matter much. We generally do not want our models to split hairs. Pruning is the process of reducing the size of a decision tree by removing branches that split the data too finely.

For example, in creating the model M, the subset of S defined by the sequence left, left, right—that is, $Y \le 13.75$, $X \le 14.55$, and $Y > 8.45$—consists of the points (7.6, 10.2, Red), (7.9, 10.6, Red), (3.2, 11, Blue), (5.2, 12.2, Red), (9.4, 12.8, Blue), and (1.9, 13, Red). As you can see in Figure 6.20, it takes three additional inequalities to separate the four red and two blue points into regions of the same color. At the end of the process, each of the two blue points and one of the red points became the sole members of their subsets.

Splitting a set into singletons (one-element subsets) is generally not a great idea in machine learning, as this puts too much predictive weight on a single observation. One method of pruning is to remove branches that split the data into sets $L$ and $R$ in which one of $L$ or $R$ is a singleton (or falls below a predefined proportion of the original set). The node then becomes a leaf containing data with multiple labels. The label of this new leaf is determined by majority vote; that is, the label that occurs most often becomes the predicted label of any data that finds its way to this node. This method is commonly called error-based (reduced-error) pruning, as it removes branches whose removal does not significantly increase the classification error rate.

Example 6.10

Problem

Use pruning to reduce the complexity of the decision tree model M described in this section. Then reclassify the data points given in Example 6.8.

Other methods exist for pruning a decision tree. There are two main types: pre-pruning and post-pruning. Pre-pruning refers to setting parameters or criteria for the complexity of the decision tree at the outset; new branches are added only if the criteria are met. Post-pruning happens after the entire tree is created without restriction; branches and nodes are then removed according to user-defined rules, turning previously internal nodes into leaf nodes. Error-based pruning may be implemented as either a pre- or post-pruning algorithm. Several common strategies are listed below, followed by a short code sketch showing how some of them appear in practice.

  • Depth-limiting pruning, as the name suggests, means that a decision tree cannot grow beyond a predefined depth (or number of levels). This pre-pruning technique helps to reduce the number of small splits, but care must be taken so as not to set the depth so low that the model fails to distinguish important details.
  • Leaf-limiting pruning puts a maximum cap on the total number of leaves. Like depth-limiting, leaf-limiting is used to restrict the overall complexity of the decision tree model but could lead to underfitting and reduce the predictive power of the model when the data itself possesses a great deal of complexity that needs to be captured.
  • Minimum description length (MDL) pruning is a post-pruning method that seeks to find the least complex form of a decision tree that meets an acceptable measure of accuracy. In general, there is always a trade-off between simplicity and accuracy in any machine learning algorithm. What MDL does is to err on the side of simplicity. A full description of the MDL algorithm falls outside the scope of this text.
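Several of these ideas are exposed directly as parameters of sklearn's DecisionTreeClassifier, as the following sketch shows (the parameter values here are arbitrary choices for illustration; MDL pruning is not built into sklearn):

      from sklearn.tree import DecisionTreeClassifier

      # Pre-pruning: limit the depth, cap the number of leaves, and require a
      # minimum number of training points in each leaf
      pre_pruned = DecisionTreeClassifier(max_depth=3,
                                          max_leaf_nodes=8,
                                          min_samples_leaf=2)

      # Post-pruning: grow the tree, then prune it back with cost-complexity
      # pruning, controlled by ccp_alpha (larger values prune more aggressively)
      post_pruned = DecisionTreeClassifier(ccp_alpha=0.02)

      # Both are trained in the usual way, e.g., pre_pruned.fit(inputs, output)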

Decision Trees in Python

The Python package sklearn includes powerful algorithms for creating and working with decision trees. Note that in the code below, the decision tree classifier is limited to a maximum depth of 3, and sklearn's default Gini index is used as the splitting criterion rather than information gain (to split on entropy instead, pass criterion='entropy' to DecisionTreeClassifier). The data is contained in the file RedBlue.csv.

Python Code

      # Import libraries
      import pandas as pd  ## for dataset management
      from sklearn.tree import DecisionTreeClassifier 
      from sklearn import tree
      import matplotlib.pyplot as plt
      
      # Read data file
      data = pd.read_csv('RedBlue.csv')
      
      inputs = data[['X', 'Y']]
      output = data['Label']
      
      # Train Decision Tree Classifier
      clf = DecisionTreeClassifier(max_depth=3)
      clf = clf.fit(inputs,output)
      
      # Plot tree with customization
      plt.figure(figsize=(12, 8))  # Set the size of the plot
      tree.plot_tree(clf,
                     rounded=True,       # Round the edges of the box
                     fontsize=10)        # Adjust font size for clarity
      plt.show()  # Display the plot
    

The resulting output will look like this:

A decision tree visualization showing the structure of a classification model. The tree has multiple nodes, each representing a decision based on feature values (x[0] and x[1]). The branches represent the possible outcomes of these decisions. The leaves of the tree represent the final classifications, with values indicating the number of samples in each class. Gini impurity values are shown at each node, measuring the degree of impurity or uncertainty in the data at that point in the tree.
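Once the classifier has been fit, it can also be used to label new points and to inspect the complexity of the learned tree. The following usage sketch continues the code above; the specific predictions depend on the tree that sklearn actually learns from RedBlue.csv, so they are not guaranteed to match the hand-built model M.

      # Classify new points with the fitted tree
      new_points = pd.DataFrame({'X': [9.9, 15.1, 3.5], 'Y': [2.5, 12.6, 10.2]})
      print(clf.predict(new_points))   # predicted labels for the three points

      # Inspect the complexity of the fitted tree
      print(clf.get_depth())           # depth of the tree (at most 3 here)
      print(clf.get_n_leaves())        # number of leaf nodes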