Learning Outcomes
By the end of this section, you should be able to:
- 7.3.1 Discuss the role of hidden layers in a neural network.
- 7.3.2 Describe loss/error functions and their role in training and testing a neural network.
- 7.3.3 Set up, test, and train a deep learning neural network that can classify real-world data.
Deep learning is a general term for the training and implementation of neural networks with many layers to learn the relationships of structured representations of data. A subset of machine learning, deep learning has revolutionized various fields, ranging from computer vision and natural language processing to robotics and health care. Led by research from organizations such as Google DeepMind, Meta, IBM, MosaicML, Databricks, OpenAI, and others, deep learning has achieved remarkable outcomes, including AlphaGo Zero's unprecedented mastery of the ancient and complex game of Go. With its ability to automatically learn hierarchical representations from raw data, deep learning has unlocked new frontiers in pattern recognition, decision-making, and creativity. This chapter has already laid out the foundations of neural networks in previous sections, and the following section will focus on developing deep learning neural network models that can classify hand-drawn numbers and simple images using the TensorFlow library within the Python language.
Neural Networks with Many Layers
The hidden layers in a neural network play a crucial role in transforming input data into useful representations that can be used to make predictions or decisions. Suppose that we are trying to train a neural network to recognize handwritten digits (as in Figure 7.2). Ideally, each hidden layer should pick up on identifiable features—for example, whether a digit has a loop/circle (like 0, 6, 8, and 9) or not (like 1, 2, 3, 4, 5, 7). Other hidden layers may then look for the presence of vertical or horizontal strokes to further assist in classifying the numeral. However, hidden layers do not necessarily pick up on the same kinds of patterns that humans would. A recent direction in deep learning is developing models that are capable of extracting higher-level features (e.g., loops, edges, textures) and reporting the reasons why they made the decisions they made. Sophisticated image classification models can locate patterns, parts of images, and even spatial relationships between different parts of an image, all within the deeper layers of the network.
What makes deep learning deep? The depth of a neural network is the number of hidden layers it contains and is a defining characteristic of deep learning. Deeper networks have more capacity to learn complex patterns and relationships in the data. By stacking multiple layers of hidden units, deep learning models can learn to extract features at multiple levels of granularity, in a hierarchical (structured) manner, leading to more powerful and expressive representations of the input data. This hierarchical feature learning is key to the success of deep learning in various domains, including computer vision, natural language processing, and speech recognition, where complex patterns and relationships need to be captured from very high-dimensional data—that is, datasets that have many feature variables.
Loss Functions and Their Role in Training Neural Networks
You encountered many of the most common loss functions in Introduction to Neural Networks and Backpropagation. This section presents situations in which one loss function may be preferable over another. For instance, MSE (mean squared error) is often used when the network is trained to predict continuous values, as in regression analysis. For example, think of a neural network that is trained to provide a probability that a patient who displays a certain set of symptoms has cirrhosis. Because MSE measures the square of the differences between the predicted and actual values, it penalizes larger errors heavily and is very sensitive to outliers. Moreover, MSE is not scale-independent: the units of MSE are the square of the units of the data, so its numeric value depends on the measurement scale. The following example illustrates this last point.
Example 7.4
Problem
A neural network is being trained to predict the maximum height of a variety of corn plant based on a number of features, such as soil acidity, temperatures during the growing season, etc. A sample of the predicted values along with the corresponding actual values is given in Table 7.2 in units of meters and centimeters. Note: The data is the same in both rows, but expressed in different units. Find the MSE for both rows of predictions. By what factor does the MSE change?
[Table 7.2: Predicted and actual (target) maximum heights, with the same measurements given once in units of m and once in units of cm.]
Solution
Here, let the number of data points be $n$. Recall that

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2,$$

where $\hat{y}_i$ is the predicted value and $y_i$ is the actual (target) value. Converting the data from m to cm multiplies every predicted and actual value by 100, so each difference $\hat{y}_i - y_i$ is also multiplied by 100, and each squared difference by $100^2$.

We find that the MSE changes by a factor of $100^2 = 10{,}000$, while the numeric values of the data only scaled by a factor of 100. Given that the dataset represents the same exact measurements, this example shows one of the major pitfalls of using MSE as a loss function.
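To see this scale dependence numerically, here is a minimal NumPy sketch; the height values are hypothetical stand-ins, since Table 7.2's entries are not reproduced above:

```python
import numpy as np

# Hypothetical predicted/actual maximum heights in meters
predicted_m = np.array([2.1, 1.9, 2.4, 2.0])
actual_m = np.array([2.0, 2.0, 2.3, 2.2])

# The same measurements expressed in centimeters
predicted_cm = predicted_m * 100
actual_cm = actual_m * 100

mse_m = np.mean((predicted_m - actual_m) ** 2)
mse_cm = np.mean((predicted_cm - actual_cm) ** 2)

print(mse_m, mse_cm, mse_cm / mse_m)  # the ratio is 100**2 = 10,000
```

Whatever sample values are used, the ratio of the two MSEs will always be 10,000, since every squared difference picks up the same factor of $100^2$.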
Other commonly used loss functions include binary cross entropy, hinge loss, and (sparse) categorical cross entropy.
Binary cross entropy is more suitable for binary classification tasks—that is, deciding True or False based on the given input vector. More generally, if the output of a neural network is a probability of belonging to one of two classes (let’s say 0 and 1), then binary cross entropy tends to do a better job as a loss function for training the network. The main reasons are that binary cross entropy heavily penalizes misclassification, works well even on imbalanced data (i.e., datasets that have significantly more of one class than another), and is well-suited to handle output that may be interpreted on a probability scale from 0 to 1.
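As a sketch of this penalization behavior, the following snippet implements binary cross entropy directly from its standard formula (the helper binary_cross_entropy is ours for illustration; TensorFlow's built-in tf.keras.losses.BinaryCrossentropy computes the same quantity):

```python
import numpy as np

# Binary cross entropy for true labels y in {0, 1} and predicted
# probabilities p: BCE = -mean( y*log(p) + (1 - y)*log(1 - p) )
def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
p_good = np.array([0.9, 0.1, 0.8, 0.7])  # mostly confident and correct
p_bad = np.array([0.1, 0.9, 0.2, 0.3])   # confidently wrong

print(binary_cross_entropy(y_true, p_good))  # small loss
print(binary_cross_entropy(y_true, p_bad))   # much larger loss
```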
Hinge loss is another common loss function for binary classification tasks. It is suitable for scenarios where the goal is to maximize the margin between classes while minimizing classification errors. We define the margin between classes as a measure of the separation of data points belonging to different classifications. Larger margins typically suggest more well-defined classifications. For example, if the scores on a quiz are (71, 73, 78, 80, 81, 89, 90, 92), then there seems to be a more natural separation into the subsets (71, 73), (78, 80, 81), and (89, 90, 92) compared to the usual binning of scores by their tens digits into (71, 73, 78), (80, 81, 89), and (90, 92). In the first case, there is a margin¹ of 5 units between (71, 73) and (78, 80, 81) and a margin of 8 units between (78, 80, 81) and (89, 90, 92), compared to margins of only 2 units between (71, 73, 78) and (80, 81, 89) and 1 unit between (80, 81, 89) and (90, 92). Thus, it seems more reasonable (based on the scores and their margins) to assign a grade of A to students who scored in the subset (89, 90, 92), B to those in (78, 80, 81), and C to those in (71, 73).
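As a minimal sketch, hinge loss is commonly computed with class labels encoded as -1 and +1 and raw (pre-activation) model scores; the example below follows that convention (TensorFlow offers an equivalent built-in, tf.keras.losses.Hinge):

```python
import numpy as np

# Hinge loss for labels y in {-1, +1} and raw model scores s:
# L = mean( max(0, 1 - y*s) )
def hinge_loss(y_true, scores):
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y_true = np.array([1, -1, 1, -1])
s_wide = np.array([2.5, -3.0, 1.8, -2.2])   # correct side, outside the margin
s_narrow = np.array([0.4, -0.2, 0.9, 0.1])  # inside the margin (one misclassified)

print(hinge_loss(y_true, s_wide))    # 0.0: no penalty once the margin is cleared
print(hinge_loss(y_true, s_narrow))  # 0.65: margin violations are penalized
```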
Sparse categorical cross entropy is a generalization of binary cross entropy useful for multi-class classification tasks, where the network predicts a probability distribution over multiple classes. It measures the similarity between the predicted class probabilities and the true class labels. The term sparse refers to the true labels being supplied as plain integers rather than one-hot encoded vectors (recall one-hot encoding from Other Machine Learning Techniques). (Sparse) categorical cross entropy is often paired with softmax activation in the output layer of the network to ensure that the predicted probabilities sum to 1 and represent a valid probability distribution.
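The following short sketch, using TensorFlow's built-in losses, illustrates that sparse and non-sparse categorical cross entropy compute the same quantity and differ only in the expected label format:

```python
import tensorflow as tf

# True labels as plain integers (the "sparse" format)...
y_sparse = tf.constant([2, 0, 1])
# ...versus the equivalent one-hot encoded vectors
y_onehot = tf.one_hot(y_sparse, depth=3)

# Predicted probability distributions over 3 classes (e.g., softmax output)
y_pred = tf.constant([[0.1, 0.2, 0.7],
                      [0.8, 0.1, 0.1],
                      [0.2, 0.6, 0.2]])

sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy()(y_sparse, y_pred)
dense_loss = tf.keras.losses.CategoricalCrossentropy()(y_onehot, y_pred)
print(float(sparse_loss), float(dense_loss))  # identical values (~0.3635)
```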
Example: Using Deep Learning to Classify Handwritten Numerals
TensorFlow has robust functionality for neural networks that incorporate hidden layers using various activation functions. It is very easy to add hidden layers, as you can see in the following code. Compare the tf.keras.Sequential model with the similar example from Backpropagation. The model still begins with an input layer of size equal to the number of features. The input neurons are densely connected to a hidden layer of 128 neurons using the ReLU activation function. The next hidden layer has 64 neurons, again using ReLU as the activation function. Finally, the output layer has 10 neurons (one for each digit) and softmax activation. It is common practice to use a simpler activation function (like ReLU) in hidden layers that have many connections, while more sophisticated activation functions are appropriate at the output layer.
Note that the accuracy of this model improves to 0.967, as compared to 0.919 from our previous model that lacked hidden layers.
Python Code
```python
# import the libraries
import tensorflow as tf
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load MNIST Digits dataset
mnist = fetch_openml('mnist_784', version=1)
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=0.2, random_state=42)
# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Cast the target variable as an integer (int32)
y_train = y_train.astype('int32')
y_test = y_test.astype('int32')
# Define the neural network architecture
model = tf.keras.Sequential([
tf.keras.layers.Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(10, activation='softmax')
])
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
history = model.fit(X_train_scaled, y_train, epochs=10, batch_size=32, validation_split=0.2)
# Evaluate the model on test data
test_loss, test_accuracy = model.evaluate(X_test_scaled, y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)
The resulting output will look like this:
Test Loss: 0.20183834433555603
Test Accuracy: 0.968999981880188
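Once trained, the model can also be used for inference on individual examples. As a quick usage sketch continuing the code above:

```python
import numpy as np

# Classify the first test image; predict() returns softmax
# probabilities with shape (1, 10), one column per digit
probs = model.predict(X_test_scaled[:1])
print("Predicted digit:", np.argmax(probs))
print("Actual digit:", np.asarray(y_test)[0])
```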
Footnotes
- ¹ Technically, margin refers to the distance between the closest data point in a class and a separating hyperplane that serves as a boundary with respect to some other class of data points. These topics are important in support vector machines (SVMs), another type of supervised learning model for classification and regression tasks, which we do not cover in this text.