Learning Outcomes
By the end of this section, you should be able to:
- 7.2.1 Discuss the goal of adjusting weights and bias to reduce loss/error in a neural network.
- 7.2.2 Use static backpropagation to train a neural network.
- 7.2.3 Define recurrent neural networks and discuss their advantages/disadvantages.
In the previous section, we discussed the single-layer perceptron. Training the perceptron is straightforward as it only requires adjusting a set of weights and biases that directly affect the output. Neural networks that have more than one layer, such as multilayer perceptrons (MLPs), on the other hand, must be trained using methods that can change the weights and biases in the hidden layers as well.
Backpropagation is an algorithm used to train neural networks. It determines how the error depends on the weights and biases of all the neurons, starting from the output layer and working backward through the layers, recursively updating the parameters based on how the error changes in each layer. In this section, we'll explore how neural networks adjust their weights and biases to minimize error (or loss), ultimately improving their ability to make accurate predictions. Fundamentally, backpropagation is a huge optimization problem. In this text, we focus on the intuitive ideas behind this optimization problem rather than taking a deep dive into the methods and theory required to perform the optimization steps of backpropagation.
Exploring Further
Backpropagation
A full treatment of backpropagation requires familiarity with matrix operations, calculus, and numerical analysis, among other things. Such topics fall well outside the scope of this text, but there are many resources online that go into more depth, such as Neural Networks and Deep Learning by Michael Nielsen (2019) and What Is Backpropagation Really Doing? by 3Blue1Brown.
Static Backpropagation
Backpropagation is a supervised learning algorithm, meaning that it trains on data that has already been classified (see What Is Machine Learning? for more about supervised learning in general). The goal is to iteratively adjust the weights of the connections between neurons in the network and biases associated with each neuron to minimize the difference between the predicted output and the actual target values. The term static backpropagation refers to adjustment of parameters (weights and biases) only, in contrast to dynamic backpropagation, which may also change the underlying structure (neurons, layers, connections, etc.).
Here's a high-level overview of the static backpropagation algorithm:
- Forward pass. During the forward pass, input data is propagated through the network layer by layer, using the current values of weights and biases along with the chosen activation function, $\sigma$. If the input vector is denoted by $X_0$, and the hidden layers by $X_1$, $X_2$, etc., until we reach the output layer, $X_n$, then each vector is obtained from the previous using a formula of the form $X_{k+1} = \sigma(W_k X_k + b_k)$. A minimal sketch of this computation in code appears after this list. (Note: This formula hides a lot of complex number crunching. The values of $W_k$ and $b_k$ are specific to each layer and so contain the weight and bias information for all the neurons in that layer. Indeed, $W_k$ is no longer just a vector in this context, but a matrix, which may be regarded as a table of numbers. Finally, the activation function $\sigma$ is applied to all entries of the vector input, resulting in a vector output. These details are not essential to a fundamental understanding of the process, though.)
- Error calculation. Once the network produces an output, the error between the predicted output and the actual target value is computed. Error may be computed in many ways. The function used to measure error in a neural network is called a loss or cost function.
- Backward pass (backpropagation). The error at each layer is recursively propagated backward through the network.
- Update parameters. The weights and biases of the network are updated in such a way that reduces the error, typically using an optimization algorithm such as gradient descent (which we will discuss later).
- Repeat. The steps are repeated on training data until a sufficient level of accuracy is achieved.
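To make the forward pass more concrete, here is a minimal NumPy sketch of the formula $X_{k+1} = \sigma(W_k X_k + b_k)$. The layer sizes and the random weights and biases are made-up placeholders for illustration, not trained values.

Python Code

import numpy as np

def sigmoid(z):
    # Apply the sigmoid activation elementwise to a vector
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

# Made-up architecture: 4 inputs -> 3 hidden neurons -> 2 outputs
sizes = [4, 3, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

def forward(x):
    # Repeatedly apply X_{k+1} = sigmoid(W_k X_k + b_k)
    for W, b in zip(weights, biases):
        x = sigmoid(W @ x + b)
    return x

print(forward(np.array([0.5, -1.0, 0.25, 2.0])))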
Error and Loss
Error generally refers to the difference between the predicted output of the neural network and the actual target values for a single data point. It is a quantitative measure of the mistakes made by the network in its predictions. The loss function is a mathematical function that takes the predicted output and the actual target values as inputs and outputs a single value representing the error. Learning is essentially minimizing the loss function over all training data. Suppose that the neural network outputs $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)$ and the target output is $y = (y_1, y_2, \ldots, y_n)$. Common loss functions that would measure the error between $\hat{y}$ and $y$ include:
- Mean squared error (MSE): $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The formula for MSE (introduced in Training and Testing a Model) is important due to its connection with the concept of variance in statistics. It is a commonly used loss function in regression tasks, where the goal is to predict continuous values.
- Binary cross entropy loss: $-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$. This loss function is commonly used in binary classification tasks—that is, when the output is either 0 or 1. (Note: The minus sign in front of the formula is there to make the overall value positive, as the values of the logarithms will generally be negative. The term entropy is related to the measures of information defined in Decision Trees.)
- Hinge loss: $\frac{1}{n}\sum_{i=1}^{n}\max(0,\ 1 - y_i \hat{y}_i)$, where the target labels are conventionally taken to be $-1$ and $1$. This function is also commonly used in binary classification.
- Sparse categorical cross entropy: A generalization of binary cross entropy, useful when the target labels are integers.
Average loss is computed as an average (mean) of the loss over all the data points. The closer to zero the average loss is, the smaller the error in predictions. The choice of loss function depends on the specific task, the nature of the inputs and outputs, and many other factors, which will be discussed in more detail in Introduction to Deep Learning.
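Before working through an example by hand, it may help to see these loss functions in code. The following sketch computes MSE, binary cross entropy, and hinge loss for a small set of made-up predictions and targets (the arrays below are illustrative values, not data from this section; for hinge loss, the 0/1 labels are converted to the conventional -1/+1 form).

Python Code

import numpy as np

# Made-up predictions and 0/1 targets for illustration
y_hat = np.array([0.9, 0.2, 0.7])
y_true = np.array([1, 0, 1])

# Mean squared error
mse = np.mean((y_true - y_hat) ** 2)

# Binary cross entropy
bce = -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

# Hinge loss, using -1/+1 labels as is conventional
y_pm = 2 * y_true - 1   # convert {0, 1} labels to {-1, +1}
hinge = np.mean(np.maximum(0, 1 - y_pm * y_hat))

print(f"MSE: {mse:.4f}, BCE: {bce:.4f}, Hinge: {hinge:.4f}")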
Example 7.3
Problem
The outputs of a neural network on two data points are provided in Table 7.1:
Data Point | Predicted ($\hat{y}_i$) | Actual/Target ($y_i$)
---|---|---
1 | |
2 | |
Compute the average loss using the following loss functions.
- MSE
- Binary cross entropy loss
- Hinge loss
Solution
Here, the number of data points is $n = 2$. Substitute the predicted and actual values from Table 7.1 into each loss function as follows.
- MSE
Data point 1: $(y_1 - \hat{y}_1)^2$
Data point 2: $(y_2 - \hat{y}_2)^2$
Average loss: $\frac{1}{2}\left[(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2\right]$
- Binary cross entropy loss
Shortcut: If $y_i = 0$, the first term, $y_i \log(\hat{y}_i)$, is zero, and the second term reduces to $\log(1 - \hat{y}_i)$. If $y_i = 1$, the second term, $(1 - y_i)\log(1 - \hat{y}_i)$, is zero, and the first term reduces to $\log(\hat{y}_i)$.
Data point 1: $-\left[y_1 \log(\hat{y}_1) + (1 - y_1)\log(1 - \hat{y}_1)\right]$
Data point 2: $-\left[y_2 \log(\hat{y}_2) + (1 - y_2)\log(1 - \hat{y}_2)\right]$
Average loss: the mean of the two values above
- Hinge loss
First, compute the values of $1 - y_i \hat{y}_i$. Then, if any of these are negative, use 0 in their place in the sum.
Data point 1: $\max(0,\ 1 - y_1 \hat{y}_1)$
Data point 2: $\max(0,\ 1 - y_2 \hat{y}_2)$
Average loss: $\frac{1}{2}\left[\max(0,\ 1 - y_1 \hat{y}_1) + \max(0,\ 1 - y_2 \hat{y}_2)\right]$
Note: The fact that each of the three loss functions gives a different result reflects the different ways each function penalizes errors, but with only two data points in this example, it is not advisable to place much meaning in these numbers.
Gradient Descent and Backpropagation
Since the goal of training a neural network is to reduce the value of error or loss, we essentially need to solve a (very large) optimization problem. Suppose that you are training a neural network. Consider all the weights and biases—of which there may be thousands, millions, or even more!—currently in the network as parameters that may be changed to improve the model. On a forward pass with an arbitrary input vector $X$, the output $\hat{y}$ is obtained and compared against the actual output $y$ via a loss function, $L$. How would you be able to use this information to change the parameters (weights and biases) so that on the next pass with the same input $X$, the output provides a more accurate representation of the actual output $y$, in the sense that the value of $L$ decreases? The key is to regard the whole neural network together with the loss calculation as a function $F(W, B)$ of the weights and biases and then use techniques such as gradient descent (which we discuss briefly below) to find a minimum value. Here, $W$ and $B$ are capitalized because they represent the multitude of weights and biases over the entire network, not just those from a single neuron or layer. Figure 7.7 shows a schematic diagram illustrating the function $F$ as a composite process.
As a very simple example, suppose that a neural network has only one neuron with weight $w$ and bias $b$. Input $x$ is fed into the neuron, the weight and bias are applied to $x$, and then an activation function is applied to the result to produce an output $\hat{y}$. We'll use the sigmoid, $\sigma(x) = \frac{1}{1 + e^{-x}}$, as the activation function for simplicity. Now, the result, $\hat{y} = \sigma(wx + b)$, is compared to the true value $y$, using a cost function, which for this example will be MSE. The composite function looks like this:

$F(w, b) = \left(y - \sigma(wx + b)\right)^2$
If there were more neurons, weights, and biases, then the formula would look much more complicated, as we would need to keep track of many more variables and parameters, but the basic concept remains the same. We want to find values of $w$ and $b$ that minimize the value of $F(w, b)$. This takes many repeated cycles of adjusting weights and biases based on the values of the loss function. Each complete training cycle is called an epoch.
Now suppose that $w = 0.5$ and $b = 0.25$. Consider a single point $(x, y)$ from the set of training data. The neural network computes a prediction, $\hat{y}$, as follows:

$\hat{y} = \sigma(wx + b) = \sigma(0.5x + 0.25)$

The loss with respect to the given actual output $y$ is:

$F(0.5, 0.25) = (y - \hat{y})^2$
Thus, if the value of $F(0.5, 0.25)$ is nonzero, there is some amount of loss on the prediction just made, and so the values of $w$ and $b$ should be adjusted. Instead of trying to adjust $w$ and $b$ individually, let's find the direction from the point $(0.5, 0.25)$ that would result in the fastest decrease of the error $F$. If we label the direction of steepest decrease $\mathbf{v}$, then we could move from the point $(0.5, 0.25)$ to a new point, $(0.5, 0.25) + \epsilon\mathbf{v}$, where $\epsilon$ is a small number that affects the learning rate of the neural network. This is the main idea behind gradient descent. In our example, we could find the direction of steepest descent by simply looking at how the surface $F(w, b)$ is slanted near the point $(0.5, 0.25)$. Figure 7.8 indicates the best direction with an arrow. (However, in practice, there will be no way to visualize the graph of loss with respect to the weights and biases because the points exist in very high-dimensional spaces.) The details of gradient descent and how it is used in neural networks fall outside the scope of this textbook. (See this IBM article for a great introduction to this topic.)
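To see the idea in action without any calculus, here is a sketch that estimates the direction of steepest descent for $F(w, b) = (y - \sigma(wx + b))^2$ by finite differences and takes a few small steps. The training point $(x, y) = (1, 1)$ is a made-up value for illustration; only the starting parameters $w = 0.5$ and $b = 0.25$ come from the example above.

Python Code

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, y = 1.0, 1.0     # made-up training point
w, b = 0.5, 0.25    # starting parameters from the example

def F(w, b):
    # Squared-error loss of the one-neuron network
    return (y - sigmoid(w * x + b)) ** 2

eps = 0.5   # learning rate
h = 1e-6    # step size for numerical partial derivatives

for step in range(5):
    # Estimate the gradient of F by finite differences
    dF_dw = (F(w + h, b) - F(w, b)) / h
    dF_db = (F(w, b + h) - F(w, b)) / h
    # Move a small amount in the direction of steepest decrease
    w -= eps * dF_dw
    b -= eps * dF_db
    print(f"step {step + 1}: w = {w:.4f}, b = {b:.4f}, loss = {F(w, b):.6f}")

The printed loss decreases on every step, which is exactly the behavior gradient descent is designed to produce.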
One final point to mention is that gradient descent only works when the function $F$ is differentiable, meaning that there are no corners, cusps, or points of discontinuity on the graph of $F$. Since $F$ is a composite of many component functions, each component needs to be differentiable. Activation functions such as step, ReLU, and LReLU cannot be used directly with gradient descent. There are other methods available to backpropagate and train those kinds of networks. Alternatively, the non-smooth activation functions could be approximated by smooth versions—for example, sigmoid or tanh in place of step, or softplus in place of ReLU.
Training a Neural Network Using Backpropagation in Python with TensorFlow
The Python library TensorFlow, originally developed by the Google Brain team in 2015, can be used to build a neural network with backpropagation. (A tensor is a multidimensional array, generalizing the concept of a vector.)
Let’s set up a neural network and train it to recognize the handwritten digits in the MNIST database. This dataset, called mnist_784, can be loaded automatically using the fetch_openml command. The target labels must be converted to integers explicitly. Then the data is split into training (80%) and testing (20%) sets. The features are standardized using StandardScaler. Next, the neural network model is created using tf.keras.Sequential. The model can be quite complex, but we will stick to the simplest case—one input layer with exactly as many neurons as features (784) and an output layer with 10 neurons, one for each digit. There are no hidden layers in this model. The command tf.keras.layers.Dense builds the output layer of 10 neurons that is fully connected to the input layer of 784 neurons (the value of X_train_scaled.shape[1]). The softmax activation function is used, but other functions like ReLU are also available. The parameter optimizer='adam' in the function model.compile refers to an algorithm, Adaptive Moment Estimation, used to update the weights and biases of the model during training, the details of which are beyond the scope of this text.
Note: This code takes a while to execute—up to 5 minutes! Even with no hidden layers, the backpropagation takes some time to complete. When you run the code, you should see progress bars for each epoch (not shown in the Python output in the feature box).
Python Code
# Import the libraries
import tensorflow as tf
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load MNIST Digits dataset
mnist = fetch_openml('mnist_784', version=1)
# Split the dataset into training and testing sets 80/20
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=0.2, random_state=42)
# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Cast the target variable as an integer (int32)
y_train = y_train.astype('int32')
y_test = y_test.astype('int32')
# Define the neural network architecture
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='softmax', input_shape=(X_train_scaled.shape[1],))
])
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
history = model.fit(X_train_scaled, y_train, epochs=10, batch_size=32, validation_split=0.2)
# Evaluate the model on test data
test_loss, test_accuracy = model.evaluate(X_test_scaled, y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)
The resulting output will look like this:
Test Loss: 0.4051963686943054
Test Accuracy: 0.9189285635948181
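To experiment with a deeper model, a hidden layer can be inserted between the input and output layers. The following sketch shows how the model definition above might change; the choice of 128 neurons and the ReLU activation are arbitrary, and the compile, fit, and evaluate steps stay exactly the same.

Python Code

# A variant of the model with one hidden layer (sizes chosen arbitrarily)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu',
                          input_shape=(X_train_scaled.shape[1],)),  # hidden layer
    tf.keras.layers.Dense(10, activation='softmax')                 # output layer
])

Adding a hidden layer typically improves test accuracy somewhat, at the cost of more parameters to train and a longer running time.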
Recurrent Neural Networks
Imagine if you could predict the stock market! What if you had a model that could take previous days’ prices of a stock and forecast the price of the stock tomorrow? Then you could maximize your profits by buying low and selling high with full knowledge of when those high and low points in price would occur. Given the extremely unpredictable nature of the stock market, it is unlikely that a simple forecasting model would be useful, so let’s consider a neural network. You might set it up so that the input vector $x$ holds 14 days of previous price data and the output $y$ would be the predicted price of the stock on the next day. After a while, you decide to modify the design of the neural network to take in 30 days of price data. This would involve starting over, training and testing the new model from scratch. In fact, any time you decided to change the number of inputs, you would have to create a new model and go through the training and testing phases again and again. It would be much more convenient to have a model that can take different amounts of input data. Moreover, if the model first trains on 14 days of previous price data, then it would be desirable for the network to “remember” that training so when more data becomes available (30 days, 60 days, a year, etc.), it can simply improve its predictions from the baseline already established.
This flexibility is available in recurrent neural networks (RNNs). An RNN is a neural network that incorporates feedback loops, which are internal connections from one neuron to itself or among multiple neurons in a cycle. Essentially, the output of a neuron is sent back as input to itself as well as going forward to the neurons in the next layer.
In the simplest such model, one feedback loop connects a single neuron with itself. The output of the neuron is modified by a connecting weight, and the result is included in the sum making up the input of the same neuron. So when a new signal comes into the neuron, the previous output, scaled by the connecting weight, gets added to it. If the connecting weight is positive, then this generally causes the neuron to become more active over time. On the other hand, if the connecting weight is negative, then a negative feedback loop exists, which generally dampens the activity of the neuron over time. The connecting weights of an RNN are trained alongside all the other weights and biases of the network using a variation of backpropagation called backpropagation through time (or BPTT). See Figure 7.9 for a simple diagram illustrating an RNN model.
The RNN model’s feedback loops provide a simple memory, but how do they allow for different amounts of input values? The key is to feed in single (or small batches of) data points sequentially. For example, if $x = (x_1, x_2, \ldots, x_n)$ represents the prices of a stock on days 1, 2, 3, up to $n$, and the actual value of the stock on day $n + 1$ is $y$, the RNN will take in just one element at a time (typically in chronological order) and adjust parameters based on the accuracy or loss of the prediction $\hat{y}_1$. On the next pass, it takes in $x_2$ and produces a new prediction, $\hat{y}_2$. Keep in mind, the feedback loops allow the RNN to remember (in some capacity) how it performed on the first input, and so the prediction $\hat{y}_2$ is based on the two data points, $(x_1, x_2)$. Similarly, when $x_3$ is fed into the model on the third pass, the RNN will produce a new prediction, $\hat{y}_3$, that is based on the three data points, $(x_1, x_2, x_3)$. Thus, the feedback loops of an RNN provide a mechanism for the input to consist of any number of sequential data points. This makes RNNs especially useful for time series data. (See Time Series and Forecasting for an introduction to time series.)
Unrolling an RNN
The effect of feedback loops may be visualized by “unrolling” the RNN. Consider the simplest case of a single feedback loop from one neuron to itself. The effect of the connecting weight is equivalent to connecting the neuron to a copy of the same RNN (with identical weights and biases). Of course, since the second copy of the RNN has the same feedback loop, it can be unrolled to a third, fourth, fifth copy, etc. An “unrolled” RNN with $n$ copies of the neural net has $n$ inputs. The effect of feeding the entire vector $x = (x_1, x_2, \ldots, x_n)$ into the unrolled model is equivalent to feeding the data points sequentially into the RNN. Figure 7.10 shows how this works.
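A bare-bones sketch of this idea in code: a single recurrent neuron processes a sequence one element at a time, reusing the same weights at every step, so the same network handles inputs of any length. All the weight values here are made up for illustration.

Python Code

import numpy as np

# Made-up weights for one recurrent neuron
w_in, w_fb, b = 0.8, 0.5, 0.1   # input weight, feedback (connecting) weight, bias

def run_rnn(sequence):
    # Feed the sequence in one element at a time; h carries the "memory"
    h = 0.0
    for x_t in sequence:
        h = np.tanh(w_in * x_t + w_fb * h + b)
    return h   # the final state reflects the entire sequence

# The same neuron accepts sequences of different lengths
print(run_rnn([0.2, 0.5]))
print(run_rnn([0.2, 0.5, 0.9, 0.1]))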
Vanishing/Exploding Gradient Problem
It is harder to train an RNN because the model can be very sensitive to changes in the connecting weights. During the training phase, gradient descent and a modified version of backpropagation (BPTT, as mentioned earlier) are used to adjust all the weights and biases. However, because of the feedback loops in an RNN, the effects of the connecting weights are compounded many times and can become very large. This causes algorithms like gradient descent to perform very poorly because the large compounded weights cause proportionally large changes in parameters, as opposed to the tiny changes required to home in on minimum points. This is known as the exploding gradient problem. On the other hand, if connecting weights are too small to begin with, then their compounded effects quickly approach zero, which is called the vanishing gradient problem. Both issues are important to be aware of and to address when working with RNNs; otherwise, the accuracy of your model may be compromised.
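The compounding effect is easy to demonstrate numerically. During BPTT, the connecting weight is multiplied into the error signal once per time step, so its repeated powers control whether the signal blows up or dies out; the weights 1.5 and 0.5 below are made-up examples.

Python Code

# Repeated multiplication by a connecting weight over 50 time steps
for w in (1.5, 0.5):
    signal = 1.0
    for _ in range(50):
        signal *= w
    print(f"connecting weight {w}: contribution after 50 steps = {signal:.3e}")
# weight 1.5 gives ~6.4e+08 (exploding); weight 0.5 gives ~8.9e-16 (vanishing)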
Long Short-Term Memory Networks
A long short-term memory (LSTM) network is a type of RNN designed to overcome the problems of exploding or vanishing gradients by incorporating memory cells that can capture long-term dependencies better than simple feedback loops can. LSTMs were introduced in 1997 and have since become widely used in various applications, including natural language processing, speech recognition, and time series forecasting.
LSTMs are generally more stable and easier to train than traditional RNNs, making them particularly well-suited for forecasting time series data with long-range trends and cyclic behaviors, as well as for working with less predictable sequential data, as in natural language modeling, machine translation, and sentiment analysis. RNNs and LSTMs are a stepping stone to the very sophisticated AI models that we will discuss in the next section.
RNNs in Python
Consider the dataset MonthlyCoalConsumption.csv, which we analyzed using basic time series methods in Time Series and Forecasting. The dataset contains observations up to the end of 2022. Suppose you want to predict the monthly coal consumption in each month of the next year using a recurrent neural network. First, we will set up a simple RNN in Python. The library TensorFlow can be used to build an RNN. We will use pandas to load in the data and numpy to put the data in a more usable form (np.array) for the model. The RNN layer is defined by the command tf.keras.layers.SimpleRNN, where units specifies the number of neurons in the layer and activation='tanh' sets the activation function.
Python Code
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
# Load and preprocess the data
data = pd.read_csv('MonthlyCoalConsumption.csv')
values = np.array(data['Value'])
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_values = scaler.fit_transform(values.reshape(-1, 1))
# Prepare input sequences and target values
window_size = 12 # Number of months in each input sequence
X = []
y = []
for i in range(len(scaled_values) - window_size):
    X.append(scaled_values[i:i+window_size])
    y.append(scaled_values[i+window_size])
X = np.array(X)
y = np.array(y)
# Split data into training and testing sets
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Define the RNN architecture
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(units=32, activation='tanh',
                              input_shape=(window_size, 1)),
    tf.keras.layers.Dense(1)
])
# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=16)
# Evaluate the model
loss = model.evaluate(X_test, y_test)
print("Test Loss:", loss)
The resulting output will look like this:
Test Loss: 0.018054470419883728
The test loss is very low, and so we expect the model to be very accurate. Next, the model can be used to predict the unknown values in the next 12 months. Note: Since the original data was rescaled, the predictions made by the RNN need to be converted back into unnormalized data using the command scaler.inverse_transform.
Python Code
# Predict future values
future_months = 12
last_window = scaled_values[-window_size:].reshape(1, window_size, 1)
predicted_values = []
for _ in range(future_months):
    prediction = model.predict(last_window)
    predicted_values.append(prediction[0, 0])
    # Slide the window forward: drop the oldest value, append the new prediction
    last_window = np.append(last_window[:, 1:, :], prediction.reshape(1, 1, 1), axis=1)
# Inverse transform the predictions to get actual consumption values
predicted_values = scaler.inverse_transform(np.array(predicted_values).reshape(-1, 1))
print("Predicted Values for the Next 12 Months:\n", predicted_values.flatten())
The resulting output will look like this:
Predicted Values for the Next 12 Months:
[49227.05 49190.258 41204.16 30520.389 29591.129 34516.586 42183.016
38737.39 32026.324 23482.027 23381.914 30206.467]
Finally, here is the plot containing the original data together with the predicted values. Note: There are a lot of lines of code devoted to formatting the data and axis labels before graphing. This is, unfortunately, the nature of the beast. The full details of what each piece of code does will not be explained thoroughly in this text.
Python Code
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator ## for graph formatting
from matplotlib.ticker import FuncFormatter ## for formatting y-axis
# Function to format the Y-axis values
def y_format(value, tick_number):
    return f'{value:,.0f}'
# Plot original time series data and predicted values
plt.figure(figsize=(10, 6))
plt.plot(data['Month'], data['Value'], label='Original Data', color='blue')
plt.plot(range(len(data['Value']), len(data['Value']) + future_months), predicted_values,
         label='Predicted Values', color='red')
# Create a list for the months
pd_months = pd.date_range(start='2016-01-01', end='2023-12-01', freq='MS')  # original data plus the 12 predicted months
xticks_positions = range(0, len(pd_months), 6)  # Positions to display ticks (every 6 months)
xticks_labels = [pd_months[pos].strftime('%Y-%m') for pos in xticks_positions]  # Labels corresponding to positions
plt.xlabel('Month')
plt.ylabel('Value')
plt.title('Monthly Consumption of Coal from 2016 to 2022 and Predicted Values for 2023')
plt.legend()
plt.xticks(ticks=xticks_positions, labels=xticks_labels, rotation=45)
# Apply the formatter to the Y-axis
plt.gca().yaxis.set_major_formatter(FuncFormatter(y_format))
plt.show()
The resulting output is a plot showing the original monthly coal consumption data (blue) together with the 12 predicted values for 2023 (red).
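As a final experiment, the SimpleRNN layer could be swapped for an LSTM layer, which the earlier discussion suggests should handle long-range patterns more stably. Only the model definition changes in the following sketch; the unit count of 32 simply mirrors the SimpleRNN model above, and the rest of the pipeline (scaling, windowing, prediction) is reused as-is.

Python Code

# Same pipeline as before, but with an LSTM layer in place of the SimpleRNN
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(units=32, activation='tanh',
                         input_shape=(window_size, 1)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, epochs=50, batch_size=16)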