Learning Outcomes
By the end of this section, you should be able to:
- 7.2.1 Discuss the goal of adjusting weights and bias to reduce loss/error in a neural network.
- 7.2.2 Use static backpropagation to train a neural network.
- 7.2.3 Define recurrent neural networks and discuss their advantages/disadvantages.
In the previous section, we discussed the single-layer perceptron. Training the perceptron is straightforward as it only requires adjusting a set of weights and biases that directly affect the output. Neural networks that have more than one layer, such as multilayer perceptrons (MLPs), on the other hand, must be trained using methods that can change the weights and biases in the hidden layers as well.
Backpropagation is an algorithm used to train neural networks. It determines how the error depends on the weights and biases of all the neurons, starting from the output layer and working backward through the layers, recursively updating the parameters based on how the error changes in each layer. In this section, we'll explore how neural networks adjust their weights and biases to minimize error (or loss), ultimately improving their ability to make accurate predictions. Fundamentally, backpropagation is a huge optimization problem. In this text, we focus on the intuitive ideas behind this optimization problem rather than taking a deep dive into the methods and theory required to perform the optimization steps of backpropagation.
Exploring Further
Backpropagation
A full treatment of backpropagation requires familiarity with matrix operations, calculus, and numerical analysis, among other things. Such topics fall well outside the scope of this text, but there are many resources online that go into more depth, such as Neural Networks and Deep Learning by Michael Nielsen (2019) and What Is Backpropagation Really Doing? by 3Blue1Brown.
Static Backpropagation
Backpropagation is a supervised learning algorithm, meaning that it trains on data that has already been classified (see What Is Machine Learning? for more about supervised learning in general). The goal is to iteratively adjust the weights of the connections between neurons in the network and biases associated with each neuron to minimize the difference between the predicted output and the actual target values. The term static backpropagation refers to adjustment of parameters (weights and biases) only, in contrast to dynamic backpropagation, which may also change the underlying structure (neurons, layers, connections, etc.).
Here's a high-level overview of the static backpropagation algorithm:
- Forward pass. During the forward pass, input data is propagated through the network layer by layer, using the current values of weights and biases along with the chosen activation function, $\sigma$. If the input vector is denoted by $X_0$, and the hidden layers by $X_1$, $X_2$, etc., until we reach the output layer, $X_n$, then each vector is obtained from the previous using a formula of the form $X_{k+1} = \sigma(W_k X_k + b_k)$. A minimal sketch of this computation in code appears after this list. (Note: This formula hides a lot of complex number crunching. The values of $W_k$ and $b_k$ are specific to each layer and so contain the weight and bias information for all the neurons in that layer. Indeed, $W_k$ is no longer just a vector in this context, but a matrix, which may be regarded as a table of numbers. Finally, the activation function $\sigma$ is applied to all entries of the vector input, resulting in a vector output. These details are not essential to a fundamental understanding of the process, though.)
- Error calculation. Once the network produces an output, the error between the predicted output and the actual target value is computed. Error may be computed in many ways. The function used to measure error in a neural network is called a loss or cost function.
- Backward pass (backpropagation). The error at each layer is recursively propagated backward through the network.
- Update parameters. The weights and biases of the network are updated in such a way that reduces the error, typically using an optimization algorithm such as gradient descent (which we will discuss later).
- Repeat. The steps are repeated on training data until a sufficient level of accuracy is achieved.
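To make the forward pass more concrete, here is a minimal NumPy sketch of the formula $X_{k+1} = \sigma(W_k X_k + b_k)$. The layer sizes and the random weights and biases are made-up placeholders for illustration, not trained values.

Python Code

import numpy as np

def sigmoid(z):
    # Apply the sigmoid activation elementwise to a vector
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

# Made-up architecture: 4 inputs -> 3 hidden neurons -> 2 outputs
sizes = [4, 3, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

def forward(x):
    # Repeatedly apply X_{k+1} = sigmoid(W_k X_k + b_k)
    for W, b in zip(weights, biases):
        x = sigmoid(W @ x + b)
    return x

print(forward(np.array([0.5, -1.0, 0.25, 2.0])))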
Error and Loss
Error generally refers to the difference between the predicted output of the neural network and the actual target values for a single data point. It is a quantitative measure of the mistakes made by the network in its predictions. The loss function is a mathematical function that takes the predicted output and the actual target values as inputs and outputs a single value representing the error. Learning is essentially minimizing the loss function over all training data. Suppose that the neural network outputs $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)$ and the target output is $y = (y_1, y_2, \ldots, y_n)$. Common loss functions that would measure the error between $\hat{y}$ and $y$ include:
- Mean squared error (MSE): $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The formula for MSE (introduced in Training and Testing a Model) is important due to its connection with the concept of variance in statistics. It is a commonly used loss function in regression tasks, where the goal is to predict continuous values.
- Binary cross entropy loss: $-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$. This loss function is commonly used in binary classification tasks—that is, when the output is either 0 or 1. (Note: The minus sign in front of the formula is there to make the overall value positive, as the values of the logarithms will generally be negative. The term entropy is related to the measures of information defined in Decision Trees.)
- Hinge loss: $\frac{1}{n}\sum_{i=1}^{n}\max(0,\ 1 - y_i \hat{y}_i)$, where the target labels are conventionally taken to be $-1$ and $1$. This function is also commonly used in binary classification.
- Sparse categorical cross entropy: A generalization of binary cross entropy, useful when the target labels are integers.
Average loss is computed as an average (mean) of the loss over all the data points. The closer to zero the average loss is, the smaller the error in predictions. The choice of loss function depends on the specific task, the nature of the inputs and outputs, and many other factors, which will be discussed in more detail in Introduction to Deep Learning.
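Before working through an example by hand, it may help to see these loss functions in code. The following sketch computes MSE, binary cross entropy, and hinge loss for a small set of made-up predictions and targets (the arrays below are illustrative values, not data from this section; for hinge loss, the 0/1 labels are converted to the conventional -1/+1 form).

Python Code

import numpy as np

# Made-up predictions and 0/1 targets for illustration
y_hat = np.array([0.9, 0.2, 0.7])
y_true = np.array([1, 0, 1])

# Mean squared error
mse = np.mean((y_true - y_hat) ** 2)

# Binary cross entropy
bce = -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

# Hinge loss, using -1/+1 labels as is conventional
y_pm = 2 * y_true - 1   # convert {0, 1} labels to {-1, +1}
hinge = np.mean(np.maximum(0, 1 - y_pm * y_hat))

print(f"MSE: {mse:.4f}, BCE: {bce:.4f}, Hinge: {hinge:.4f}")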
Example 7.3
Problem
The outputs of a neural network on two data points are provided in Table 7.1:
Data Point | Predicted ($\hat{y}_i$) | Actual/Target ($y_i$)
---|---|---
1 | |
2 | |
Compute the average loss using the following loss functions.
- MSE
- Binary cross entropy loss
- Hinge loss
Solution
Here, the number of data points is $n = 2$. Substitute the predicted and actual values from Table 7.1 into each loss function as follows.
- MSE
Data point 1: $(y_1 - \hat{y}_1)^2$
Data point 2: $(y_2 - \hat{y}_2)^2$
Average loss: $\frac{1}{2}\left[(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2\right]$
- Binary cross entropy loss
Shortcut: If $y_i = 0$, the first term, $y_i \log(\hat{y}_i)$, is zero, and the second term reduces to $\log(1 - \hat{y}_i)$. If $y_i = 1$, the second term, $(1 - y_i)\log(1 - \hat{y}_i)$, is zero, and the first term reduces to $\log(\hat{y}_i)$.
Data point 1: $-\left[y_1 \log(\hat{y}_1) + (1 - y_1)\log(1 - \hat{y}_1)\right]$
Data point 2: $-\left[y_2 \log(\hat{y}_2) + (1 - y_2)\log(1 - \hat{y}_2)\right]$
Average loss: the mean of the two values above
- Hinge loss
First, compute the values of $1 - y_i \hat{y}_i$. Then, if any of these are negative, use 0 in their place in the sum.
Data point 1: $\max(0,\ 1 - y_1 \hat{y}_1)$
Data point 2: $\max(0,\ 1 - y_2 \hat{y}_2)$
Average loss: $\frac{1}{2}\left[\max(0,\ 1 - y_1 \hat{y}_1) + \max(0,\ 1 - y_2 \hat{y}_2)\right]$
Note: The fact that each of the three loss functions gives a different result reflects the different ways each function penalizes errors, but with only two data points in this example, it is not advisable to place much meaning in these numbers.
Gradient Descent and Backpropagation
Since the goal of training a neural network is to reduce the value of error or loss, we essentially need to solve a (very large) optimization problem. Suppose that you are training a neural network. Consider all the weights and biases—of which there may be thousands, millions, or even more!—currently in the network as parameters that may be changed to improve the model. On a forward pass with an arbitrary input vector $X$, the output $\hat{y}$ is obtained and compared against the actual output $y$ via a loss function, $L$. How would you be able to use this information to change the parameters (weights and biases) so that on the next pass with the same input $X$, the output provides a more accurate representation of the actual output $y$, in the sense that the value of $L$ decreases? The key is to regard the whole neural network together with the loss calculation as a function $F(W, B)$ of the weights and biases and then use techniques such as gradient descent (which we discuss briefly below) to find a minimum value. Here, $W$ and $B$ are capitalized because they represent the multitude of weights and biases over the entire network, not just those from a single neuron or layer. Figure 7.7 shows a schematic diagram illustrating the function $F$ as a composite process.
As a very simple example, suppose that a neural network has only one neuron with weight $w$ and bias $b$. Input $x$ is fed into the neuron, the weight and bias are applied to $x$, and then an activation function is applied to the result to produce an output $\hat{y}$. We'll use the sigmoid, $\sigma(x) = \frac{1}{1 + e^{-x}}$, as the activation function for simplicity. Now, the result, $\hat{y} = \sigma(wx + b)$, is compared to the true value $y$, using a cost function, which for this example will be MSE. The composite function looks like this:

$F(w, b) = \left(y - \sigma(wx + b)\right)^2$
If there were more neurons, weights, and biases, then the formula would look much more complicated, as we would need to keep track of many more variables and parameters, but the basic concept remains the same. We want to find values of $w$ and $b$ that minimize the value of $F(w, b)$. This takes many repeated cycles of adjusting weights and biases based on the values of the loss function. Each complete training cycle is called an epoch.
Now suppose that $w = 0.5$ and $b = 0.25$. Consider a single point $(x, y)$ from the set of training data. The neural network computes a prediction, $\hat{y}$, as follows:

$\hat{y} = \sigma(wx + b) = \sigma(0.5x + 0.25)$

The loss with respect to the given actual output $y$ is:

$F(0.5, 0.25) = (y - \hat{y})^2$
Thus, if the value of $F(0.5, 0.25)$ is nonzero, there is some amount of loss on the prediction just made, and so the values of $w$ and $b$ should be adjusted. Instead of trying to adjust $w$ and $b$ individually, let's find the direction from the point $(0.5, 0.25)$ that would result in the fastest decrease of the error $F$. If we label the direction of steepest decrease $\mathbf{v}$, then we could move from the point $(0.5, 0.25)$ to a new point, $(0.5, 0.25) + \epsilon\mathbf{v}$, where $\epsilon$ is a small number that affects the learning rate of the neural network. This is the main idea behind gradient descent. In our example, we could find the direction of steepest descent by simply looking at how the surface $F(w, b)$ is slanted near the point $(0.5, 0.25)$. Figure 7.8 indicates the best direction with an arrow. (However, in practice, there will be no way to visualize the graph of loss with respect to the weights and biases because the points exist in very high-dimensional spaces.) The details of gradient descent and how it is used in neural networks fall outside the scope of this textbook. (See this IBM article for a great introduction to this topic.)
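To see the idea in action without any calculus, here is a sketch that estimates the direction of steepest descent for $F(w, b) = (y - \sigma(wx + b))^2$ by finite differences and takes a few small steps. The training point $(x, y) = (1, 1)$ is a made-up value for illustration; only the starting parameters $w = 0.5$ and $b = 0.25$ come from the example above.

Python Code

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, y = 1.0, 1.0     # made-up training point
w, b = 0.5, 0.25    # starting parameters from the example

def F(w, b):
    # Squared-error loss of the one-neuron network
    return (y - sigmoid(w * x + b)) ** 2

eps = 0.5   # learning rate
h = 1e-6    # step size for numerical partial derivatives

for step in range(5):
    # Estimate the gradient of F by finite differences
    dF_dw = (F(w + h, b) - F(w, b)) / h
    dF_db = (F(w, b + h) - F(w, b)) / h
    # Move a small amount in the direction of steepest decrease
    w -= eps * dF_dw
    b -= eps * dF_db
    print(f"step {step + 1}: w = {w:.4f}, b = {b:.4f}, loss = {F(w, b):.6f}")

The printed loss decreases on every step, which is exactly the behavior gradient descent is designed to produce.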
One final point to mention is that gradient descent only works when the function $F$ is differentiable, meaning that there are no corners, cusps, or points of discontinuity on the graph of $F$. Since $F$ is a composite of many component functions, each component needs to be differentiable. Activation functions such as step, ReLU, and LReLU cannot be used directly with gradient descent. There are other methods available to backpropagate and train those kinds of networks. Alternatively, the non-smooth activation functions could be approximated by smooth versions—for example, sigmoid or tanh in place of step, or softplus in place of ReLU.
Training a Neural Network Using Backpropagation in Python with TensorFlow
The Python library TensorFlow, originally developed by the Google Brain team in 2015, can be used to build a neural network with backpropagation. (A tensor is a multidimensional array, generalizing the concept of a vector.)
Let’s set up a neural network and train it to recognize the handwritten digits in the MNIST database. This dataset, called mnist_784, can be loaded automatically using the fetch_openml command. The target labels must be converted to integers explicitly. Then the data is split into training (80%) and testing (20%) sets. The features are standardized using StandardScaler. Next, the neural network model is created using tf.keras.Sequential. The model can be quite complex, but we will stick to the simplest case—one input layer with exactly as many neurons as features (784) and an output layer with 10 neurons, one for each digit. There are no hidden layers in this model. The command tf.keras.layers.Dense builds the output layer of 10 neurons that is fully connected to the input layer of 784 neurons (the value of X_train_scaled.shape[1]). The softmax activation function is used, but other functions like ReLU are also available. The parameter optimizer='adam' in the function model.compile refers to an algorithm, Adaptive Moment Estimation, used to update the weights and biases of the model during training, the details of which are beyond the scope of this text.
Note: This code takes a while to execute—up to 5 minutes! Even with no hidden layers, the backpropagation takes some time to complete. When you run the code, you should see progress bars for each epoch (not shown in the Python output in the feature box).
Python Code
# Import the libraries
import tensorflow as tf
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load MNIST Digits dataset
mnist = fetch_openml('mnist_784', version=1)
# Split the dataset into training and testing sets 80/20
X_train, X_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=0.2, random_state=42)
# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Cast the target variable as an integer (int32)
y_train = y_train.astype('int32')
y_test = y_test.astype('int32')
# Define the neural network architecture
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='softmax', input_shape=(X_train_scaled.shape[1],))
])
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
history = model.fit(X_train_scaled, y_train, epochs=10, batch_size=32, validation_split=0.2)
# Evaluate the model on test data
test_loss, test_accuracy = model.evaluate(X_test_scaled, y_test)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)
The resulting output will look like this:
Test Loss: 0.4051963686943054
Test Accuracy: 0.9189285635948181
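To experiment with a deeper model, a hidden layer can be inserted between the input and output layers. The following sketch shows how the model definition above might change; the choice of 128 neurons and the ReLU activation are arbitrary, and the compile, fit, and evaluate steps stay exactly the same.

Python Code

# A variant of the model with one hidden layer (sizes chosen arbitrarily)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu',
                          input_shape=(X_train_scaled.shape[1],)),  # hidden layer
    tf.keras.layers.Dense(10, activation='softmax')                 # output layer
])

Adding a hidden layer typically improves test accuracy somewhat, at the cost of more parameters to train and a longer running time.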
Recurrent Neural Networks
Imagine if you could predict the stock market! What if you had a model that could take previous days’ prices of a stock and forecast the price of the stock tomorrow? Then you could maximize your profits by buying low and selling high with full knowledge of when those high and low points in price would occur. Given the extremely unpredictable nature of the stock market, it is unlikely that a simple forecasting model would be useful, so let’s consider a neural network. You might set it up so that the input vector $x$ holds 14 days of previous price data and the output $y$ would be the predicted price of the stock on the next day. After a while, you decide to modify the design of the neural network to take in 30 days of price data. This would involve starting over, training and testing the new model from scratch. In fact, any time you decided to change the number of inputs, you would have to create a new model and go through the training and testing phases again and again. It would be much more convenient to have a model that can take different amounts of input data. Moreover, if the model first trains on 14 days of previous price data, then it would be desirable for the network to “remember” that training so when more data becomes available (30 days, 60 days, a year, etc.), it can simply improve its predictions from the baseline already established.
This flexibility is available in recurrent neural networks (RNNs). An RNN is a neural network that incorporates feedback loops, which are internal connections from one neuron to itself or among multiple neurons in a cycle. Essentially, the output of a neuron is sent back as input to itself as well as going forward to the neurons in the next layer.
In the simplest such model, one feedback loop connects a single neuron with itself. The output of the neuron is modified by a connecting weight, and the result is included in the sum making up the input of the same neuron. So when a new signal comes into the neuron, the previous output, scaled by the connecting weight, gets added to it. If the connecting weight is positive, then this generally causes the neuron to become more active over time. On the other hand, if the connecting weight is negative, then a negative feedback loop exists, which generally dampens the activity of the neuron over time. The connecting weights of an RNN are trained alongside all the other weights and biases of the network using a variation of backpropagation called backpropagation through time (or BPTT). See Figure 7.9 for a simple diagram illustrating an RNN model.
The RNN model’s feedback loops provide a simple memory, but how do they allow for different amounts of input values? The key is to feed in single (or small batches of) data points sequentially. For example, if $x = (x_1, x_2, \ldots, x_n)$ represents the prices of a stock on days 1, 2, 3, up to $n$, and the actual value of the stock on day $n + 1$ is $y$, the RNN will take in just one element at a time (typically in chronological order) and adjust parameters based on the accuracy or loss of the prediction $\hat{y}_1$. On the next pass, it takes in $x_2$ and produces a new prediction, $\hat{y}_2$. Keep in mind, the feedback loops allow the RNN to remember (in some capacity) how it performed on the first input, and so the prediction $\hat{y}_2$ is based on the two data points, $(x_1, x_2)$. Similarly, when $x_3$ is fed into the model on the third pass, the RNN will produce a new prediction, $\hat{y}_3$, that is based on the three data points, $(x_1, x_2, x_3)$. Thus, the feedback loops of an RNN provide a mechanism for the input to consist of any number of sequential data points. This makes RNNs especially useful for time series data. (See Time Series and Forecasting for an introduction to time series.)
Unrolling an RNN
The effect of feedback loops may be visualized by “unrolling” the RNN. Consider the simplest case of a single feedback loop from one neuron to itself. The effect of the connecting weight is equivalent to connecting the neuron to a copy of the same RNN (with identical weights and biases). Of course, since the second copy of the RNN has the same feedback loop, it can be unrolled to a third, fourth, fifth copy, etc. An “unrolled” RNN with $n$ copies of the neural net has $n$ inputs. The effect of feeding the entire vector $x = (x_1, x_2, \ldots, x_n)$ into the unrolled model is equivalent to feeding the data points sequentially into the RNN. Figure 7.10 shows how this works.
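A bare-bones sketch of this idea in code: a single recurrent neuron processes a sequence one element at a time, reusing the same weights at every step, so the same network handles inputs of any length. All the weight values here are made up for illustration.

Python Code

import numpy as np

# Made-up weights for one recurrent neuron
w_in, w_fb, b = 0.8, 0.5, 0.1   # input weight, feedback (connecting) weight, bias

def run_rnn(sequence):
    # Feed the sequence in one element at a time; h carries the "memory"
    h = 0.0
    for x_t in sequence:
        h = np.tanh(w_in * x_t + w_fb * h + b)
    return h   # the final state reflects the entire sequence

# The same neuron accepts sequences of different lengths
print(run_rnn([0.2, 0.5]))
print(run_rnn([0.2, 0.5, 0.9, 0.1]))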
Vanishing/Exploding Gradient Problem
It is harder to train an RNN because the model can be very sensitive to changes in the connecting weights. During the training phase, gradient descent and a modified version of backpropagation (BPTT, as mentioned earlier) are used to adjust all the weights and biases. However, because of the feedback loops in an RNN, the effects of the connecting weights are compounded many times and can become very large. This causes algorithms like gradient descent to perform very poorly because the large compounded weights cause proportionally large changes in parameters, as opposed to the tiny changes required to home in on minimum points. This is known as the exploding gradient problem. On the other hand, if connecting weights are too small to begin with, then their compounded effects quickly approach zero, which is called the vanishing gradient problem. Both issues are important to be aware of and to address when working with RNNs; otherwise, the accuracy of your model may be compromised.
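The compounding effect is easy to demonstrate numerically. During BPTT, the connecting weight is multiplied into the error signal once per time step, so its repeated powers control whether the signal blows up or dies out; the weights 1.5 and 0.5 below are made-up examples.

Python Code

# Repeated multiplication by a connecting weight over 50 time steps
for w in (1.5, 0.5):
    signal = 1.0
    for _ in range(50):
        signal *= w
    print(f"connecting weight {w}: contribution after 50 steps = {signal:.3e}")
# weight 1.5 gives ~6.4e+08 (exploding); weight 0.5 gives ~8.9e-16 (vanishing)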
Long Short-Term Memory Networks
A long short-term memory (LSTM) network is a type of RNN designed to overcome the problems of exploding or vanishing gradients by incorporating memory cells that can capture long-term dependencies better than simple feedback loops can. LSTMs were introduced in 1997 and have since become widely used in various applications, including natural language processing, speech recognition, and time series forecasting.
LSTMs are generally more stable and easier to train than traditional RNNs, making them particularly well-suited for forecasting time series data with long-range trends and cyclic behaviors, as well as for working with less predictable sequential data, as in natural language modeling, machine translation, and sentiment analysis. RNNs and LSTMs are a stepping stone to the very sophisticated AI models that we will discuss in the next section.
RNNs in Python
Consider the dataset MonthlyCoalConsumption.csv, which we analyzed using basic time series methods in Time Series and Forecasting. The dataset contains observations up to the end of 2022. Suppose you want to predict the monthly coal consumption in each month of the next year using a recurrent neural network. First, we will set up a simple RNN in Python. The library TensorFlow can be used to build an RNN. We will use pandas to load in the data and numpy to put the data in a more usable form (np.array) for the model. The RNN layer is defined by the command tf.keras.layers.SimpleRNN, where units specifies the number of neurons in the layer and activation='tanh' sets the activation function.
Python Code
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
# Load and preprocess the data
data = pd.read_csv('MonthlyCoalConsumption.csv')
values = np.array(data['Value'])
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_values = scaler.fit_transform(values.reshape(-1, 1))
# Prepare input sequences and target values
window_size = 12 # Number of months in each input sequence
X = []
y = []
for i in range(len(scaled_values) - window_size):
    X.append(scaled_values[i:i+window_size])
    y.append(scaled_values[i+window_size])
X = np.array(X)
y = np.array(y)
# Split data into training and testing sets
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
# Define the RNN architecture
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(units=32, activation='tanh',
                              input_shape=(window_size, 1)),
    tf.keras.layers.Dense(1)
])
# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=16)
# Evaluate the model
loss = model.evaluate(X_test, y_test)
print("Test Loss:", loss)
The resulting output will look like this:
Test Loss: 0.018054470419883728
The test loss is very low, and so we expect the model to be very accurate. Next, the model can be used to predict the unknown values in the next 12 months. Note: Since the original data was rescaled, the predictions made by the RNN need to be converted back into unnormalized data using the command scaler.inverse_transform.
Python Code
# Predict future values
future_months = 12
last_window = scaled_values[-window_size:].reshape(1, window_size, 1)
predicted_values = []
for _ in range(future_months):
    prediction = model.predict(last_window)
    predicted_values.append(prediction[0, 0])
    # Slide the window forward: drop the oldest value, append the new prediction
    last_window = np.append(last_window[:, 1:, :], prediction.reshape(1, 1, 1), axis=1)
# Inverse transform the predictions to get actual consumption values
predicted_values = scaler.inverse_transform(np.array(predicted_values).reshape(-1, 1))
print("Predicted Values for the Next 12 Months:\n", predicted_values.flatten())
The resulting output will look like this:
Predicted Values for the Next 12 Months:
[49227.05 49190.258 41204.16 30520.389 29591.129 34516.586 42183.016
38737.39 32026.324 23482.027 23381.914 30206.467]
Finally, here is the plot containing the original data together with the predicted values. Note: There are a lot of lines of code devoted to formatting the data and axis labels before graphing. This is, unfortunately, the nature of the beast. The full details of what each piece of code does will not be explained thoroughly in this text.
Python Code
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator ## for graph formatting
from matplotlib.ticker import FuncFormatter ## for formatting y-axis
# Function to format the Y-axis values
def y_format(value, tick_number):
    return f'{value:,.0f}'
# Plot original time series data and predicted values
plt.figure(figsize=(10, 6))
plt.plot(data['Month'], data['Value'], label='Original Data', color='blue')
plt.plot(range(len(data['Value']), len(data['Value']) + future_months), predicted_values,
         label='Predicted Values', color='red')
# Create a list for the months
pd_months = pd.date_range(start='2016-01-01', end='2023-12-01', freq='MS')  # original data plus the 12 predicted months
xticks_positions = range(0, len(pd_months), 6)  # Positions to display ticks (every 6 months)
xticks_labels = [pd_months[pos].strftime('%Y-%m') for pos in xticks_positions]  # Labels corresponding to positions
plt.xlabel('Month')
plt.ylabel('Value')
plt.title('Monthly Consumption of Coal from 2016 to 2022 and Predicted Values for 2023')
plt.legend()
plt.xticks(ticks=xticks_positions, labels=xticks_labels, rotation=45)
# Apply the formatter to the Y-axis
plt.gca().yaxis.set_major_formatter(FuncFormatter(y_format))
plt.show()
The resulting output is a plot showing the original monthly coal consumption data (blue) together with the 12 predicted values for 2023 (red).
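As a final experiment, the SimpleRNN layer could be swapped for an LSTM layer, which the earlier discussion suggests should handle long-range patterns more stably. Only the model definition changes in the following sketch; the unit count of 32 simply mirrors the SimpleRNN model above, and the rest of the pipeline (scaling, windowing, prediction) is reused as-is.

Python Code

# Same pipeline as before, but with an LSTM layer in place of the SimpleRNN
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(units=32, activation='tanh',
                         input_shape=(window_size, 1)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, epochs=50, batch_size=16)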