Learning Outcomes
By the end of this section, you should be able to:
- 7.1.1 Define neural networks and discuss the types of problems for which they may be useful.
- 7.1.2 Summarize the roles of weights and biases in a neural network.
- 7.1.3 Construct a simple neural network.
Just imagine what must go on inside the human brain when tasked with recognizing digits, given the varying ways the digits 0-9 can be hand-drawn. Human readers can instantly classify these digits, even from an early age. The goal of neural networks is for a computer algorithm to be able to classify these digits as well as a human.
Exploring Further
MNIST Database
The MNIST (Modified National Institute of Standards and Technology) database is a large dataset of handwritten digits that is frequently used for training various image processing and machine learning systems. This video presentation by Michael Garris, senior scientist, provides an overview. You may download the MNIST dataset files directly for further exploration.
In this section, we introduce simple models of neural networks and discover how they work. Many technical details are deferred to later sections or are outside the scope of this text.
What Is a Neural Network?
A neural network is a structure made up of components called neurons, which are individual decision-making units that take some number of inputs and produce an output, as in Figure 7.2. How the output is determined by the inputs could be quite complex. Moreover, any two neurons may behave quite differently from each other on the same inputs.
The neural network itself may consist of hundreds, thousands, or even millions of neurons connected to each other in layers, or groups of neurons that all receive the same inputs from previous layers and forward signals in aggregate to the next layer. There are always at least two layers, the input layer (containing neurons that accept the initial input data) and output layer (containing the neurons that are used to interpret the answer or give classification information), together with some number of hidden layers (layers between the input and output layers). Figure 7.3 shows a simple neural network diagram for reference.
The main purpose of a neural network is to classify complex data. Problems for which neural networks are especially well suited include the following:
- Image recognition, including facial recognition, identifying handwritten letters and symbols, and classifying parts of images. This is a huge area of innovation, with powerful tools such as TensorFlow developed by Google and PyTorch developed by Meta.
- Speech recognition, such as Google’s Cloud Speech-to-Text service, offers accurate transcription of speech and translation for various languages.
- Recommendation systems, which are used to serve ads online based on each user’s browsing habits, have been developed and used by Amazon, Meta, Netflix, and many other large companies to reach target markets.
- Anomaly detection has been developed to aid in fraud detection, data security, and error analysis/correction by finding outliers in large, complex datasets. An example is Microsoft’s Azure Anomaly Detector, which can detect anomalies in time series data.
- Autonomous vehicles and robotics, including Tesla’s Autopilot technology, are becoming more and more prominent as automation alleviates some of the routine aspects of daily life and leads to increased efficiencies in business, manufacturing, and transportation.
- Generative art, examples of which include visual art, music, video, and poetry, leverages vast stores of human creative output to produce novel variations.
- Predictive text, including natural language processing models such as ChatGPT (see Convolutional Neural Networks).
Exploring Further
TensorFlow
If you want to get your feet wet with neural networks, check out this interactive web-based neural network tool, called TensorFlow Playground, which uses TensorFlow to train and update outputs in real time. There, you can choose a dataset from a list, adjust the number of hidden layers and neurons per layer by clicking the plus (+) and minus (−) buttons, and adjust the learning rate, choice of activation function, and other parameters (topics that we will learn more about in the rest of the chapter). Then click the “play” button to start training the model and watch as it learns how to classify the points in your chosen dataset! If you want to start over, just click “reset” and start from scratch!
Neurons, Weights, Biases, and Activation Functions
The way neurons work in a neural network is conjectured to be similar, or analogous, to how neurons work in the brain. The input of the neural network is fed into the neurons of the input layer. Each neuron then processes it and produces an output, which is in turn pushed to the neurons in the next layer. Individual neurons only send signals, or activate, if they receive the appropriate input required to activate. (Activation is the process of sending an output signal after having received appropriate input signals.) After passing through some number of hidden layers, the output of the last hidden layer feeds into the output layer. Lastly, the output of this final layer is interpreted based on the nature of the problem. If there is a single output neuron, then the interpretation could be true if that neuron is activated or false if not. For neural networks used to classify input into various classes (e.g., number recognition), there is usually one output neuron per class. The one that is most activated would indicate the classification, as shown in Figure 7.4.
Each connection from one neuron to another has two parameters associated with it. The weight $w$ is a value that is multiplied by the incoming signal, essentially determining the strength of the connection. The bias $b$ is a value that is added to the weighted signal, making the neuron more likely (or less likely, if $b$ is negative) to activate on any given input. The values of $w$ and $b$ may be positive, negative, or zero. Once the weight and bias are applied, the result is run through an activation function, which is a non-decreasing function that determines whether the neuron activates and, if so, how strongly. (You’ll see this later in Figure 7.6.)
Consider the simplest case of all, a neuron with a single input $x$ and output $y$. Then the flow of signal through the neuron follows the formula $y = f(wx + b)$, where $f$ is the activation function. For example, suppose that the input value is $x = 2$, with weight $w = 0.5$ and bias $b = 1$. Then the signal would first be combined as $wx + b = (0.5)(2) + 1 = 2$. This value would be fed as input to the activation function to produce $y = f(2)$. In the next section, we will discuss various activation functions used in neural networks.
When there are multiple inputs, $x_1, x_2, \ldots, x_n$, each will be affected by its own weight, $w_1, w_2, \ldots, w_n$. Typically, bias is thought of as a property of the neuron itself, and so the bias affects all the inputs in the same way. To be more precise, first the inputs are multiplied by their individual weights, the results are summed, and then the bias is added. Finally, the activation function is applied to obtain the output signal: $y = f(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)$.
The weights and biases are parameters that the neural network learns during the training process, a topic we will explain in Backpropagation.
A typical neural network may have hundreds or thousands of inputs for each neuron, and so this equation can be difficult to work with. It would be more convenient to regard all the inputs as parts of a single mathematical structure, called a vector. A vector is simply an ordered list of numbers, that is, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$. (Note: In this text we use boldface letters to represent vectors. Also, the components of a vector are listed within a set of parentheses. Some texts vary on these notational details.) The number of components in a vector is called its dimension, so for example the vector $(3, -1, 0, 2, 7)$ has dimension 5. Certain arithmetic operations are defined on vectors.
- Vectors of the same dimension can be added componentwise: If $\mathbf{u} = (u_1, u_2, \ldots, u_n)$ and $\mathbf{v} = (v_1, v_2, \ldots, v_n)$, then $\mathbf{u} + \mathbf{v} = (u_1 + v_1, u_2 + v_2, \ldots, u_n + v_n)$.
- Any real number $c$ can be multiplied by a vector: $c\mathbf{v} = (cv_1, cv_2, \ldots, cv_n)$.
- The dot product of two vectors of the same dimension results in a real number (not a vector) and is defined by $\mathbf{u} \cdot \mathbf{v} = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n$.
If the inputs and weights are regarded as vectors, $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{w} = (w_1, w_2, \ldots, w_n)$, respectively, then the formula may be re-expressed more concisely as $y = f(\mathbf{w} \cdot \mathbf{x} + b)$.
For example, if $\mathbf{x} = (1, 2, 3)$, $\mathbf{w} = (0.5, -1, 2)$, and $b = 1$, then
$\mathbf{w} \cdot \mathbf{x} + b = (0.5)(1) + (-1)(2) + (2)(3) + 1 = 0.5 - 2 + 6 + 1 = 5.5$
So in this example, the output would be $y = f(5.5)$, the exact value of which depends on which activation function is chosen.
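Here is a minimal NumPy sketch of this computation, using the values from the example above, with ReLU chosen arbitrarily as the activation function:
Python Code
import numpy as np

def relu(z):
    # Rectified linear unit: 0 for negative inputs, the input itself otherwise
    return np.maximum(0, z)

# Input vector, weight vector, and bias from the example above
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 2.0])
b = 1.0

# Combine the signals: dot product of weights and inputs, plus the bias
z = np.dot(w, x) + b   # 5.5

# Apply the activation function to get the neuron's output
y = relu(z)
print(z, y)            # 5.5 5.5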
Types of Activation Functions
Activation functions come in many types. Here are just a few of the most common activation functions.
- Step function, $f(x) = 1$ if $x \geq t$ and $f(x) = 0$ if $x < t$. The value of $t$ serves as a threshold. The neuron only activates when the input is at least equal to the parameter $t$.
- Sigmoid function, $f(x) = \frac{1}{1 + e^{-x}}$. Note, this is the same sigmoid function used in logistic regression (see Classification Using Machine Learning). Output values tend to be close to 0 when $x$ is negative and close to 1 when $x$ is positive, with a smooth transition in between.
- Hyperbolic tangent (tanh) function, $f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Output values have the same sign as $x$, with a smooth transition through 0.
- Rectified linear unit (ReLU) function, $f(x) = \max(0, x)$. The ReLU function is 0 for negative values and equal to the input when $x$ is positive.
- Leaky ReLU function, $f(x) = \max(\alpha x, x)$, for some small positive parameter $\alpha < 1$. Leaky ReLU acts much like ReLU except that the values get progressively more negative as $x$ gets more negative. Often, the optimal “leakiness” parameter $\alpha$ is determined during the training phase.
- Softplus function, $f(x) = \ln(1 + e^{x})$, which is a smoothed version of the ReLU function.
Figure 7.5 shows the graphs of the listed functions. A key feature of activation functions is that they are nonlinear, meaning that their graphs are not simply straight lines of the form $f(x) = mx + b$.
Another activation function that is important in neural networks, softmax, takes a vector of real number values and yields a vector of values scaled into the interval between 0 and 1, which can be interpreted as a discrete probability distribution. (Recall from Discrete and Continuous Probability Distributions that discrete probability distributions provide measures of probability or likelihood that each of a finite number of values might occur.) The formula for softmax is:
$\text{softmax}(\mathbf{z})_i = \dfrac{e^{z_i}}{e^{z_1} + e^{z_2} + \cdots + e^{z_n}}, \quad i = 1, 2, \ldots, n,$
where $\mathbf{z} = (z_1, z_2, \ldots, z_n)$. We will encounter softmax activation in an example in Backpropagation.
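These activation functions are straightforward to implement directly. Here is a minimal NumPy sketch of several of them, including softmax; the function names and sample values are our own choices (ReLU appeared in the earlier sketch, and NumPy provides np.tanh directly):
Python Code
import numpy as np

def step(x, t=0.0):
    # 1 when x is at least the threshold t, otherwise 0
    return np.where(x >= t, 1.0, 0.0)

def sigmoid(x):
    # Smoothly maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs are scaled by a small slope alpha
    return np.maximum(alpha * x, x)

def softplus(x):
    # A smooth approximation of ReLU
    return np.log(1.0 + np.exp(x))

def softmax(z):
    # Exponentiate each component and normalize so the outputs sum to 1
    exp_z = np.exp(z - np.max(z))   # subtracting the max avoids numerical overflow
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, -1.0])
print(softmax(z))     # roughly [0.71, 0.26, 0.04], a discrete probability distribution
print(sigmoid(0.0))   # 0.5, the midpoint of the sigmoid's smooth transition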
Example 7.1
Problem
A simple neuron has four inputs, $x_1, x_2, x_3, x_4$, and one output, $y$. The weights of the four inputs are $w_1$, $w_2$, $w_3$, and $w_4$. The bias of the neuron is $b$. Find the output $y$ for the given inputs in each of the following cases.
- a. Activation function: ReLU.
- b. Activation function: sigmoid.
- c. Activation function: step, with threshold $t$.
- d. Activation function: softplus.
Solution
Setup for each part: with the given inputs, weights, and bias, the combined signal works out to $w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b = 0.32$.
- a. $y = \max(0, 0.32) = 0.32$, since $0.32 > 0$.
- b. $y = \frac{1}{1 + e^{-0.32}} \approx 0.579$. (Note, 0.32 was already computed in part a.)
- c. $y = 1$ if $0.32 \geq t$, and $y = 0$ otherwise, since the step function compares the combined signal to the threshold $t$. (Note, 0.32 was already computed in part a.)
- d. $y = \ln(1 + e^{0.32}) \approx 0.866$.
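For a quick numerical check of these values, each activation function can be evaluated at the combined signal 0.32; the threshold $t = 0.5$ used for the step function below is only an illustrative choice:
Python Code
import numpy as np

z = 0.32  # combined signal: weighted sum of the inputs plus the bias

print(max(0.0, z))                 # part a, ReLU: 0.32
print(1.0 / (1.0 + np.exp(-z)))    # part b, sigmoid: about 0.579
print(1.0 if z >= 0.5 else 0.0)    # part c, step with illustrative threshold t = 0.5: 0.0
print(np.log(1.0 + np.exp(z)))     # part d, softplus: about 0.866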
In the next section, we explore neural networks that use the simplest of the activation functions, the step function.
Perceptrons
Although the idea of using structures modeled after biological neurons had been around prior to the development of what we would now consider neural networks, a significant breakthrough occurred in the middle of the 20th century. In 1957, Frank Rosenblatt developed the perceptron, a type of single-layer neural network designed for binary classification tasks—in other words, a neural network whose output could be “true” or “false” (see Figure 7.7). After initial successes, progress in AI faced challenges due to limited capabilities of AI programs and the lack of computing power, leading to a period of minimal progress, called the “Winter of AI.” Despite these setbacks, new developments occurred in the 1980s leading to the multilayer perceptron (MLP), which serves as the basic paradigm for neural networks having multiple hidden layers.
The single-layer perceptron learns weights through a simple learning rule, known as the perceptron learning rule.
- Initialize the weights ($w_i$) and bias ($b$) to small random values.
- For each input ($\mathbf{x}$) in the training set:
  - Compute the prediction $\hat{y} = f(\mathbf{w} \cdot \mathbf{x} + b)$, where $f$ is the step function.
  - Find the error $e = y - \hat{y}$, where $y$ is the true or desired value corresponding to $\mathbf{x}$. Note that $e$ can be positive or negative. A positive $e$ implies the weights and/or bias are too low and need to be raised; a negative $e$ implies the opposite.
  - Update the weights and bias according to the formulas $w_i \leftarrow w_i + \eta e x_i$ and $b \leftarrow b + \eta e$, where $\eta$ is a small positive constant that controls the learning rate. Often the change of weights and bias will be small, not immediately causing a large change in the output. However, with repeated training, the perceptron will learn appropriate values of $\mathbf{w}$ and $b$ to achieve the desired output.
For example, suppose a perceptron with three inputs currently has weights $\mathbf{w} = (w_1, w_2, w_3)$ and bias $b$, and that on input $\mathbf{x} = (x_1, x_2, x_3)$ the combined signal $\mathbf{w} \cdot \mathbf{x} + b$ is above the threshold, so the step function outputs $\hat{y} = 1$.
Suppose that the true output should have been $y = 0$. So there’s an error of $e = y - \hat{y} = -1$. If the learning rate is $\eta$, then the weights and bias will be updated as follows: $w_i \leftarrow w_i - \eta x_i$ and $b \leftarrow b - \eta$.
On the same input, the perceptron now computes a smaller combined signal $\mathbf{w} \cdot \mathbf{x} + b$, one that falls below the threshold.
In this simple example, the output changed from 1 to 0, eliminating the error. However, there is no guarantee that the perceptron will classify all inputs without error, regardless of the number of training steps that are taken.
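To see the update rule in action, here is a minimal NumPy sketch of a single training step; the weights, bias, input, and learning rate shown are illustrative values chosen for this sketch:
Python Code
import numpy as np

def step(z, t=0.0):
    # Step activation: outputs 1 when the combined signal reaches the threshold t
    return 1 if z >= t else 0

# Illustrative current parameters and a single training example
w = np.array([0.5, -0.2, 0.4])   # current weights
b = 0.1                          # current bias
x = np.array([1.0, 2.0, 1.0])    # input
y_true = 0                       # desired (true) output
eta = 0.1                        # learning rate

# Prediction with the current parameters: step(w.x + b) = step(0.6) = 1
y_hat = step(np.dot(w, x) + b)

# Error and perceptron learning rule update
e = y_true - y_hat               # -1: the weights and bias are too high
w = w + eta * e * x              # w becomes (0.4, -0.4, 0.3)
b = b + eta * e                  # b becomes 0.0

# Prediction after the update: step(-0.1) = 0, so this example is now classified correctly
print(step(np.dot(w, x) + b))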
Example 7.2
Problem
Suppose the next data point in the training set is $\mathbf{x}$, and suppose that the correct classification for this point is $y = 1$. With the current weights $\mathbf{w}$ and bias $b$, use the perceptron learning rule with learning rate $\eta$ to update the weights and bias. Is there an improvement in classifying this data point?
Solution
The perceptron currently classifies this point as $\hat{y} = 0$, so the error is $e = y - \hat{y} = 1 - 0 = 1$, and the weights and bias are increased: $w_i \leftarrow w_i + \eta x_i$ and $b \leftarrow b + \eta$.
With the updated weights and bias we obtain $\hat{y} = 0$ again, which still misclassifies the point, so there is no improvement in accuracy. (However, at least the argument $\mathbf{w} \cdot \mathbf{x} + b$ has increased, so some progress has been made in training. In practice, many rounds of training may be necessary to achieve accurate results, with only incremental progress in each round.)
Building a Simple Neural Network
In What Is Data Science?, Example 1.7, we looked at the Iris Flower dataset to become familiar with dataset formats and structures. In this section, we will use the Iris dataset, iris.csv, to create a neural network with four inputs and one output neuron (essentially, a simple perceptron) to classify irises as either setosa or not setosa, according to four features: sepal length (SL), sepal width (SW), petal length (PL), and petal width (PW), all of which are measured in centimeters. The Iris dataset has 150 samples of irises, consisting of 50 of each species: setosa, versicolor, and virginica. For the purposes of this example, the first 100 rows of data were selected for training and testing, using 75% of the data for training and the remaining 25% for testing. (Recall from Decision-Making Using Machine Learning Basics that it is good practice to use about 70%–80% of the data for training, with the rest used for testing.) The network will be trained using the perceptron learning rule, but instead of using the step function, the activation function will be the hyperbolic tangent, $f(x) = \tanh(x)$, which is chosen simply for illustration purposes.
The Iris Flower Dataset
The Iris dataset was used by Sir R. A. Fisher for his 1936 paper “The Use of Multiple Measurements in Taxonomic Problems.” This work became a landmark study in the use of multivariate data in classification problems and frequently makes an appearance in data science as a convenient test case for machine learning and neural network algorithms.
The model for the neural network is shown in Figure 7.6.
The inputs $x_1, x_2, x_3, x_4$ will take the values of SL, SW, PL, and PW, respectively. There are four corresponding weights, $w_1, w_2, w_3, w_4$, and one bias value, $b$, which are applied to the input layer values, and then the result is fed into the tanh function to obtain the output. Thus, the formula used to produce the predicted values, denoted by $\hat{y}$, is:
$\hat{y} = \tanh(w_1 \cdot \text{SL} + w_2 \cdot \text{SW} + w_3 \cdot \text{PL} + w_4 \cdot \text{PW} + b)$
In this very simplistic model, we will interpret the value of $\hat{y}$ as follows:
- If $\hat{y} > 0$, classify the input as setosa.
- If $\hat{y} \leq 0$, classify the input as not setosa.
To measure classification error, we need to pick concrete target values to compare our outputs with. Let’s take the target value for classifying setosa to be $+1$, while the target for classifying not setosa is $-1$. This is reasonable, as the values of $\tanh(x)$ approach $+1$ as $x$ increases to positive infinity and $-1$ as $x$ decreases to negative infinity.
For example, if after training the ideal weights $w_1, w_2, w_3, w_4$ and bias $b$ have been found, then the response $\hat{y}$ on a new input is obtained by plugging its four measurements into the formula above.
If the resulting $\hat{y}$ is positive, classify the input as setosa. Of course, the natural question would be, how do we obtain those particular weights and bias values in the first place? This complicated task is best suited for a computer. We will use Python to train the perceptron.
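To make this concrete, here is a minimal sketch of the classification step in Python; the weights and bias below are hypothetical stand-ins for values that training would produce, not actual trained parameters:
Python Code
import numpy as np

# Hypothetical trained parameters (illustrative only)
w = np.array([0.2, 0.8, -1.2, -1.0])   # weights for SL, SW, PL, PW
b = 1.0                                 # bias

def classify_iris(features):
    # features: [sepal length, sepal width, petal length, petal width] in centimeters
    y_hat = np.tanh(np.dot(w, features) + b)
    return "setosa" if y_hat > 0 else "not setosa"

print(classify_iris([5.1, 3.5, 1.4, 0.2]))   # setosa-like measurements
print(classify_iris([6.3, 2.9, 5.6, 1.8]))   # virginica-like measurements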
The Python library sklearn.datasets has a function, load_iris, that automatically loads the Iris dataset, so you should not need to import the dataset iris.csv. Other modules imported from sklearn are Perceptron, used to build the perceptron model, and train_test_split, used to randomly split a dataset into a testing set and training set. Here, we will use a split of 75% train and 25% test, which is handled by the parameter test_size=0.25. We will also renormalize the data so that all features contribute equally to the model. The way to do this in Python is by using StandardScaler, which renormalizes the features of the dataset so that the mean is 0 and the standard deviation is 1. Finally, accuracy_score is used to compute the accuracy of the classification.
Python Code
# import the sklearn library (specific modules)
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
X = iris.data[:100] # Only taking first 100 samples for binary classification
y = iris.target[:100] # 0 for "setosa", 1 for "not setosa"
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize and train the perceptron
perceptron = Perceptron()
perceptron.fit(X_train_scaled, y_train)
# Make predictions on the test set
y_pred = perceptron.predict(X_test_scaled)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
The resulting output will look like this:
Accuracy: 1.0
Because the dataset was relatively small (100 rows) and setosa is easily separated from the other species by these four features, the perceptron model was able to classify the test data with 100% accuracy (“Accuracy: 1.0”). In the real world, when datasets are larger and less regular, we do not expect 100% accuracy. There could also be significant overfitting in the model, which would lead to high variance when classifying data not found in the original dataset.
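One quick way to check for overfitting is to compare accuracy on the training set with accuracy on the test set; a much higher training accuracy would suggest the model has memorized the training data rather than learned a general rule. A minimal check, reusing the objects created above, might look like this:
Python Code
# Compare training accuracy with test accuracy to look for overfitting
train_pred = perceptron.predict(X_train_scaled)
train_accuracy = accuracy_score(y_train, train_pred)
print("Training accuracy:", train_accuracy)
print("Test accuracy:", accuracy)
# Similar values suggest the model generalizes well to unseen rows;
# a large gap would be a warning sign of overfitting.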
Here is how to use the perceptron model we just created to classify new data (outside of the original dataset):
Python Code
# Assume you have new inputs stored in a variable called 'new_inputs'
new_inputs = [[5.1, 3.5, 1.4, 0.2], # Example input 1
[6.3, 2.9, 5.6, 1.8] # Example input 2
]
# Standardize the new inputs using the same scaler used for training
new_inputs_scaled = scaler.transform(new_inputs)
# Use the trained perceptron model to make predictions on the new inputs
predictions = perceptron.predict(new_inputs_scaled)
# Display the predictions
for i, prediction in enumerate(predictions):
print(f"Input {i+1} prediction: {'Setosa' if prediction == 0 else 'Not Setosa'}")
The resulting output will look like this:
Input 1 prediction: Setosa
Input 2 prediction: Not Setosa