Principles of Data Science

7.1 Introduction to Neural Networks


Learning Outcomes

By the end of this section, you should be able to:

  • 7.1.1 Define neural networks and discuss the types of problems for which they may be useful.
  • 7.1.2 Summarize the roles of weights and biases in a neural network.
  • 7.1.3 Construct a simple neural network.

Just imagine what must go on inside the human brain when tasked with recognizing digits, given the varying ways the digits 0-9 can be hand-drawn. Human readers can instantly classify these digits, even from an early age. The goal of neural networks is for a computer algorithm to be able to classify these digits as well as a human.

Exploring Further

MNIST Database

The MNIST (Modified National Institute of Standards and Technology) database is a large dataset of handwritten digits that is frequently used for training various image processing and machine learning systems. This video presentation by Michael Garris, senior scientist, provides an overview. You may download the MNIST dataset files directly for further exploration.

In this section, we introduce simple models of neural networks and discover how they work. Many technical details are deferred to later sections or are outside the scope of this text.

What Is a Neural Network?

A neural network is a structure made up of components called neurons, which are individual decision-making units that take some number of inputs x1, x2, ..., xn and produce an output y, as in Figure 7.2. How the output is determined by the inputs could be quite complex. Moreover, any two neurons may behave quite differently from each other on the same inputs.

Diagram of a neural network with a single neuron in the middle. There are three inputs labeled X1, X2, and X3 and one output labeled y.
Figure 7.2 Single Neuron in a Neural Network

The neural network itself may consist of hundreds, thousands, or even millions of neurons connected to each other in layers, or groups of neurons that all receive the same inputs from previous layers and forward signals in aggregate to the next layer. There are always at least two layers, the input layer (containing neurons that accept the initial input data) and output layer (containing the neurons that are used to interpret the answer or give classification information), together with some number of hidden layers (layers between the input and output layers). Figure 7.3 shows a simple neural network diagram for reference.

A diagram showing a simple artificial neural network with three input nodes, two hidden layers, and one output node. The input layer consists of three nodes labeled x1, x2, and x3. The first hidden layer has four nodes. The second hidden layer has four nodes. All nodes in each layer are connected to all nodes in the next layer. The output layer consists of one node labeled y.
Figure 7.3 Neural Network Diagram. This neural network has four layers, two of which are hidden layers. There are three input neurons and one output neuron. In this example, all nodes in adjacent layers are connected, but some neural network models may not include all such connections (see, for example, convolutional neural networks in Introduction to Deep Learning).

The main purpose of a neural network is to classify complex data. Problems for which neural networks are especially well suited include the following:

  • Image recognition, including facial recognition, identifying handwritten letters and symbols, and classifying parts of images. This is a huge area of innovation, with powerful tools such as TensorFlow developed by Google and PyTorch developed by Meta.
  • Speech recognition, such as Google’s Cloud Speech-to-Text service, offers accurate transcription of speech and translation for various languages.
  • Recommendation systems, which are used to serve ads online based on each user’s browsing habits, have been developed and used by Amazon, Meta, Netflix, and many other large companies to reach target markets.
  • Anomaly detection has been developed to aid in fraud detection, data security, and error analysis/correction by finding outliers in large, complex datasets. An example is Microsoft’s Azure Anomaly Detector, which can detect anomalies in time series data.
  • Autonomous vehicles and robotics, including Tesla’s Autopilot technology, are becoming more and more prominent as automation alleviates some of the routine aspects of daily life and leads to increased efficiencies in business, manufacturing, and transportation.
  • Generative art, examples of which include visual art, music, video, and poetry, leverages vast stores of human creative output to produce novel variations.
  • Predictive text, including natural language processing models such as ChatGPT (see Convolutional Neural Networks).

Exploring Further

TensorFlow

If you want to get your feet wet with neural networks, check out this interactive web-based neural network tool, called TensorFlow Playground, which uses TensorFlow to train and update outputs in real time. There, you can choose a dataset from a list, adjust the number of hidden layers and neurons per layer by clicking the plus (+) and minus (-) buttons, and adjust the learning rate, choice of activation function, and other parameters (topics that we will learn more about in the rest of the chapter). Then click the “play” button to start training the model and watch as it learns how to classify the points in your chosen dataset! If you want to start over, just click “reset” and start from scratch!

Neurons, Weights, Biases, and Activation Functions

The way neurons work in a neural network is conjectured to be similar, or analogous, to how neurons work in the brain. The input of the neural network is fed into the neurons of the input layer. Each neuron then processes it and produces an output, which is in turn pushed to the neurons in the next layer. Individual neurons only send signals, or activate, if they receive the appropriate input required to activate. (Activation is the process of sending an output signal after having received appropriate input signals.) After passing through some number of hidden layers, the output of the last hidden layer feeds into the output layer. Lastly, the output of this final layer is interpreted based on the nature of the problem. If there is a single output neuron, then the interpretation could be true if that neuron is activated or false if not. For neural networks used to classify input into various classes (e.g., number recognition), there is usually one output neuron per class. The one that is most activated would indicate the classification, as shown in Figure 7.4.

A neural network diagram with four output neurons, labeled A (0.63), B (0.81), C (0.32), and D (0.09). As the highest activation level, B is highlighted in green and the output of the neural network is B.
Figure 7.4 Output Neurons. In this figure, there are four output neurons, labeled A, B, C, and D. Since B has the highest activation level, the output of the neural network is B.
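
In code, this “most activated wins” rule is just a matter of picking the largest output value. Here is a minimal sketch using the activation levels shown in Figure 7.4:

    # Pick the class whose output neuron has the highest activation (values from Figure 7.4)
    activations = {"A": 0.63, "B": 0.81, "C": 0.32, "D": 0.09}
    predicted_class = max(activations, key=activations.get)
    print(predicted_class)  # B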

Each connection from one neuron to another has two parameters associated with it. The weight is a value w that multiplies the incoming signal, essentially determining the strength of the connection. The bias is a value b that is added to the weighted signal, making the neuron more likely (or less likely, if b is negative) to activate on any given input. The values of w and b may be positive, negative, or zero. Once the weight and bias are applied, the result is run through an activation function, which is a non-decreasing function f that determines whether the neuron activates and, if so, how strongly. (You’ll see this later in Figure 7.6.)

Consider the simplest case of all, a neuron with single input x and output y. Then the flow of signal through the neuron follows the formula y = f(wx + b). For example, suppose that the input value is x = 0.87, with weight w = 0.53 and bias b = -0.12. Then the signal would first be combined as wx + b = (0.53)(0.87) + (-0.12) = 0.3411. This value would be fed as input to the activation function to produce y = f(0.3411). In the next section, we will discuss various activation functions used in neural networks.
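
As a quick check of that arithmetic, here is a minimal Python sketch; the identity function stands in for the not-yet-specified activation f:

    # Single-input neuron: combine input, weight, and bias, then apply an activation f
    x, w, b = 0.87, 0.53, -0.12

    def f(z):
        # placeholder activation; a real neuron would use a sigmoid, ReLU, etc.
        return z

    y = f(w * x + b)
    print(y)  # 0.3411 (up to floating-point rounding)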

When there are multiple inputs, x1, x2, ..., xn, each will be affected by its own weight. Typically, bias is thought of as a property of the neuron itself and so bias affects all the inputs in the same way. To be more precise, first, the inputs are multiplied by their individual weights, the result is summed, and then the bias is added. Finally, the activation function is applied to obtain the output signal.

y = f(w1x1 + w2x2 + ... + wnxn + b) = f((∑ wixi) + b), where the sum runs from i = 1 to n

The weights and biases are parameters that the neural network learns during the training process, a topic we will explain in Backpropagation.

A typical neural network may have hundreds or thousands of inputs for each neuron, and so the equation can be difficult to work with. It would be more convenient to regard all the inputs x1, x2, ..., xn as parts of a single mathematical structure, called a vector. A vector is simply an ordered list of numbers, that is, x = (x1, x2, ..., xn). (Note: In this text we use boldface letters to represent vectors. Also, the components of a vector are listed within a set of parentheses. Some texts vary on these notational details.) The number of components in a vector is called its dimension, so for example the vector (5, 0, 2, 1.2, π) has dimension 5. Certain arithmetic operations are defined on vectors.

  • Vectors of the same dimension can be added: If x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn), then x + y = (x1 + y1, x2 + y2, ..., xn + yn).
  • A vector can be multiplied by any real number: kx = k(x1, x2, ..., xn) = (kx1, kx2, ..., kxn).
  • The dot product of two vectors of the same dimension results in a real number (not a vector) and is defined by:
    x · y = (x1, x2, ..., xn) · (y1, y2, ..., yn) = ∑ xiyi = x1y1 + x2y2 + x3y3 + ... + xnyn

If the inputs and weights are regarded as vectors, x = (x1, x2, ..., xn) and w = (w1, w2, ..., wn), respectively, then the formula may be re-expressed more concisely as:

y = f(w · x + b)

For example, if w = (-0.3, 0.1, 0.9), x = (1.2, 0.4, 0.6), and b = -0.1, then

w · x + b = (-0.3)(1.2) + (0.1)(0.4) + (0.9)(0.6) + (-0.1) = 0.12

So in this example, the output would be y = f(0.12), the exact value of which depends on which activation function f(x) is chosen.
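
A minimal NumPy sketch of this computation, using the weight, input, and bias values from the example above:

    import numpy as np

    # Weights, inputs, and bias from the example above
    w = np.array([-0.3, 0.1, 0.9])
    x = np.array([1.2, 0.4, 0.6])
    b = -0.1

    z = np.dot(w, x) + b  # w · x + b
    print(round(z, 2))    # 0.12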

Types of Activation Functions

Activation functions come in many types. Here are just a few of the most common activation functions.

  1. Step function, f(x) = 0 if x < c, and f(x) = 1 if x ≥ c. The value of c serves as a threshold. The neuron only activates when the input is at least equal to the parameter c.
  2. Sigmoid function, σ(x) = 1 / (1 + e^(-x)). Note, this is the same sigmoid function used in logistic regression (see Classification Using Machine Learning). Output values tend to be close to 0 when x is negative and close to 1 when x is positive, with a smooth transition in between.
  3. Hyperbolic tangent (tanh) function, tanh x = (e^x - e^(-x)) / (e^x + e^(-x)). Output values have the same sign as x, with a smooth transition through 0.
  4. Rectified linear unit (ReLU) function, ReLU(x) = max(0, x). The ReLU function is 0 for negative x values and equal to the input when x is positive.
  5. Leaky ReLU function, LReLU(x) = max(cx, x), for some small positive parameter c. Leaky ReLU acts much like ReLU except that the values get progressively more negative when x gets more negative. Often, the optimal “leakiness” parameter c is determined during the training phase.
  6. Softplus function, f(x) = ln(1 + e^x), which is a smoothed version of the ReLU function.

Figure 7.5 shows the graphs of the listed functions. A key feature of activation functions is that they are nonlinear, meaning that they are not just straight lines, f(x) = mx + b.

Another activation function that is important in neural networks, softmax, takes a vector of real number values and yields a vector of values scaled into the interval between 0 and 1, which can be interpreted as a discrete probability distribution. (Recall from Discrete and Continuous Probability Distributions that discrete probability distributions provide measures of probability or likelihood that each of a finite number of values might occur.) The formula for softmax is:

σ(x1, x2, ..., xn) = (e^(x1)/D, e^(x2)/D, ..., e^(xn)/D)

where D = e^(x1) + e^(x2) + ... + e^(xn). We will encounter softmax activation in an example in Backpropagation.

A grid of six plots showing different activation functions commonly used in neural networks. Each plot displays a curve representing the function's output values across a range of input values. Top, from left to right, the graphs are labeled (a) step, (b) sigmoid, and (c) tanh. Bottom, from left to right, the graphs are labeled (d) ReLU, (e) Leaky ReLU, and (f) softplus. X and Y axes are -2 to 2 with a blue line in each graph representing the function.
Figure 7.5 Graphs of Activation Functions. Top, from left to right, (a) step, (b) sigmoid, and (c) hyperbolic tangent (tanh) functions. Bottom, from left to right, (d) ReLU, (e) LReLU, and (f) softplus functions.
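
The functions listed above translate directly into code. The following is a minimal NumPy sketch of these activation functions; the thresholds c are passed as parameters, and the softmax implementation subtracts the maximum input before exponentiating, a common numerical-stability trick:

    import numpy as np

    def step(x, c=0.0):
        return np.where(x < c, 0.0, 1.0)      # 0 below the threshold c, 1 at or above it

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))       # smooth transition from 0 to 1

    def relu(x):
        return np.maximum(0.0, x)             # 0 for negative x, x otherwise

    def leaky_relu(x, c=0.01):
        return np.maximum(c * x, x)           # small negative slope for x < 0

    def softplus(x):
        return np.log(1.0 + np.exp(x))        # smoothed version of ReLU

    def softmax(x):
        e = np.exp(x - np.max(x))             # subtract the max for numerical stability
        return e / e.sum()                    # entries lie in (0, 1) and sum to 1

    # tanh is available directly as np.tanh
    print(softmax(np.array([1.0, 2.0, 3.0])))  # approximately [0.09 0.24 0.67]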

Example 7.1

Problem

A simple neuron has four inputs, x1, x2, x3, x4, and one output, y. The weights of the four inputs are w1 = 0.34, w2 = 0.07, w3 = 0.59, and w4 = 0.21. The bias of the neuron is b = 0.19. Find the output in each of the following cases.

  1. (x1, x2, x3, x4) = (1, 0, 0, 1). Activation function: ReLU.
  2. (x1, x2, x3, x4) = (1, 0, 0, 1). Activation function: sigmoid.
  3. (x1, x2, x3, x4) = (1, 0, 0, 1). Activation function: step, f(x) = 0 if x < 0.5, and f(x) = 1 if x ≥ 0.5.
  4. (x1, x2, x3, x4) = (0, 2.3, 1.6, 0.8). Activation function: softplus.
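
One way to check your answers is to compute w · x + b once per input and then apply the required activation function. A minimal sketch using the weights, bias, and inputs stated in the problem:

    import numpy as np

    w = np.array([0.34, 0.07, 0.59, 0.21])
    b = 0.19

    x_a = np.array([1, 0, 0, 1])        # inputs for cases 1-3
    x_b = np.array([0, 2.3, 1.6, 0.8])  # inputs for case 4

    z_a = np.dot(w, x_a) + b            # 0.34 + 0.21 + 0.19 = 0.74
    z_b = np.dot(w, x_b) + b

    print(np.maximum(0, z_a))           # case 1: ReLU
    print(1 / (1 + np.exp(-z_a)))       # case 2: sigmoid
    print(0 if z_a < 0.5 else 1)        # case 3: step with threshold 0.5
    print(np.log(1 + np.exp(z_b)))      # case 4: softplus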

In the next section, we explore neural networks that use the simplest of the activation functions, the step function.

Perceptrons

Although the idea of using structures modeled after biological neurons had been around prior to the development of what we would now consider neural networks, a significant breakthrough occurred in the middle of the 20th century. In 1957, Frank Rosenblatt developed the perceptron, a type of single-layer neural network designed for binary classification tasks—in other words, a neural network whose output could be “true” or “false” (see Figure 7.7). After initial successes, progress in AI faced challenges due to limited capabilities of AI programs and the lack of computing power, leading to a period of minimal progress, called the “Winter of AI.” Despite these setbacks, new developments occurred in the 1980s leading to the multilayer perceptron (MLP), which serves as the basic paradigm for neural networks having multiple hidden layers.

The single-layer perceptron learns weights through a simple learning rule, known as the perceptron learning rule.

  1. Initialize the weights (w) and bias (b) to small random values.
  2. For each input (x) in the training set,
    1. Compute the prediction ŷ = f(w · x + b), where f is the step function,
      f(x) = 0 if x < 0, and f(x) = 1 if x ≥ 0
    2. Find the error E = y - ŷ, where y is the true or desired value corresponding to x. Note that E can be positive or negative. Positive E implies the weights and/or bias are too low and need to be raised. Negative E implies the opposite.
    3. Update the weights and bias according to the following formulas, where h is a small positive constant that controls the learning rate. Often the change of weights and bias will be small, not immediately causing a large effect on the output. However, with repeated training, the perceptron will learn the appropriate values of w and b to achieve the desired output.
      w ← w + hEx
      b ← b + hE

For example, suppose a perceptron with three inputs currently has weights w = (0.5, 0.7, 0.1) and bias b = -0.9. On input x = (0.6, 1, 0), we get:

ŷ = f((0.5)(0.6) + (0.7)(1) + (0.1)(0) - 0.9) = f(0.1) = 1

Suppose that the true output should have been y = 0. So there’s an error of E = 0 - 1 = -1. If the learning rate is h = 0.1, then the weights and bias will be updated as follows:

w ← w + hEx = (0.5 + (0.1)(-1)(0.6), 0.7 + (0.1)(-1)(1), 0.1 + (0.1)(-1)(0)) = (0.44, 0.6, 0.1)
b ← b + hE = -0.9 + (0.1)(-1) = -1.0

On the same input, the perceptron now has a value of:

ŷ = f((0.44)(0.6) + (0.6)(1) + (0.1)(0) - 1.0) = f(-0.136) = 0

In this simple example, the value ŷ changed from 1 to 0, eliminating the error. However, there is no guarantee that the perceptron will classify all input without error, regardless of the number of training steps that are taken.
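
The worked example above can be reproduced in a few lines of NumPy. This is a minimal sketch of a single update step under the perceptron learning rule, using the values from the text:

    import numpy as np

    def step(z):
        return 1 if z >= 0 else 0

    # Current parameters and one training example (values from the text)
    w = np.array([0.5, 0.7, 0.1])
    b = -0.9
    x = np.array([0.6, 1.0, 0.0])
    y = 0            # true label
    h = 0.1          # learning rate

    y_hat = step(np.dot(w, x) + b)  # 1 (incorrect)
    E = y - y_hat                   # -1

    # Perceptron learning rule: w <- w + hEx, b <- b + hE
    w = w + h * E * x               # (0.44, 0.6, 0.1)
    b = b + h * E                   # -1.0

    print(step(np.dot(w, x) + b))   # 0 (now correct on this input)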

Example 7.2

Problem

Suppose the next data point in the training set is x = (0, 0.5, 1), and suppose that the correct classification for this point is y = 1. With current weights w = (0.44, 0.6, 0.1) and bias b = -1, use the perceptron learning rule with h = 0.1 to update the weights and bias. Is there an improvement in classifying this data point?

Building a Simple Neural Network

In What Is Data Science?, Example 1.7, we looked at the Iris Flower dataset to become familiar with dataset formats and structures. In this section, we will use the Iris dataset, iris.csv, to create a neural network with four inputs and one output neuron (essentially, a simple perceptron) to classify irises as either setosa or not setosa, according to four features: sepal length (SL), sepal width (SW), petal length (PL), and petal width (PW), all of which are measured in centimeters. The Iris dataset iris.csv has 150 samples of irises, consisting of 50 of each species, setosa, versicolor, and virginica. For the purposes of this example, the first 100 rows of data were selected for training and testing, using 75% of the data for training and the remaining 25% for testing. (Recall from Decision-Making Using Machine Learning Basics that it is good practice to use about 70%–80% of the data for training, with the rest used for testing.) The network will be trained using the perceptron learning rule, but instead of using the step function, the activation function will be the hyperbolic tangent, tanh x = (e^x - e^(-x)) / (e^x + e^(-x)), which is chosen simply for illustration purposes.

The Iris Flower Dataset

The iris database was collected and used by Sir R. A. Fisher for his 1936 paper “The Use of Multiple Measurements in Taxonomic Problems.” This work became a landmark study in the use of multivariate data in classification problems and frequently makes an appearance in data science as a convenient test case for machine learning and neural network algorithms.

The model for the neural network is shown in Figure 7.6.

A neural network diagram with four inputs labeled X1, X2, X3, and X4. Lines from the inputs to a circle with b in the center (representing bias) are labeled W1, W2, W3, and W4 to represent weights. The bias points to a rectangle labeled tanh and the output is y.
Figure 7.6 Simple Perceptron Neural Network Model. Here, x1, x2, x3, x4 are the inputs, w1, w2, w3, w4 represent the weights, and b represents the bias. The output, or response, is y.

The inputs (x1, x2, x3, x4) will take the values of SL, SW, PL, and PW, respectively. There are four corresponding weights (w1, w2, w3, w4) and one bias value b, which are applied to the input layer values, and then the result is fed into the tanh function to obtain the output y. Thus, the formula used to produce the predicted y values, denoted by ŷ, is:

ŷ = tanh(w1x1 + w2x2 + w3x3 + w4x4 + b)

In this very simplistic model, we will interpret the value of ŷ as follows:

  • If ŷ > 0, classify the input as setosa.
  • If ŷ < 0, classify the input as not setosa.

As for classifying error, we need to pick concrete target values to compare our outputs with. Let’s consider the target value for classifying setosa as 1, while the target for classifying not setosa is -1. This is reasonable as the values of tanh(x) approach 1 as x increases to positive infinity and -1 as x decreases to negative infinity.

For example, if after training, the ideal weights were found to be (w1, w2, w3, w4) = (-0.2, 0.6, -0.9, 0.1), and the bias is b = 0.5, then the response on an input (x1, x2, x3, x4) = (5.1, 3.5, 1.4, 0.2) would be:

ŷ = tanh((-0.2)(5.1) + (0.6)(3.5) + (-0.9)(1.4) + (0.1)(0.2) + 0.5) = tanh(0.34) = 0.327

Since ŷ > 0, classify the input as setosa. Of course, the natural question would be, how do we obtain those particular weights and bias values in the first place? This complicated task is best suited for a computer. We will use Python to train the perceptron.
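
A minimal sketch of this hand computation; the weights and bias are the illustrative values above, not values produced by actual training:

    import numpy as np

    # Illustrative weights and bias (not learned values)
    w = np.array([-0.2, 0.6, -0.9, 0.1])
    b = 0.5

    # One iris sample: SL, SW, PL, PW in centimeters
    x = np.array([5.1, 3.5, 1.4, 0.2])

    y_hat = np.tanh(np.dot(w, x) + b)
    print(round(y_hat, 3))                          # 0.327
    print("setosa" if y_hat > 0 else "not setosa")  # setosa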

The Python library sklearn.datasets has a function, load_iris, that automatically loads the Iris dataset, so you should not need to import the dataset iris.csv. Other modules imported from sklearn are Perceptron, used to build the perceptron model, and train_test_split, used to randomly split a dataset into a training set and a testing set. Here, we will use a split of 75% train and 25% test, which is handled by the parameter test_size=0.25. We will also renormalize the data so that all features contribute equally to the model. The way to do this in Python is by using StandardScaler, which renormalizes the features of the dataset so that the mean is 0 and standard deviation is 1. Finally, accuracy_score is used to compute the accuracy of the classification.

Python Code

    
    # import the sklearn library (specific modules)
    from sklearn.datasets import load_iris
    from sklearn.linear_model import Perceptron
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import accuracy_score
    
    # Load the Iris dataset
    iris = load_iris()
    X = iris.data[:100] # Only taking first 100 samples for binary classification
    y = iris.target[:100] # 0 for setosa, 1 for not setosa
    
    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    
    # Standardize features by removing the mean and scaling to unit variance
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Initialize and train the perceptron
    perceptron = Perceptron()
    perceptron.fit(X_train_scaled, y_train)
    
    # Make predictions on the test set
    y_pred = perceptron.predict(X_test_scaled)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)
    

The resulting output will look like this:

Accuracy: 1.0

Because the dataset was relatively small (100 rows) and well-structured, the perceptron model was able to classify the data with 100% accuracy (“Accuracy: 1.0”). In the real world, when datasets are larger and less regular, we do not expect 100% accuracy. There could also be significant overfitting in the model, which would lead to high variance when classifying data not found in the original dataset.

Here is how to use the perceptron model we just created to classify new data (outside of the original dataset):

Python Code

    
    # Assume you have new inputs stored in a variable called 'new_inputs'
    new_inputs = [[5.1, 3.5, 1.4, 0.2],  # Example input 1
                  [6.3, 2.9, 5.6, 1.8]]  # Example input 2
    
    # Standardize the new inputs using the same scaler used for training
    new_inputs_scaled = scaler.transform(new_inputs)
    
    # Use the trained perceptron model to make predictions on the new inputs
    predictions = perceptron.predict(new_inputs_scaled)
    
    # Display the predictions
    for i, prediction in enumerate(predictions):
     print(f"Input {i+1} prediction: {'Setosa' if prediction == 0 else 'Not Setosa'}")
    

The resulting output will look like this:

Input 1 prediction: Setosa
Input 2 prediction: Not Setosa