Principles of Data Science

7.4 Convolutional Neural Networks


Learning Outcomes

By the end of this section, you should be able to:

  • 7.4.1 Define convolutional neural networks and describe some problems where they may perform better than standard neural networks.
  • 7.4.2 Describe feature maps.
  • 7.4.3 Train and test a convolutional neural network on image data and discuss the results.

Convolutional neural networks (CNNs) are a powerful class of neural network models developed to process structured, grid-like data, such as images, making use of the mathematical operation of convolution (which is similar to applying a filter or mask to an image). CNNs are useful for image classification, locating objects within images, edge detection, and capturing spatial relationships that exist within an image. We’ll discuss these applications, and we'll explore the concept of feature maps, which are the output of convolutional layers in CNNs and represent the learned features of the input data. By the end of this section, you should see how to train and test a CNN to classify image data using Python.
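
To make the filtering idea concrete, here is a minimal NumPy sketch (the tiny image, the edge filter, and all sizes are illustrative assumptions). It slides a 3×3 vertical-edge filter over a small image and records the response at each position, which is exactly the feature-map computation discussed later in this section:

```python
import numpy as np

# A tiny 6x6 grayscale "image" with a bright vertical stripe in the middle.
image = np.zeros((6, 6))
image[:, 2:4] = 1.0

# A 3x3 vertical-edge filter: it responds where intensity changes
# from left to right.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# Slide the filter over every 3x3 patch and sum the elementwise
# products -- the "convolution" (strictly, cross-correlation) used in CNNs.
h = image.shape[0] - kernel.shape[0] + 1
w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        feature_map[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

# Negative responses mark the stripe's dark-to-bright (left) edge,
# positive responses its bright-to-dark (right) edge.
print(feature_map)
```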

The Components of Convolutional Neural Networks

CNNs consist of multiple hidden layers, including convolutional layers, pooling layers, and fully connected layers. The key components of a CNN are as follows:

  • Convolutional layers: These layers apply “filters” to the input data that pick up on distinct structural features, such as vertical or horizontal lines/edges in an image. This is where the mathematical convolution operations are applied and feature maps are produced, topics we discuss further later in this section.
  • Pooling layers: Pooling layers down-sample the feature maps produced by the convolutional layers, reducing the spatial dimensions of the data while retaining important information. To illustrate, consider the list of data [0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2, 2]. Taking the maximum of each window of three values compresses it to [0, 1, 0, 2]. (In real applications, the down-sampled values are computed by max pooling, which keeps the largest value in each window, or by average pooling, which averages it; this example is revisited in the sketch after this list.)
  • Fully connected layers: These layers are typically used at the end of the network to perform classification or regression tasks based on the learned features extracted by the convolutional layers. Fully connected layers connect every neuron in one layer to every neuron in the next layer, as in a traditional neural network.
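
The following sketch (assuming TensorFlow's Keras API; the 28×28 input size, filter counts, and 10 output classes are illustrative assumptions) first checks the pooling example above with NumPy and then stacks the three kinds of layers into a small network:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Max pooling on the 1-D example above: keep the maximum of each
# window of three values: [0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2, 2] -> [0, 1, 0, 2].
data = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2, 2])
print(data.reshape(4, 3).max(axis=1))  # [0 1 0 2]

# Stacking the three kinds of layers into a small CNN for (assumed)
# 28x28 grayscale images and 10 output classes.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=3, activation="relu"),  # convolutional layer
    layers.MaxPooling2D(pool_size=2),                     # pooling layer
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),                                     # feature maps -> vector
    layers.Dense(10, activation="softmax"),               # fully connected layer
])
model.summary()  # shows how each layer changes the data's shape
```

Note how each pooling layer halves the spatial dimensions, while Flatten hands the final feature maps to the fully connected classifier.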

CNNs excel at tasks that require capturing spatial patterns and finding hierarchical features in images. In addition to image classification and object detection, CNNs can be used for semantic segmentation, which involves classifying each pixel of an image into a specific category or class. For example, in Figure 7.11, the parts of an image are analyzed and marked as different types, such as buildings, vehicles, and road. This kind of classification is an important aspect of developing safe self-driving cars.

Figure 7.11 Semantic Segmentation of an Image. Small cars appear in blue, large vans/trucks in yellow, traffic signs in red, and buildings and other barriers in brown. (credit: "Semantic Segmentation Image Annotation Kotwel" by Kotwel Inc./Flickr, CC BY 2.0)

Feature Maps

In a typical CNN, after the data has been sent through convolutional and pooling layers, the network can extract various features from the input. This step relies on filters, or kernels, designed to pick up on specific patterns or features, such as edges, textures, or shapes, in different regions of the input data. Applying these filters to the input data produces feature maps, which are essentially two-dimensional arrays of values representing the activations of the filters across the spatial dimensions of the input. Each element of a feature map corresponds to the activation of a specific filter at a particular location in the input data.

Feature maps play a crucial role in the representation learning process of CNNs. They capture hierarchical features at different levels of abstraction, with early layers capturing low-level features (e.g., edges, corners) and deeper layers capturing more complex and abstract features (e.g., object parts, semantic concepts). By stacking multiple convolutional layers, CNNs learn to progressively extract (and abstract) features from raw input data, enabling them to effectively model complex patterns and relationships in the data.
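
Feature maps can also be inspected directly. One common pattern in Keras (assumed here; the layer names, sizes, and the random input are placeholders) is to build a second model that maps the input to an intermediate convolutional layer's output:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# A small CNN; the layer name "conv1" is our own label for later lookup.
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, 3, activation="relu", name="conv1"),
    layers.MaxPooling2D(2),
    layers.Conv2D(16, 3, activation="relu", name="conv2"),
])

# A second model that maps the same input to conv1's output,
# i.e., that layer's feature maps.
feature_extractor = keras.Model(inputs=model.inputs,
                                outputs=model.get_layer("conv1").output)

image = np.random.rand(1, 28, 28, 1)     # one placeholder grayscale image
feature_maps = feature_extractor(image)
print(feature_maps.shape)                # (1, 26, 26, 8): one 26x26 map per filter
```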

Examples of some common features that may be captured in feature maps include the following:

  • Edges and textures. These are fundamental features, useful for recognizing digits or letters and for finding well-defined boundaries between parts of an image.
  • Color and intensity variations. Determining actual color information from a digital image is often very difficult. The same color may appear drastically different depending on factors such as lighting, surrounding colors, and intensity.
  • Shapes and structures. Building upon edge-detection features, shapes and larger structures can be built up as higher-level features.
  • Object parts and high-level patterns. Even higher in the hierarchy of image features, objects may be analyzed into parts, and non-obvious patterns may be discovered by the CNN.
  • Semantic concepts (see semantic segmentation, mentioned previously). Outside of the realm of image analysis, feature maps may be synthesized to detect semantic concepts—that is, parts of the data that work together to form meaningful objects or ideas. Learning to recognize semantic concepts is key to the successful operation of natural language models.
  • Temporal dynamics. In sequential data, such as audio signals or time series data, feature maps may capture temporal dynamics and patterns over time. These features help the network understand temporal dependencies and make predictions based on sequential input (a minimal 1-D convolution sketch follows this list).
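
As a minimal sketch of that last point (again assuming Keras; the sequence length and layer sizes are arbitrary), a 1-D convolutional layer slides its filters along the time axis rather than across image rows and columns:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A 1-D convolution over a univariate time series: each of the 16 filters
# slides along the time axis, producing a feature map that responds to a
# local temporal pattern.
model = keras.Sequential([
    layers.Input(shape=(100, 1)),         # 100 time steps, 1 channel
    layers.Conv1D(16, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),          # keep each filter's strongest response
    layers.Dense(1),                      # e.g., a single predicted value
])
model.summary()
```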

Image Analysis Using CNNs

Image analysis using semantic segmentation is a computer vision task that involves assigning semantic labels (meanings) to each pixel in an image, thus dividing the image into semantically meaningful regions or segments. Unlike traditional image classification tasks, where the goal is to assign a single label to the entire image, semantic segmentation provides a pixel-level understanding of the scene by labeling each pixel with the corresponding class or category it belongs to.

In semantic segmentation, each pixel in the image is assigned a label that represents the object or category it belongs to, such as “person,” “vehicle,” “tree,” or “road.” The output is a segmented image where different regions or objects are delineated by distinct colors or labels (as shown in Figure 7.11). This fine-grained understanding of the image allows for more detailed analysis and interpretation of visual scenes, making semantic segmentation a crucial task in various computer vision applications, including autonomous driving, medical image analysis, scene understanding, and augmented reality.

Semantic segmentation is typically performed using deep learning techniques, particularly convolutional neural networks, which have shown remarkable success in capturing spatial dependencies and learning complex visual patterns. By training CNNs on annotated datasets containing images and corresponding pixel-level labels, semantic segmentation models can learn to accurately segment objects and scenes in unseen images, enabling a wide range of advanced visual perception tasks.
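
A full segmentation network is beyond our scope, but the following toy encoder-decoder (a hedged sketch, not a production architecture; the 128×128 input size and the five classes are assumptions) shows the key idea: downsample to capture context, upsample back to the input resolution, and classify every pixel:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 5  # assumed number of pixel categories (road, vehicle, ...)

# Toy encoder-decoder: downsample with convolution + pooling to capture
# context, upsample back to the input resolution, then classify every pixel.
model = keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),                               # 128x128 -> 64x64
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.UpSampling2D(2),                               # 64x64 -> 128x128
    layers.Conv2D(NUM_CLASSES, 1, activation="softmax"),  # per-pixel class scores
])

# Labels are per-pixel class indices (shape 128x128), so sparse
# categorical cross-entropy is applied at every pixel.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```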

The process of training a CNN to perform semantic segmentation is more involved than in image classification tasks. It is a supervised learning task, in that the training set must have labels assigned to each region of each image. The tedious work of labeling regions of images typically falls to human annotators.
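
Returning to the simpler image classification setting, here is a minimal end-to-end example of training and testing a CNN in Python, as promised at the start of this section. It uses the MNIST handwritten-digit data that ships with Keras; the architecture and the number of epochs are illustrative choices rather than tuned values:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Load the MNIST handwritten-digit dataset bundled with Keras
# (60,000 training and 10,000 test images, 28x28 grayscale, 10 classes).
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., np.newaxis] / 255.0  # scale to [0, 1], add channel axis
x_test = x_test[..., np.newaxis] / 255.0

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train briefly, holding out 10% of the training images for validation ...
model.fit(x_train, y_train, epochs=3, batch_size=128, validation_split=0.1)

# ... then test on images the network has never seen.
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
```

Even after a few epochs, a small network like this typically classifies the held-out test digits with high accuracy, which is a good way to verify that the convolutional and pooling layers are extracting useful features.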
