Convolutional neural network (CNN)
The convolutional neural network (CNN) has dominated computer vision for almost three decades. There is no better brief introduction than LeNet-5, a pioneering architecture presented in 1998. LeNet-5 consists of seven layers: three convolutional layers (C1, C3, and C5), two subsampling layers (S2 and S4), and two fully connected layers (F6 and F7). The core components are the convolutional layers, where the actual learning happens. Images have a 2D spatial structure and may carry an additional spectral dimension for color (e.g., three-dimensional (3D) for RGB images). Convolving a filter with the whole image at once would yield a single scalar, which is not helpful for multidimensional feature learning. Hence, the input image is divided into smaller patches, often called the activation space. The filter(s) convolve with the activation space by sliding over the input image in an overlapping or non-overlapping fashion, producing a scalar for each patch. The output passes through an activation function, e.g., the rectified linear unit (ReLU), sigmoid, tanh, or their variants. The resulting values are placed at the same coordinates as the corresponding patches in the input image, forming a feature map; each filter produces one feature map.
The next layer performs subsampling, which reduces the size of the feature maps to cut the number of parameters and speed up computation. The common subsampling techniques are max-pooling and average-pooling: as the names suggest, from a window of pixels in the feature map, either only the maximum value is kept or the average over the window is taken, respectively. The pooling window size is a hyperparameter; it is often chosen as 2x2 for low-resolution inputs such as the MNIST digit database. These layers are stacked as the model grows deeper. The last convolutional layer is usually followed by fully connected layers and a classification or clustering method, depending on the objective of the task.
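To make the sliding-window computation concrete, the following is a minimal NumPy sketch of a single-filter convolution followed by a ReLU activation; the function name conv2d_single_filter, the 28x28 input, and the 5x5 filter size are illustrative assumptions (the 5x5 size matches LeNet-5's C1 filters, but the sketch is not LeNet-5 itself). With stride 1 the patches overlap; a stride equal to the filter size gives non-overlapping patches.

import numpy as np

def conv2d_single_filter(image, kernel, stride=1):
    """Slide one filter over a 2D image, producing one feature map.

    Each kh x kw patch (the 'activation space' in the text) is
    multiplied elementwise with the kernel and summed to a scalar,
    which is placed at the patch's coordinates in the output.
    """
    kh, kw = kernel.shape
    h, w = image.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            fmap[i, j] = np.sum(patch * kernel)
    return fmap

def relu(x):
    """Rectified linear unit: zero out negative responses."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))   # e.g., one MNIST-sized input
kernel = rng.standard_normal((5, 5))    # one 5x5 filter (assumed size)
feature_map = relu(conv2d_single_filter(image, kernel, stride=1))
print(feature_map.shape)                # (24, 24): 28 - 5 + 1

Each additional filter would repeat this computation with its own kernel, yielding one feature map per filter, as described above.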
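The subsampling step can be sketched the same way. The snippet below implements both max-pooling and average-pooling over non-overlapping windows; the function name pool2d, the mode argument, and the 4x4 toy input are illustrative assumptions, with the 2x2 window size taken from the text.

import numpy as np

def pool2d(fmap, size=2, mode="max"):
    """Subsample a feature map with a non-overlapping size x size window.

    'max' keeps the largest value in each window (max-pooling);
    'avg' takes the mean of the window (average-pooling).
    """
    h, w = fmap.shape
    out_h, out_w = h // size, w // size
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i * size:(i + 1) * size,
                          j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, size=2, mode="max"))  # [[ 5.  7.] [13. 15.]]
print(pool2d(fmap, size=2, mode="avg"))  # [[ 2.5  4.5] [10.5 12.5]]

A 2x2 window with this non-overlapping scheme halves each spatial dimension, quartering the number of values passed to the next layer, which is the parameter and speed saving the text refers to.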