FIGURE 5 Filters learned on the first-layer (top) and second-layer (bottom) bases. Second-layer filters can be viewed as linear combinations of the first-layer basis.
4.1.2 | Autoencoder (AE)
The traditional autoencoder (AE) is a three-layered, fully connected network. Unlike the MLP, the AE is divided into two sections: an encoder and a decoder. The middle layer has the lowest dimensionality and is called the bottleneck layer. An AE with a single hidden layer and linear activations behaves much like principal component analysis (PCA), learning essentially the same subspace. PCA is a widely used dimensionality reduction technique that yields orthogonal basis functions. As a linear method it gives excellent results; however, because it discards features with low variance, it can omit information that may be vital for feature learning. Furthermore, most datasets are nonlinear, so the autoencoder is adopted as a nonlinear dimensionality reduction method to overcome these limitations of PCA.
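As a concrete illustration of this relationship, the sketch below (a minimal, hypothetical PyTorch example with placeholder data, not drawn from the works surveyed here) trains a single-hidden-layer autoencoder with purely linear activations; under a mean-squared-error loss, its bottleneck learns approximately the same subspace as the leading principal components.

```python
import torch
import torch.nn as nn

# Minimal linear autoencoder: with no nonlinearity and an MSE loss, the
# k-dimensional bottleneck learns (approximately) the subspace spanned by
# the top-k principal components of the data.
class LinearAE(nn.Module):
    def __init__(self, in_dim: int, bottleneck_dim: int):
        super().__init__()
        self.encoder = nn.Linear(in_dim, bottleneck_dim, bias=False)
        self.decoder = nn.Linear(bottleneck_dim, in_dim, bias=False)

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Example: project mean-centred placeholder data onto a 2-D latent space.
x = torch.randn(512, 20)          # 512 samples, 20 features (illustrative)
x = x - x.mean(dim=0)             # centre the data, as PCA does
model = LinearAE(in_dim=20, bottleneck_dim=2)
optim = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(200):              # short training loop for illustration
    optim.zero_grad()
    loss = loss_fn(model(x), x)   # reconstruct the input itself
    loss.backward()
    optim.step()
```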
The deep autoencoder has more than three layers and uses nonlinear activation functions. The typical structure of the AE is illustrated in Figure 6. Initially, the RBM was used as the building block of the deep AE. This method reshapes the image to 1D before processing it further. This original version of the AE is effectively a stack of RBMs whose layer widths shrink progressively through the encoding section. As mentioned earlier, the AE has encoding and decoding sections; generally, the filters in the decoding section are the transpose of the encoding filters. This arrangement preserves the symmetry of the architecture and reduces the number of parameters to be trained. The AE is an unsupervised method that creates a latent representation of the input by having the decoder regenerate the same input while minimizing the reconstruction error. Stochastic gradient descent (SGD) is widely used to optimize the filters.
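The sketch below (a hypothetical PyTorch example; the layer widths and image size are illustrative, not taken from any specific paper) shows a deep fully connected AE in which the image is flattened to 1D, the decoder reuses the transposed encoder weights so that the symmetry is preserved and the parameter count is roughly halved, and the reconstruction error is minimized with SGD.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedDeepAE(nn.Module):
    """Deep AE whose decoder applies the transpose of each encoder weight."""
    def __init__(self, dims=(784, 256, 64, 32)):   # e.g. flattened 28x28 input
        super().__init__()
        self.enc = nn.ModuleList(
            nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])
        )
        # Separate biases for the decoding path; the weights are shared (transposed).
        self.dec_bias = nn.ParameterList(
            nn.Parameter(torch.zeros(d)) for d in dims[:-1]
        )

    def forward(self, x):
        x = x.flatten(1)                       # reshape the image to 1D
        for layer in self.enc:                 # encoding: width shrinks with depth
            x = torch.relu(layer(x))
        for layer, b in zip(reversed(self.enc), reversed(self.dec_bias)):
            x = torch.relu(F.linear(x, layer.weight.t(), b))  # transposed filters
        return x

model = TiedDeepAE()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
imgs = torch.rand(16, 1, 28, 28)                 # placeholder batch
loss = F.mse_loss(model(imgs), imgs.flatten(1))  # regenerate the same input
loss.backward()
opt.step()
```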
FIGURE 6 Typical autoencoder architecture. The first half is the encoding section, ending in a bottleneck layer (middle section), followed by a decoding section that is mostly the inverse of the encoding section.
With the advancement of CNNs, the AE adopted convolution layers instead of RBMs as its building block. The 1D flattening of the image, however, does not preserve the 2D or 3D layout of the data. This issue is resolved in a modified architecture named the convolutional AE (CAE). The deep CAE, commonly known simply as CAE, is used for object classification as a semi-supervised or unsupervised technique, depending on the method with which it is combined. In both cases, the compressed bottleneck code is computed and used as the input to a classification model such as an MLP with a SoftMax classifier or an SVM, or to a clustering method such as the Gaussian mixture model (GMM), K-means, or the self-organizing map (SOM). Many such combinations have been widely studied and used as self-supervised, semi-supervised, or unsupervised architectures. Decreasing the number of nodes with depth is a trade-off between dimensionality reduction and feature extraction, and the appropriate rate depends solely on the objective function of the application. When the AE is used to regenerate missing sections of an image, the exact dimension of the bottleneck layer is considered unimportant.
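A minimal sketch of one such pipeline is given below (hypothetical PyTorch/scikit-learn code with placeholder layer sizes and data): a small convolutional AE keeps the 2D layout of the input, and its flattened bottleneck code is handed to K-means for unsupervised grouping; an MLP, SVM, or GMM could be substituted at the same point.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class SmallCAE(nn.Module):
    """Convolutional AE: the 2-D image layout is preserved instead of flattening first."""
    def __init__(self, latent_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),                # 28x28 -> 14x14
            nn.Conv2d(16, latent_channels, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 16, 2, stride=2), nn.ReLU(),    # 7x7 -> 14x14
            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),               # 14x14 -> 28x28
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# After (or during) unsupervised reconstruction training, the compressed
# bottleneck codes can be clustered or passed to a classifier.
model = SmallCAE()
imgs = torch.rand(256, 1, 28, 28)          # placeholder images
_, codes = model(imgs)
codes = codes.flatten(1).detach().numpy()  # bottleneck features, 8*7*7 = 392-D
labels = KMeans(n_clusters=10, n_init=10).fit_predict(codes)
```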
On the other hand, the bottleneck layer is the essence of the data when the AE is used as a dimensionality-reduction-based feature extractor. The optimal size of the bottleneck layer is dataset-dependent and treated as a hyperparameter. For example, it was chosen as 256D for the ORL and Yale datasets, 60D for MNIST, 100D for Fashion-MNIST, 160D for the COIL20 dataset, and 160D for the Astronomical dataset. The encoder section of a CAE is usually derived from successful CNN models such as AlexNet or ResNet, or from generic stacks of convolution layers that inherit the properties of a typical CNN architecture. The filter sizes remain 3x3, 5x5, and 7x7, the same as in traditional CNNs, while the number of filters and the depth remain hyperparameters.
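The snippet below is a hypothetical sketch of how the bottleneck size can be exposed as a hyperparameter; the dictionary simply restates the dataset-dependent dimensions quoted above, and the encoder layers (filter counts, depth, 28x28 input) are illustrative placeholders rather than any particular published architecture.

```python
import torch.nn as nn

# Dataset-dependent bottleneck sizes quoted above, treated as a hyperparameter
# that is passed to the model builder (values restated here for illustration).
BOTTLENECK_DIM = {
    "ORL": 256, "Yale": 256, "MNIST": 60,
    "Fashion-MNIST": 100, "COIL20": 160, "Astronomical": 160,
}

def build_encoder(in_channels: int, feat_map_dim: int, latent_dim: int) -> nn.Module:
    """CNN-style encoder with the usual 3x3/5x5 filters; the number of filters
    and the depth remain hyperparameters, as in a typical CNN backbone."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * feat_map_dim * feat_map_dim, latent_dim),  # bottleneck layer
    )

encoder = build_encoder(in_channels=1, feat_map_dim=7,            # 28x28 input -> 7x7 maps
                        latent_dim=BOTTLENECK_DIM["MNIST"])
```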
The variational autoencoder (VAE) offers a better-organized latent space than the typical AE by adding a regularization term to the loss function. The standard AE is prone to learning only the identity function during training as more hidden layers are added. The denoising autoencoder (DA) counters this problem by adding noise to the input, forcing the AE to denoise and reconstruct it. The DA offers excellent denoising capability with no distinct difference in the learned filters, as shown in Figure 7. In the experiment, the filter sizes were set to 7x7, 5x5, and 3x3 over the MNIST and CIFAR-10 datasets, with 2x2 max-pooling layers.
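The two helper functions below (a hypothetical PyTorch sketch, assuming `model` maps a batch to a reconstruction of the same shape) illustrate these ideas: a denoising training step that corrupts the input with Gaussian noise while reconstructing the clean target, and a VAE-style loss in which a KL-divergence term regularizes the latent distribution toward a standard normal prior.

```python
import torch
import torch.nn.functional as F

def denoising_step(model, clean, optimizer, noise_std=0.3):
    """One DA training step: corrupt the input, reconstruct the clean target."""
    noisy = clean + noise_std * torch.randn_like(clean)   # additive Gaussian noise
    recon = model(noisy)
    loss = F.mse_loss(recon, clean)                       # denoise and reconstruct
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def vae_loss(recon, target, mu, log_var):
    """VAE objective: reconstruction error plus a KL regularizer that pulls
    the latent distribution q(z|x) toward a standard normal prior."""
    recon_term = F.mse_loss(recon, target, reduction="sum")
    kl_term = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_term + kl_term
```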