FIGURE 5 Filters learned on the first-layer (top) and second-layer (bottom) bases. Second-layer filters can be viewed as linear combinations of the first-layer basis.
4.1.2 | Autoencoder (AE)
The traditional autoencoder (AE) is a three-layered, fully connected
network. Unlike the MLP, the AE is divided into two sections: an encoder and a
decoder. The middle section has the lowest dimension and is called the
bottleneck layer. An AE with a single hidden layer and linear activation
behaves similarly to principal component analysis (PCA). PCA is a
widely used dimensionality-reduction technique that yields orthogonal basis
functions. As a linear method it provides excellent results;
however, because it discards features with low variance, it can omit
information that may be vital for feature learning. Furthermore, most
datasets are nonlinear, so the autoencoder is adopted as a nonlinear
dimensionality-reduction method to overcome these limitations of PCA.
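The relation to PCA can be made concrete with a minimal sketch (the 784-D flattened input, 32-D code, and learning rate are illustrative assumptions, not values from the reviewed works): with purely linear layers the bottleneck spans roughly the same subspace as PCA, and inserting a nonlinear activation after the encoder turns the same model into the nonlinear dimensionality reducer described above.

```python
# Minimal sketch: single-hidden-layer AE with linear activations (PCA-like behavior).
import torch
import torch.nn as nn

class LinearAE(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Linear(in_dim, code_dim)   # projection to the bottleneck
        self.decoder = nn.Linear(code_dim, in_dim)   # reconstruction back to input space

    def forward(self, x):
        code = self.encoder(x)          # no nonlinearity here: PCA-style linear projection
        return self.decoder(code), code

model = LinearAE()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(64, 784)                # stand-in batch; replace with real flattened images
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x) # reconstruction error drives the training
loss.backward()
opt.step()
```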
The deep autoencoder has more than three layers with nonlinear
activation functions. The typical structure of an AE is shown
in Figure 6. Initially, the RBM architecture was used as the building
block of the deep AE. This method reshapes the image to 1D before processing
it further. This original version of the AE is essentially a stack of RBMs
whose layer widths shrink toward the bottleneck of the encoding section.
As mentioned earlier, the AE has
encoding and decoding sections; generally, the filters in the decoding
section are the transpose of the encoding filters. This arrangement
preserves the symmetry of the architecture and reduces the number of
parameters to be trained. The AE is an
unsupervised method that creates a latent representation of the input by
trying to regenerate that input with the decoder while minimizing the
reconstruction error. Stochastic gradient descent (SGD) is widely used to
optimize the filters.
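A minimal sketch of this symmetric arrangement, assuming illustrative layer widths of 784-256-64: the decoder reuses the transposed encoder weights, which roughly halves the parameters to be trained, and the whole model is optimized with SGD on the reconstruction error.

```python
# Minimal sketch: deep AE with tied (transposed) decoder weights, trained with SGD.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedDeepAE(nn.Module):
    def __init__(self, dims=(784, 256, 64)):
        super().__init__()
        # Encoder weight matrices; the decoder applies their transposes in reverse order.
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(d_out, d_in) * 0.01)
             for d_in, d_out in zip(dims[:-1], dims[1:])]
        )

    def forward(self, x):
        for w in self.weights:                    # encoding path, layers narrow with depth
            x = torch.relu(F.linear(x, w))
        for i in range(len(self.weights) - 1, -1, -1):
            x = F.linear(x, self.weights[i].t())  # decoding path reuses transposed filters
            if i > 0:                             # keep the output layer linear
                x = torch.relu(x)
        return x

model = TiedDeepAE()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # SGD, as mentioned in the text
x = torch.randn(32, 784)
loss = F.mse_loss(model(x), x)
loss.backward()
opt.step()
```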
FIGURE 6 Typical autoencoder architecture. The first half is the
encoding section resulting in a bottleneck layer (middle section),
followed by a decoding section which is mostly the inverse of the
encoding section.
With the advancement of CNNs, the AE adopted convolution layers instead of
RBMs as the building block. However, the 1D version of the image does not
preserve the 2D or 3D layout of the image. This issue is resolved in
a modified architecture named the convolutional AE (CAE).
The deep CAE, commonly known simply as
CAE, is used for object classification as a semi-supervised or
unsupervised technique, depending on the method with which it is merged. In
both cases, the compressed bottleneck code is computed and used as the
input to a classification model such as an MLP with a SoftMax classifier or an SVM,
or to a clustering method such as the
Gaussian mixture model (GMM), k-means, or the self-organizing map (SOM).
Many combinations of multiple methods are widely studied and used as
self-supervised, semi-supervised, or unsupervised architectures.
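A minimal sketch of this downstream step (the stand-in encoder, channel counts, and the choice of 10 clusters are illustrative assumptions): the bottleneck codes produced by a convolutional encoder are handed to an off-the-shelf clustering model; an SVM or MLP classifier would consume the same codes in the semi-supervised case.

```python
# Minimal sketch: bottleneck codes from a CAE encoder fed to k-means clustering.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Stand-in encoder and data; in practice these come from a trained CAE and a real dataset.
encoder = nn.Sequential(
    nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),    # 28x28 -> 14x14
    nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),   # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 32),                              # 32-D bottleneck code
)
images = torch.randn(100, 1, 28, 28)

with torch.no_grad():
    codes = encoder(images).numpy()                          # compressed bottleneck features

clusters = KMeans(n_clusters=10, n_init=10).fit_predict(codes)  # unsupervised grouping
```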
It was noted that decreasing the
number of nodes with depth is a trade-off between dimensionality
reduction and feature extraction; the rate of this reduction depends solely on
the objective function of the application. When the AE is used to regenerate
an image's missing sections, the specific dimension of the bottleneck layer
is assumed to be unimportant.
On the other hand, the bottleneck layer is the essence of the data when
the AE is used as a dimensionality-reduction-based feature extractor. The
optimum size of the bottleneck layer is dataset-dependent and treated as a
hyperparameter. For example, it was chosen as 256-D for the ORL and Yale
datasets, 60-D for MNIST, 100-D for Fashion-MNIST, 160-D for the COIL20
dataset, and 160-D for the astronomical dataset. The encoder section in a
CAE is usually derived from successful CNN models such as AlexNet or ResNet,
or from random convolution layers that inherit the properties of a typical
CNN architecture. The filter sizes remain 3x3, 5x5, and 7x7, the same as in
traditional CNNs, while the number of filters and the depth remain
hyperparameters.
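A minimal CAE sketch along these lines (the channel counts and 28x28 input are illustrative assumptions; the 160-D code echoes the bottleneck sizes quoted above, and the 3x3 filters follow the convention just described):

```python
# Minimal sketch: convolutional AE with 3x3 filters and a 160-D bottleneck.
import torch
import torch.nn as nn

class CAE(nn.Module):
    def __init__(self, code_dim=160):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, code_dim),                        # bottleneck code
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 32 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(8, 1, 28, 28)
recon = CAE()(x)          # reconstruction has the same spatial shape as the input
```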
The variational autoencoder (VAE) offers better organization of the
latent space than the typical AE by adding a regularization term to the
loss function. The standard AE is prone to learning only the identity
function during training when the number of hidden layers is increased. The
denoising autoencoder (DA) counters this problem by adding noise to the input,
forcing the AE to denoise and reconstruct it. The DA offers excellent
denoising capability with no distinct difference in the learned filters, as shown in
Figure 7. In the experiment, the filter sizes were set to 7x7, 5x5, and 3x3
over the MNIST and CIFAR-10 datasets with 2x2 max-pooling layers.
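A minimal denoising training step, assuming a shallow fully connected AE and a Gaussian corruption level of 0.3 (both illustrative, not the experimental setup above): noise is added only to the input, while the clean image remains the reconstruction target, which prevents the network from simply learning the identity function.

```python
# Minimal sketch: one training step of a denoising autoencoder (DA).
import torch
import torch.nn as nn

ae = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                   nn.Linear(128, 784), nn.Sigmoid())
opt = torch.optim.SGD(ae.parameters(), lr=1e-2)

clean = torch.rand(64, 784)                     # stand-in batch of flattened images
noisy = clean + 0.3 * torch.randn_like(clean)   # corrupt the input only
loss = nn.functional.mse_loss(ae(noisy), clean) # reconstruct the clean target
loss.backward()
opt.step()
```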