5 | SUMMARY
In the late 20th century, hand-crafted features were studied extensively. The
restricted Boltzmann machine and, later, the auto-encoder were applied
directly to feature learning. PCA, ICA, ZCA, Gabor filters, and many other
statistical methods were also applied and showed promise. After LeNet-5,
backpropagation-trained convolutional NNs attracted considerable attention
from the research community, and hybrid structures began to be studied.
However, the lack of computational resources and rich databases left many of
these algorithms abandoned for decades and almost forgotten.
Since AlexNet's promising result in the 2012 ImageNet competition, hundreds of
supervised architectures of various sizes and shapes have been proposed over
the last decade. As discussed earlier, they differ structurally; however, they
all rely on backpropagation as the primary training mechanism, mostly with the
Adam, SGD, or RMSprop optimizers. Gradient calculation is computationally
expensive, and gradients are prone to vanishing or exploding. BP also suffers
from the weight transport problem, which is not believed to occur in the human
brain and is considered biologically implausible. As models grow deeper to
approximate complex objective functions, interpretability fades due to the
nonlinear activation functions and the large number of nodes in the hidden
layers. This lack of interpretability leaves minimal opportunity for
significant change, and as a result, supervised models are nearly at a
saturation point in terms of improvement. Another limitation is the need for
large labeled datasets, which are essential for training but very costly in
time and energy to obtain in the real world. Such models are data-driven;
hence, deployment in a new field requires a rich field-specific dataset, which
is not an optimal path toward the universal goal of true Artificial
Intelligence.
The concept of convolution, however, is not a dead end; convolutional NNs have
proven clearly superior to traditional MLPs for feature learning on images.
The limitation lies in the learning mechanism. Various unsupervised
architectures, ranging from probabilistic models to distance-based methods,
have been applied to counter this issue. Their advantages are that they are
less complex, faster, and trainable even on small unlabelled datasets, which
is very convenient for broad application. The downside is their efficiency;
there is much room for improvement. Another issue with unsupervised learning
is labeling the resulting clusters, since the training data are unlabelled.
Self-labeling remains an open question, and many self-supervised and
semi-supervised methods have been proposed as solutions. Few-shot and one-shot
learning have also given promising results for labeling clusters.
For the core learning, K-means clustering has been studied extensively and has
shown superior efficiency compared with other clustering and probabilistic
methods. However, K-means does not follow the topology of the patches during
training. Robustness to noisy datasets is a crucial requirement in real-world
applications, and K-means is prone to noise. To counter these limitations,
self-organizing maps (SOMs) have been applied. Thanks to their neighborhood
learning, SOMs preserve the topology of the input data: distances between
nodes on the SOM map reflect distances in the high-dimensional input space.
K-means does something similar, but visualization becomes tricky because its
cluster centers are not arranged on a convenient 2D grid.
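As a minimal illustration of the neighborhood update that gives the SOM this topology-preserving property, the following sketch uses an assumed 2D grid with linearly decaying learning-rate and neighborhood schedules; the parameter values are illustrative, not taken from any surveyed implementation.

```python
import numpy as np

def train_som(patches, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal 2D SOM: each grid unit holds a prototype vector.

    Unlike K-means, the winner's neighbors on the grid are also pulled toward
    the input, so nearby units end up coding similar patches (topology
    preservation).
    """
    rng = np.random.default_rng(seed)
    n_units, dim = grid[0] * grid[1], patches.shape[1]
    weights = rng.standard_normal((n_units, dim)) * 0.1
    # fixed 2D coordinates of every unit on the map
    coords = np.array([(i, j) for i in range(grid[0])
                       for j in range(grid[1])], dtype=float)

    n_steps = epochs * len(patches)
    step = 0
    for _ in range(epochs):
        for x in patches[rng.permutation(len(patches))]:
            t = step / n_steps                    # 0 -> 1 over training
            lr = lr0 * (1.0 - t)                  # decaying learning rate
            sigma = sigma0 * (1.0 - t) + 0.5      # shrinking neighborhood
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
            # Gaussian neighborhood measured on the grid, not in input space
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)  # pull toward the input
            step += 1
    return weights.reshape(grid[0], grid[1], dim)
```

Setting h to one at the BMU and zero elsewhere recovers an online K-means update, which is exactly where the topology preservation, and with it the convenient 2D visualization, is lost.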
The SOM has been studied extensively for feature learning, both in its
original form and as a source of convolving filters. These models, however,
have been shallower than supervised models, and deeper models have proven
necessary for substantial learning. In DSOM, EDSOM, and UDSOM, nodes that
never become BMUs are dropped, either during the selection process or via
ReLU. In D-CSNN, the SOM weight vectors were trained traditionally, and the
BMU was computed using a Hebbian-trained mask layer; the approach was unique
and promising, but it was not followed up with further experiments.
Convolutional sparse coding also generates good features but lacks sharpness,
and it remains one of the least studied algorithms.
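To connect this with the idea of the SOM as a source of convolving filters, a rough sketch follows that reuses the train_som helper from the previous sketch; the single-image setting, patch size, and grid size are illustrative assumptions rather than the setup of DSOM, D-CSNN, or the other surveyed models.

```python
import numpy as np

def extract_patches(img, k=5):
    """All overlapping k x k patches of a 2D image, flattened into rows."""
    H, W = img.shape
    return np.array([img[i:i + k, j:j + k].ravel()
                     for i in range(H - k + 1)
                     for j in range(W - k + 1)], dtype=float)

def som_feature_maps(img, k=5, grid=(8, 8)):
    """Train a SOM on the image's own patches and use every learned prototype
    as a filter; correlating all patches with the filters gives one feature
    map per SOM unit (valid padding)."""
    patches = extract_patches(img, k)
    patches -= patches.mean(axis=1, keepdims=True)       # remove per-patch mean
    filters = train_som(patches, grid=grid).reshape(-1, k * k)
    responses = patches @ filters.T                       # (n_patches, n_units)
    out_h, out_w = img.shape[0] - k + 1, img.shape[1] - k + 1
    return responses.T.reshape(-1, out_h, out_w)          # (n_units, out_h, out_w)
```

Pooling over each resulting map would then give a fixed-length feature vector per image.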
Dimensionality reduction techniques have also been tried as feature
extractors. One of the most studied is PCA, in which an orthonormal basis
derived from the covariance matrix can give promising results. However, PCA is
a linear technique and underperforms on the nonlinearities of real-world
datasets. PCA also discards low-variance directions, and the importance of
low-variance features is still not known for image datasets. Kernel PCA is an
extension of PCA that captures nonlinear relations between cluster nodes and
the high-dimensional input; however, it has not been studied for this purpose.
As a nonlinear alternative, the auto-encoder, an unsupervised nonlinear
dimensionality reduction technique, has been used extensively. However, AEs
and CAEs are essentially traditional MLP- and CNN-based architectures with
hidden layers of decreasing size, and they offer little variety for filter
design.
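Returning to the PCA route above, the following is a minimal sketch of deriving orthonormal filters from the covariance matrix of flattened image patches; the number of retained components is an arbitrary choice here.

```python
import numpy as np

def pca_filters(patches, n_filters=8):
    """Orthonormal filters from the covariance matrix of flattened patches.

    patches: (n_patches, k*k) array. Returns (n_filters, k*k): the leading
    eigenvectors of the covariance matrix, i.e. the principal directions
    ordered by decreasing explained variance.
    """
    X = patches - patches.mean(axis=0)          # center the data
    cov = (X.T @ X) / (len(X) - 1)              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # strongest components first
    return eigvecs[:, order[:n_filters]].T      # rows are orthonormal filters
```

Since projections onto these filters are purely linear in the input, the nonlinearity limitation noted above is inherent; kernel PCA instead eigendecomposes a centered kernel matrix over the patches to relax it.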
As mentioned earlier, the lack of interpretability is choking the improvement
of CNN-based models. One attempt explains the CNN as a multi-layer RECOS
(REctified-COrrelations on a Sphere) model; it was a supervised approach
trained with BP. The Saak and Saab transforms provide insightful
interpretations of the CNN together with novel learning concepts; in both, the
filters of the convolution layers are computed by PCA. In Saab, selecting a
sufficiently large bias term eliminates the need for a nonlinear activation;
however, it may inflate the weight values, and the benefit of standardization
may vanish.
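As a minimal illustration of that bias argument (the exact bias selection rule used in Saab is not reproduced here; the responses could come, for instance, from the pca_filters sketch above), the following shifts linear filter responses by a per-filter bias just large enough that a subsequent ReLU becomes the identity.

```python
import numpy as np

def bias_to_disarm_relu(responses):
    """responses: (n_samples, n_filters) linear filter outputs.

    Choose one bias per filter so that every shifted response is non-negative;
    a ReLU applied afterwards then changes nothing, and the layer stays an
    affine (hence interpretable) map.
    """
    bias = np.maximum(-responses.min(axis=0), 0.0)             # smallest sufficient shift
    shifted = responses + bias
    assert np.array_equal(np.maximum(shifted, 0.0), shifted)   # ReLU is now a no-op
    return shifted, bias
```

The drawback noted above is also visible here: the shift can be large relative to the centered responses, which works against standardization.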
Using K-means as a pseudo-label generator for the FC layer via linear
regression is a novel concept; however, the Saab layers were not trained with
the FC layers in an end-to-end fashion. End-to-end learning, as in deep
clustering, has produced promising results and may further improve the filters
through a feedback loop. It is vital to investigate the labeling of clusters
with minimal use of labeled data. Some research has been done on labeling with
online algorithms, so that no human intervention is needed in self-learning.
This lies outside the scope of this paper, but it is very tricky for an
algorithm to decide which label to assign when not even a single labeled
sample is given. Finally, the effects of batch size and number of epochs are
not discussed in this paper, as almost no studies on them were found; they
remain an open question.