5 | SUMMARY
In the late 20th century, hand-crafted features were studied extensively. The restricted Boltzmann machine and, later, the auto-encoder were applied directly to feature learning. PCA, ICA, ZCA, Gabor filters, and many other statistical methods were applied and showed promise. After LeNet-5, backpropagation-trained convolutional neural networks attracted much attention from the research community, and hybrid structures began to be studied. The lack of computational facilities and rich databases left many algorithms abandoned for decades and almost forgotten.
Since the promising result of AlexNet in the 2012 ImageNet competition, hundreds of supervised architectures of various sizes and shapes have been proposed over the last decade. As discussed earlier, they differ from a structural point of view; however, they all use backpropagation as the primary training mechanism, mostly with the Adam, SGD, and RMSprop optimizers. Gradient calculation is computationally expensive, and gradients are also prone to vanishing or exploding. BP also suffers from the weight-transport problem, which is not believed to occur in the human brain and is therefore considered biologically implausible. As models grow deeper to approximate complex objective functions, interpretability fades due to the nonlinear activation functions and the large number of nodes in hidden layers. This lack of interpretability leaves minimal opportunity for significant change, and as a result, supervised models are almost at the saturation point in terms of improvement. Another reason is the need for large labeled datasets, which are essential for training but very costly in time and energy to obtain in the real world. Such models are data-driven; hence, deployment in a new field requires a rich, field-specific dataset, which is not an optimal path toward the universal goal of true Artificial Intelligence.
However, the concept of convolution is not a dead end; convolutional neural networks have proven clearly superior to traditional MLPs for feature learning on images. The limitation lies instead in the learning mechanism. Various unsupervised architectures, ranging from probabilistic models to distance-based methods, have been applied to counter this issue. The advantages of such models are that they are less complex, faster, and trainable even on a small unlabeled dataset, which is very convenient for broad application. The downside, however, is efficiency; there is much room for improvement. Another issue with unsupervised learning is labeling the clusters, since the training data are unlabeled. Self-labeling remains an open question, and many self-supervised and semi-supervised methods have been proposed as solutions. Few-shot and one-shot learning have also given promising results for labeling clusters. For the core learning, K-means clustering has been studied extensively and has shown superior efficiency among clustering and probabilistic methods. However, K-means does not follow the topology of the patches during training. Robustness to noisy data is a crucial requirement in real-world applications, and K-means is prone to noise. To counter these limitations, self-organizing maps (SOM) have been applied. Thanks to its neighborhood-learning characteristic, the SOM preserves the topology of the input data, so the distances between nodes on the SOM map reflect distances in the high-dimensional input space. K-means performs a similar partitioning, but visualization becomes tricky because its cluster centers are not arranged on a convenient 2D grid.
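To make the contrast concrete, the sketch below trains a small SOM on flattened image patches; the grid size, the learning-rate and radius schedules, and the function name train_patch_som are illustrative assumptions rather than a prescribed configuration. The Gaussian neighborhood term is what pulls neighboring nodes toward similar patches and hence preserves topology, whereas a K-means update would move only the single closest centroid.

```python
import numpy as np

def train_patch_som(patches, grid=(10, 10), epochs=5,
                    lr0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM on flattened image patches (illustrative sketch only).

    patches : (n_samples, patch_dim) array of flattened patches.
    Returns the trained weight grid of shape (grid_h, grid_w, patch_dim).
    """
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.normal(size=(h, w, patches.shape[1]))
    # Node coordinates on the 2D map, used by the neighborhood function.
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                  indexing="ij"), axis=-1)

    n_steps = epochs * len(patches)
    step = 0
    for _ in range(epochs):
        for x in patches[rng.permutation(len(patches))]:
            # Linearly decay the learning rate and neighborhood radius.
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 1e-3
            # Best-matching unit (BMU): the node closest to the input patch.
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), (h, w))
            # Gaussian neighborhood around the BMU on the 2D grid: nearby
            # nodes are updated too, which is what preserves topology.
            grid_d2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
            nbh = np.exp(-grid_d2 / (2.0 * sigma ** 2))[..., None]
            weights += lr * nbh * (x - weights)
            step += 1
    return weights
```

Because every node near the BMU is updated, prototypes that sit next to each other on the map end up representing similar patches; K-means centroids carry no such spatial relation to one another.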
The SOM has been studied extensively for feature learning, both in its original form and as convolving filters. However, these models have been much shallower than supervised models, and deeper models have proven necessary for substantial learning. In DSOM, EDSOM, and UDSOM, the nodes that were not BMUs were dropped, either in the selection process or via ReLU. In D-CSNN, the SOM weight vectors were trained in the traditional way, and the BMU was computed using a Hebbian-trained mask layer; the approach was unique and promising but was not followed by subsequent experiments. Convolutional sparse coding also generates good features but lacks sharpness, and it is one of the least studied algorithms.
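As an illustration of the "SOM as convolving filters" idea, the sketch below slides trained SOM prototypes over an image to produce one feature map per node. It is a generic patch-correlation sketch under assumed names and patch size, not a reproduction of DSOM, D-CSNN, or any specific published pipeline.

```python
import numpy as np

def som_feature_maps(image, som_weights, patch_size=7):
    """Use trained SOM prototypes as correlation filters (illustrative sketch).

    image       : (H, W) grayscale array.
    som_weights : (grid_h, grid_w, patch_size * patch_size) prototypes,
                  e.g. the output of train_patch_som above.
    Returns feature maps of shape (n_nodes, H - patch_size + 1, W - patch_size + 1).
    """
    filters = som_weights.reshape(-1, patch_size, patch_size)
    out_h = image.shape[0] - patch_size + 1
    out_w = image.shape[1] - patch_size + 1
    maps = np.zeros((len(filters), out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + patch_size, j:j + patch_size]
            # Response of every SOM prototype to this patch; each prototype
            # acts as one (unflipped) convolution filter.
            maps[:, i, j] = (filters * patch).sum(axis=(1, 2))
    return maps
```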
Dimensionality reduction techniques have also been tried as feature extractors. One of the most studied is PCA, in which the orthonormal basis derived from the covariance matrix can give promising results. However, PCA is a linear technique and underperforms on the nonlinearity of real-world datasets. PCA also discards low-variance components, and the importance of low-variance features is still not known for image datasets. Kernel PCA is an extension of PCA that captures nonlinear relations between cluster nodes and the high-dimensional input; however, it has not been studied in this context. As a solution, the auto-encoder, a nonlinear unsupervised dimensionality reduction technique, has been used extensively. Though AE and CAE are essentially traditional MLP- and CNN-based architectures with hidden layers of decreasing size, they offer no variety for filter design.
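A minimal sketch of PCA-based filter extraction is given below: it derives a small orthonormal filter bank from the covariance of flattened image patches via SVD. The function name and the choice of eight filters are assumptions for illustration; the sketch also makes the low-variance issue explicit, since every component beyond n_filters is simply discarded.

```python
import numpy as np

def pca_patch_filters(patches, n_filters=8):
    """Derive filters from the top principal components of image patches
    (a minimal sketch of PCA-based filter learning).

    patches : (n_samples, patch_dim) array of flattened patches.
    Returns (n_filters, patch_dim) orthonormal basis vectors.
    """
    # Center the patches; PCA is defined on the covariance of centered data.
    centered = patches - patches.mean(axis=0, keepdims=True)
    # Eigenvectors of the covariance matrix via SVD of the data matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Rows of vt are ordered by explained variance; the low-variance
    # directions beyond n_filters are discarded, as noted above.
    return vt[:n_filters]
```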
As mentioned earlier, the lack of interpretability is choking the improvement of CNN-based models. The CNN has been explained as a multi-layer RECOS (REctified-COrrelations on a Sphere) model, a supervised approach trained with BP. The Saak and Saab transforms provide insightful interpretations of the CNN and novel learning concepts. In both of them, the filters of the convolution layers are calculated by PCA. In Saab, selecting a sufficiently large bias term eliminates the need for a nonlinear activation; however, it may inflate the weight values, so the whole purpose of standardization may vanish. Using K-means as a pseudo-label generator for the FC layers, which are then fitted via linear regression, is a novel concept. In Saab, however, the Saab layers were not trained with the FC layers in an end-to-end fashion. End-to-end learning has produced promising results, as in DeepCluster, and may improve the filters by providing a feedback loop. It is vital to investigate the labeling of the clusters with minimal use of labeled datasets. Some research has been done on labeling with online algorithms to prevent any human intervention in self-learning. This is beyond the scope of this paper, but it is very tricky for an algorithm to decide which label to assign if not even a single labeled sample is given. Finally, the effects of batch size and the number of epochs are not covered in this paper, as almost no studies on them were found; they remain an open question.
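The cluster-to-pseudo-label idea can be sketched as follows: K-means assigns each feature vector a pseudo-label, and a fully connected layer is then fitted by ridge-regularized least squares to predict the one-hot pseudo-labels. The function name, the regularization term, and the use of scikit-learn's KMeans are assumptions for illustration; this is not the exact Saab procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def fc_from_pseudo_labels(features, n_clusters=10, reg=1e-3, seed=0):
    """Fit one FC layer from K-means pseudo-labels via regularized least
    squares (a hedged sketch of the pseudo-label idea, not the exact Saab
    procedure).

    features : (n_samples, feat_dim) responses from the convolutional stage.
    Returns (W, b) mapping features to n_clusters pseudo-label scores.
    """
    # Step 1: K-means assigns each sample to a cluster (its pseudo-label).
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(features)
    targets = np.eye(n_clusters)[labels]          # one-hot pseudo-labels
    # Step 2: linear regression from features to the one-hot targets.
    X = np.hstack([features, np.ones((len(features), 1))])  # bias column
    A = X.T @ X + reg * np.eye(X.shape[1])
    coefs = np.linalg.solve(A, X.T @ targets)
    W, b = coefs[:-1], coefs[-1]
    return W, b
```

An end-to-end variant in the spirit discussed above would re-extract features with the updated model and repeat the two steps, closing the feedback loop between the filters and the FC stage.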