Koray et al. applied sparse coding to learn a convolutional filter bank (dictionary) whose filters produce quasi-sparse features. The study compared the patch-based sparse coding model with the convolutional sparse coding model; the generated filters are shown in Figures 8 and 9. The filters produced by convolutional sparse coding reduced redundancy between feature vectors at nearby locations and improved overall efficiency. The size of the convolutional filter bank was determined by the convolutional formulation rather than by an input-dependent dictionary size, and this formulation was preferred over the traditional convolution approach to reduce complexity. The filter sizes were selected as 9x9x64 and 9x9x256 for the first and second layers, respectively; however, no concrete argument is given for these particular sizes and numbers. The authors claimed that the approach learns better filters than convolutional RBMs and traditional sparse coding.
FIGURE 9 Second-stage filters of convolutional sparse coding
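A minimal sketch of the patch-based sparse-coding baseline discussed above is given below, using scikit-learn's MiniBatchDictionaryLearning. The 9x9 filter size and 64 atoms mirror the first-layer figures quoted in the text; the random stand-in image and the sparsity penalty are illustrative assumptions, and the convolutional formulation of Koray et al. is not reproduced here.

```python
# Patch-based sparse coding: learn a 9x9 filter bank (dictionary) from patches.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

rng = np.random.default_rng(0)
image = rng.random((96, 96))                      # stand-in for a training image

patches = extract_patches_2d(image, (9, 9), max_patches=2000, random_state=0)
X = patches.reshape(len(patches), -1)
X -= X.mean(axis=1, keepdims=True)                # remove per-patch DC component

dico = MiniBatchDictionaryLearning(n_components=64, alpha=1.0,
                                   batch_size=256, random_state=0)
dico.fit(X)
filters = dico.components_.reshape(64, 9, 9)      # learned 9x9 filter bank
codes = dico.transform(X)                         # quasi-sparse feature codes
print(filters.shape, (codes != 0).mean())         # dictionary shape, code density
```

In the convolutional formulation, by contrast, the code is computed over the whole image rather than per patch, which is what removes the shifted-duplicate redundancy between nearby locations.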
4.2 | Clustering
Clustering methods aim at grouping data points that possess "likeness" according to a similarity measure. Whether or not the groups carry labels, grouping "similar" data points is the core concept of clustering algorithms. Similarity can be defined in many ways; for image clustering, mainly distance-based or probabilistic partitional methods are used. Distance-based methods mostly rely on Euclidean or cosine measures, whereas probabilistic methods use probability scores for decision-making. The widely used performance criteria for cluster assignment are intra-cluster compactness and inter-cluster separability: the goal is to maximize compactness within each cluster (i.e., minimize within-cluster distances) and maximize the distance between clusters. Compared to supervised methods, clustering methods require very little domain knowledge. In semi-supervised architectures, supervised algorithms are mainly used for feature learning, followed by clustering methods for grouping the objects (as an alternative to the supervised classification method). However, it has been noted that training convolutional filters with clustering techniques is promising and can yield general-purpose visual features. A few distance-based methods have been applied to feature extraction as filters, and such studies are briefly discussed here.
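The two criteria mentioned above can be made concrete with a small sketch: intra-cluster compactness measured as the mean distance of points to their assigned centroid, and inter-cluster separability as the minimum pairwise distance between centroids. The function name and the toy data below are illustrative, not taken from any cited study.

```python
# Compactness (lower is tighter) and separability (higher is better separated).
import numpy as np

def compactness_and_separability(points, labels, centroids):
    compactness = np.mean(np.linalg.norm(points - centroids[labels], axis=1))
    pairwise = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    separability = pairwise[np.triu_indices(len(centroids), k=1)].min()
    return compactness, separability

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
lbl = np.array([0] * 50 + [1] * 50)
cen = np.array([pts[lbl == k].mean(axis=0) for k in (0, 1)])
print(compactness_and_separability(pts, lbl, cen))  # tight clusters, far apart
```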
4.2.1 | K-means
K-means is a commonly adopted clustering algorithm due to its simplicity. The fundamental idea is to find the centroids that minimize the distance between each point and its nearest centroid in Euclidean space. The number of clusters (K) is the main hyperparameter and must be defined in advance. K-means as a feature-learning module can lead to excellent results; however, modifications are required depending on the dataset and the objective function.
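A minimal sketch of standard K-means with scikit-learn follows; K is fixed up front, and the fitted centroids minimize the summed squared Euclidean distance of points to their nearest centroid (the inertia). The synthetic data is only a stand-in.

```python
# Standard K-means: K is a hyperparameter chosen before fitting.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (100, 2)) for m in (0.0, 2.0, 4.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # K = 3
print(km.cluster_centers_)   # learned centroids
print(km.inertia_)           # within-cluster sum of squared distances
```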
Adam and Andrew used K-means to obtain a dictionary of linear filters. The filter size was chosen as 6x6 over 32x32 input images and convolved with a stride of 1. The numbers of filters (K1, K2, and K3) for the three layers were chosen as 1600, 3200, and 3200, respectively. The experiment focused on selecting local receptive fields, and no discussion was provided on how the number of clusters was determined for the three layers. In a different approach, when K-means clustering is applied to images for feature learning, the data points are the pixels or image patches and the centroids act as the filters. The dictionary size was set equal to the patch dimensionality: the patch size, a hyperparameter, was selected as 16x16, and hence the centroid dictionary was set to 256. The patches were selected randomly from the input, and the number of selections (around 10000) was also treated as a hyperparameter; with the large datasets now available, random patch selection can be avoided. K-means centroids efficiently detect low-frequency edges but perform poorly in the recognition task. As a remedy, the images were whitened before the filters were trained, since whitening tends to produce more orthogonal centroids.
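A hedged sketch of that patch-based pipeline is shown below: sample random patches, ZCA-whiten them, and use the K-means centroids as a filter dictionary. The patch size (16x16), dictionary size (256), and roughly 10000 sampled patches follow the figures quoted above; the stand-in images, the whitening epsilon, and other details are illustrative assumptions.

```python
# Random patches -> ZCA whitening -> K-means centroids used as filters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.image import extract_patches_2d

rng = np.random.default_rng(0)
images = rng.random((20, 64, 64))                       # stand-in training images

patches = np.vstack([
    extract_patches_2d(img, (16, 16), max_patches=500, random_state=0)
    for img in images])                                 # ~10000 random 16x16 patches
X = patches.reshape(len(patches), -1).astype(np.float64)
X -= X.mean(axis=0)

# ZCA whitening decorrelates patch dimensions, pushing centroids toward orthogonality.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
zca = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-2)) @ eigvecs.T
Xw = X @ zca

km = KMeans(n_clusters=256, n_init=3, random_state=0).fit(Xw)
filters = km.cluster_centers_.reshape(256, 16, 16)      # centroids act as filters
print(filters.shape)
```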
A. Dundar et al. compared classical K-means with convolutional K-means for learning feature-extraction filters. Figure 10 shows the filters learned via classical K-means and convolutional K-means. The filters highlighted with red boxes in the classical K-means dictionary are essentially shifted versions of one another, creating many centroids with similar orientations and generating redundant feature maps. A widely noted issue with classical K-means is that its efficiency drops as the input dimensionality increases. Even for small images, the patch size directly affects the quality of the filters learned by K-means: beyond a certain point, larger patches result in poor performance, and the optimal size (typically 6x6 or 8x8) remains a hyperparameter. The number of trainable parameters also grows with the depth of the model; random selection of patches is the widely adopted remedy.
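The convolutional K-means idea can be sketched as follows: instead of clustering patches cut at fixed positions, each larger window contributes only the patch at the location where some centroid responds most strongly, which discourages shifted duplicates of the same filter. The sketch below is an illustrative simplification under that assumption, not the exact procedure of A. Dundar et al.; window size, filter size, and data are placeholders.

```python
# One assignment/update pass of a simplified convolutional K-means.
import numpy as np

def conv_kmeans_step(windows, centroids, psize):
    """windows: (N, W, W) image windows; centroids: (K, psize*psize) unit-norm filters."""
    K = len(centroids)
    sums, counts = np.zeros_like(centroids), np.zeros(K)
    offsets = range(windows.shape[1] - psize + 1)
    for win in windows:
        best = (-np.inf, None, None)                    # (response, centroid, patch)
        for i in offsets:                               # scan all patch positions
            for j in offsets:
                patch = win[i:i + psize, j:j + psize].ravel()
                patch = patch - patch.mean()
                resp = centroids @ patch                # correlation with every centroid
                k = int(np.argmax(resp))
                if resp[k] > best[0]:
                    best = (resp[k], k, patch)
        _, k, patch = best                              # only the best-matching patch counts
        sums[k] += patch
        counts[k] += 1
    nonzero = counts > 0
    centroids[nonzero] = sums[nonzero] / counts[nonzero, None]
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    return centroids / np.maximum(norms, 1e-8)          # keep centroids unit-norm

rng = np.random.default_rng(0)
windows = rng.random((200, 12, 12))                     # stand-in 12x12 windows
centroids = rng.standard_normal((16, 8 * 8))            # 16 filters of size 8x8
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
for _ in range(3):
    centroids = conv_kmeans_step(windows, centroids, psize=8)
print(centroids.reshape(16, 8, 8).shape)
```

Because a centroid that already matches a shifted version of a pattern will claim the window at that shift, fewer near-duplicate centroids with the same orientation survive the updates, which is the redundancy reduction visible in Figure 10.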