FIGURE 3 VGG-19 (left), ResNet-34 (middle), and ResNet-34 with residual blocks (right)
Numerous implementations of the above algorithms exist, e.g., Barlow Twins, knowledge distillation, learning via ignoring, fusion of quality maps, dual-stream convolution–Transformer segmentation frameworks, DIFFNet, few-shot learning, Vision Transformers, and many hybrid models. Most semi-supervised learning techniques train on the smaller labeled portion of the dataset and are then tuned on the larger unlabeled portion, or vice versa. Such hybrid algorithms use AlexNet, ResNet, VGG, and similar variants as the backbone architecture. In such cases, contributions to filter design are rarely observed.
The filter size and depth of a network are directly proportional to the number of trainable parameters and the input data size. The experiments from LeNet-1 to LeNet-4 concluded that the size of a convolutional network should match the training set for optimum use of the data. For LeNet-5, it was noted that a larger architecture did not perform any better despite its increased complexity and longer training time. A similar result was observed for ResNet-152, DenseNet-121, and DenseNet-169/201 versus AlexNet-8. A smaller network trained on a smaller dataset tends to perform worse due to insufficient training. For AlexNet, it was noted that the architecture suffered from under-training on 1000 classes due to the small training set size, and data augmentation was used to meet the requirement. To reduce the number of weights, deeper models like DenseNet-100+, ResNet-50+, and similar variants have been modified with techniques such as connection dropout and layer skipping. However, the filter sizes in deeper models are the same as those discussed for traditional supervised models and offer no variety. To summarise the filter sizes adopted in the discussed architectures, we may conclude that the widely used sizes are 1x1, 3x3, 5x5, and 7x7. The larger sizes are chosen in the initial layers to capture key spatial regions, while later sections use comparably smaller sizes for higher-level features. From the learning perspective, these models mostly use backpropagation with the Adam or SGD optimizer. A few semi-supervised methods also claim to use various unsupervised techniques for fine-tuning.
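The proportionality between filter size and trainable parameters can be made concrete with the standard parameter-count formula for a 2-D convolutional layer. The sketch below is illustrative only; the channel counts are hypothetical and not taken from any architecture discussed above.

```python
def conv_params(k, in_ch, out_ch, bias=True):
    # Trainable parameters of a k x k convolutional layer:
    # (k * k * in_channels + 1) * out_channels, where the +1 is the bias term.
    return (k * k * in_ch + int(bias)) * out_ch

# Parameter counts for the widely used filter sizes, with illustrative
# (hypothetical) 64-channel input and output:
for k in (1, 3, 5, 7):
    print(f"{k}x{k}: {conv_params(k, in_ch=64, out_ch=64)} parameters")
```

The quadratic growth in k is why larger filters are usually confined to the initial layers, while 1x1 and 3x3 dominate the deeper sections.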
4 | FILTER DESIGN IN UNSUPERVISED LEARNING
The drawback of supervised learning is its requirement for large, labeled, domain-specific datasets, which are often unavailable in the real world. Unsupervised learning does not require a labeled dataset; sometimes, a smaller dataset is enough. One of the most significant advantages of unsupervised learning is backpropagation-free training, which is faster and lighter. However, unsupervised training faces two significant challenges: effective learning and labeling. In supervised methods, classes have labels according to the labeled dataset, while unsupervised methods develop clusters based on the underlying structure of the data. However, the developed clusters still need sensible labels, which is a separate challenge. Many algorithms have been proposed as self-labeling methods (self-supervised or semi-supervised) that primarily use a small set of labeled data. Labeling is outside the scope of the current study, and the following discussion focuses on unsupervised learning. Unsupervised approaches in computer vision are mainly probability-based or distance-based. Probability-based methods use likelihood scores, while distance-based methods use 'distance' as the primary deciding factor. Distance-based clustering techniques are widely adopted: a smaller distance between objects indicates greater 'similarity', and such objects might belong to the same or nearby clusters. There are three significant distance-measuring methods: cosine similarity, Manhattan distance, and Euclidean distance. Cosine similarity is based on the angle between vectors (via the dot product); the smaller the angle, the more similar the vectors are, regardless of their magnitude. Euclidean distance is the L2 norm of the difference between vectors in a Euclidean space, while Manhattan distance is the L1 norm.
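The three distance measures above can be sketched in a few lines of NumPy; the vectors used here are arbitrary examples chosen to show that cosine similarity ignores magnitude while the norms do not.

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based: dot product normalized by magnitudes; 1.0 means same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    # L2 norm of the difference vector.
    return float(np.linalg.norm(a - b, ord=2))

def manhattan(a, b):
    # L1 norm of the difference vector.
    return float(np.linalg.norm(a - b, ord=1))

# b points in the same direction as a but with twice the magnitude:
a, b = np.array([1.0, 2.0]), np.array([2.0, 4.0])
print(cosine_similarity(a, b))   # 1.0 despite different magnitudes
print(euclidean(a, b))           # sqrt(5)
print(manhattan(a, b))           # 3.0
```

Note how the cosine measure reports the two vectors as identical (same angle), while both norms report a nonzero distance, which is why the choice of measure changes the clusters that form.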
In computer vision, unsupervised approaches are used for feature learning and clustering. Mixing two or more architectures results in a hybrid structure, e.g., when clustering is applied after a typical DCNN section, or for feature extraction followed by a support vector machine (SVM) or another classification method. In the latter cases, the clustering method does not contribute to actual feature extraction, as clusters are formed from features learned by the DCNN. Authentic unsupervised learning happens when the algorithms are implemented in their typical form to detect the object, used as filters in a typical CNN architecture, or combined in a mixture of two or three models. In their traditional form, unsupervised methods do not use convolution operations for feature extraction. Such methods diverge somewhat from the focus of this study; however, studying them is necessary to understand their implementation as convolutional layers.
4.1 | Probabilistic Methods
4.1.1 | Restricted Boltzmann Machine (RBM)
The restricted Boltzmann machine (RBM) is an energy-based nonlinear probabilistic model. An RBM has a single visible layer and a single hidden layer, with bidirectional, symmetric connections between them. It is typically used for dimensionality reduction, feature learning, modeling, and more. The RBM as a learning method has been studied extensively over the past decade.
The RBM is trained by the maximum-likelihood rule using the contrastive divergence learning procedure. The pixels, or inputs, are the visible units, as they are observable, while the feature detectors are the hidden units that capture the dependencies. The total number of weights equals the number of hidden nodes times the number of visible nodes. Each configuration of the hidden and visible layers is assigned a probability via an energy function normalized over the sum of all possible configurations. The goal is to lower the energy, by adjusting the weights and biases, for a particular image and to raise it for other images.
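The energy function and a single contrastive-divergence update can be sketched as follows. This is a minimal illustration assuming binary units and CD-1 (one Gibbs step); the layer sizes and learning rate are arbitrary choices, not values from the literature discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W, b, c):
    # E(v, h) = -b.v - c.h - v^T W h ; lower energy = higher probability
    return -(v @ b) - (h @ c) - v @ W @ h

def cd1_step(v0, W, b, c, lr=0.1):
    # One contrastive-divergence (CD-1) update from a single binary example.
    ph0 = sigmoid(v0 @ W + c)                        # P(h = 1 | v0), positive phase
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                      # reconstruction P(v = 1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)                        # negative phase
    # Positive phase pulls energy down for the data; negative phase raises it
    # for the model's reconstruction.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c

n_visible, n_hidden = 6, 3       # weight matrix has 6 * 3 = 18 entries
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
v = rng.integers(0, 2, n_visible).astype(float)
W, b, c = cd1_step(v, W, b, c)
```

The weight matrix shape makes the weight count explicit: one weight per visible-hidden pair, so `n_visible * n_hidden` in total, plus one bias per unit.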
The number of hidden layers is increased to deepen the model for better representations. For example, a deep belief network (DBN) is built by stacking RBM units. Furthermore, changing the depth, connections, and directions of communication has resulted in many other architectures, such as the deep Boltzmann machine (DBM), the shape Boltzmann machine (SBM), and the deep energy model (DEM). A two-layer sparse RBM was implemented on a raw natural-image dataset. The filters learned in the first layer were observed to be local, oriented, and edge-detecting. The first and second layers had 400 and 200 filters, respectively. A snapshot of some of the learned filters from the first layer is shown in Figure 4.