FIGURE 3 VGG-19 (left), ResNet-34 (middle), and ResNet-34 with residual blocks
Numerous implementations of the above algorithms exist, e.g., Barlow
Twins, knowledge distillation, learning via ignoring, fusion of quality
maps, the dual-stream convolution-Transformer segmentation framework,
DIFFNet, few-shot learning, the Vision Transformer, and many hybrid
models. Most semi-supervised learning techniques are trained on a
smaller labeled portion of the dataset and then tuned on the larger
unlabeled portion, or vice versa. Such hybrid algorithms use AlexNet,
ResNet, VGG, and similar variants as the backbone architecture. In such
cases, contributions regarding filter design are rarely observed.
The filter size and depth of the network are directly proportional to
the number of trainable parameters and the input data size. The
experiments from LeNet-1 to LeNet-4 concluded that the convolutional
network's size is directly related to the training-set size for the
optimum use of datasets.
For LeNet-5, it was noted that the larger architecture did not perform
any better despite the increased complexity and had a longer training
time. A similar result was observed for ResNet-152, DenseNet-121, and
DenseNet-169/201 versus AlexNet-8. A smaller network trained on a
smaller dataset tends to perform worse due to insufficient training.
For AlexNet, it was noted that the architecture suffered from
under-training on 1000 classes due to the small training dataset, and
data augmentation was used to meet the requirement. To reduce the number
of weights, deeper models like DenseNet-100+, ResNet-50+, and similar
variants have been modified with techniques such as connection dropout
and layer skipping. However, the filter sizes in deeper models are the
same as those discussed for traditional supervised models and offer no
variety. To summarise the filter sizes adopted in the discussed
architectures, we may conclude that
the widely used filter sizes are 1x1, 3x3, 5x5, and 7x7. The larger
sizes are chosen in initial layers to capture key spatial regions, while
later sections use comparably smaller sizes for higher-level features.
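To make the relationship between filter size and trainable parameters concrete, the following sketch computes the weight count of a single 2-D convolutional layer for the four common kernel sizes. The channel counts are illustrative assumptions for this example, not values taken from any specific network discussed above.

```python
def conv_params(k, in_ch, out_ch, bias=True):
    """Trainable parameters of one 2-D conv layer:
    k*k weights per (input channel, output channel) pair, plus one
    bias per output channel."""
    return k * k * in_ch * out_ch + (out_ch if bias else 0)

# Compare the common kernel sizes at illustrative 64-in/64-out channels.
for k in (1, 3, 5, 7):
    print(f"{k}x{k}: {conv_params(k, in_ch=64, out_ch=64)} parameters")
```

The quadratic growth in `k` is why larger kernels are typically confined to the initial layers, while the bulk of a deep network uses 3x3 filters.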
From the learning perspective, these models mostly use backpropagation
with the Adam or SGD optimizer. A few semi-supervised methods also claim
to use various unsupervised techniques for fine-tuning.
4 | FILTER DESIGN IN UNSUPERVISED LEARNING
The drawback of supervised learning is its requirement for large,
domain-specific labeled datasets, which are often unavailable in the
real world. Unsupervised learning does not require a labeled dataset;
sometimes, a smaller dataset is enough. One of the most
significant advantages of unsupervised learning is backpropagation-free
training, which is faster and lighter. However, unsupervised training
has two significant challenges: effective learning and labeling. In the
supervised methods, classes have labels according to the labeled
datasets, while unsupervised methods develop clusters (based on the
underlying structure of the data). However, the developed clusters need
sensible labels, which is a different challenge. Many algorithms are
proposed as self-labeling methods (self-supervised or semi-supervised)
that primarily use a small set of labeled data. Labeling is beyond the
scope of the current study, and the following discussion focuses on
unsupervised learning.
Unsupervised approaches in
computer vision are mainly probability-based and distance-based
learning. Probability-based methods use likelihood probability scores,
while distance-based methods use 'distance' as the primary deciding
factor. Distance-based clustering techniques are widely adopted. A
smaller distance between objects indicates more 'similarity', and they
might belong to the same or nearby clusters. There are three significant
distance measuring methods: Cosine similarity, Manhattan distance, and
Euclidean distance. Cosine is based on the angle between vectors (dot
product); the smaller the angle, the more similar the vectors are,
regardless of their magnitude. Euclidean distance is the L2 norm of the
distance between the vectors in a Euclidean space, while Manhattan is
the L1 norm.
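The three measures above can be sketched in a few lines of plain Python; the vectors used here are toy examples chosen only to illustrate each definition.

```python
import math

def cosine_similarity(a, b):
    # Angle-based: normalized dot product, invariant to vector magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def manhattan(a, b):
    # L1 norm of the difference vector.
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # L2 norm of the difference vector.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 0.0], [2.0, 0.0]
print(cosine_similarity(a, b))  # 1.0 — same direction despite different magnitudes
print(manhattan(a, b))          # 1.0
print(euclidean(a, b))          # 1.0
```

Note that the two collinear vectors are maximally similar under cosine similarity even though the norm-based distances between them are nonzero, which is why the choice of measure matters for clustering.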
In computer vision, unsupervised approaches are used for feature
learning and clustering. Combining two or more architectures results in
a hybrid structure, e.g., when clustering is used after a typical DCNN
section, or when a DCNN is used for feature extraction followed by a
support vector machine (SVM) or another classification method. In the
latter cases, the clustering method does not contribute to actual
feature extraction, as clusters are formed based on features learned by
the DCNN.
Authentic unsupervised learning happens when the algorithms are
implemented in their typical form to detect objects, used as filters in
a typical CNN architecture, or combined in a mixture of two or three
models.
In their traditional form, unsupervised methods do not use convolution
operations for feature extraction. Such methods diverge somewhat from
the focus of this study; however, studying them is necessary to
understand their implementation as convolutional layers.
4.1 | Probabilistic Methods
4.1.1 | Restricted Boltzmann Machine (RBM)
The restricted Boltzmann Machine
(RBM) is an energy-based nonlinear probabilistic model. An RBM has a
single visible layer and a single hidden layer, with bidirectional,
symmetric connections between them. It is typically used for
dimensionality reduction, feature learning, generative modeling, and
more. The
RBM as a learning method has been studied extensively in the recent
decade.
The RBM is trained by maximum likelihood using the contrastive
divergence learning procedure. The pixels or inputs are the visible
units, as they are observable, while the feature detectors are the
hidden units that capture the dependencies. The total number of weights
equals the number of nodes in the hidden layer times the number of nodes
in the visible layer.
Each joint configuration of the visible and hidden units is assigned a
probability via an energy function, normalized by the sum over all
possible configurations. The goal is to minimize the energy for a
particular image by adjusting the weights and biases, while raising the
energy for other images.
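The standard RBM energy and the resulting hidden-unit activation probabilities can be sketched as follows. The layer sizes, random weights, and zero biases here are toy assumptions for illustration, not a trained model; note that the weight matrix has exactly hidden-nodes times visible-nodes entries, as stated above.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy RBM: 3 visible units, 2 hidden units, small random weights.
n_vis, n_hid = 3, 2
random.seed(0)
W = [[random.uniform(-0.1, 0.1) for _ in range(n_hid)] for _ in range(n_vis)]
a = [0.0] * n_vis  # visible biases
b = [0.0] * n_hid  # hidden biases

def energy(v, h):
    # E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i W_ij h_j
    e = -sum(ai * vi for ai, vi in zip(a, v))
    e -= sum(bj * hj for bj, hj in zip(b, h))
    e -= sum(v[i] * W[i][j] * h[j]
             for i in range(n_vis) for j in range(n_hid))
    return e

def hidden_probs(v):
    # p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ij):
    # the quantity sampled in each contrastive-divergence step.
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(n_vis)))
            for j in range(n_hid)]

v = [1, 0, 1]
print(energy(v, [1, 1]))  # energy of one joint configuration
print(hidden_probs(v))    # activation probabilities of the hidden units
```

Since the joint probability is proportional to exp(-E(v, h)), lowering the energy of a training image's configurations raises its probability relative to all others.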
The number of hidden layers can be increased to deepen the model for
better representations. For example, a deep belief net (DBN) is built
from stacked RBM units. Furthermore, changing the depth, connections,
and directions of communication has resulted in many other
architectures, such as the deep Boltzmann machine (DBM), the shape
Boltzmann machine (SBM), and the deep energy model (DEM). A two-layer
sparse RBM was implemented over a raw natural image dataset. The
filters learned in the first layer were observed to be local, oriented,
and edge-detecting. The first and second layers had 400 and 200 filters,
respectively. A snapshot of some of the learned filters from the first
layer is shown in Figure 4.