2 | FILTER INITIALIZATION
Filter initialization is a
crucial factor that, in some cases, affects the final accuracy of
feed-forward networks (FNNs) more than the learning algorithm itself.
Smaller initial weights produce smaller gradients, which slows down
training, whereas larger initial weights can saturate the activations or
make training unstable. Proper weight initialization is therefore
essential to keep the output of every layer from exploding or vanishing
through the activations, and it is widely regarded as a critical factor
for convergence speed and the ability to converge at all. When CNNs were
first proposed, it was common to initialize the weights with Gaussian
noise of zero mean and a standard deviation of 0.01. Over the years,
other initialization techniques have been proposed to prevent exploding
gradients, vanishing gradients, and dead neurons. Optimal filter
initialization remains an open question because of its relation to the
training sample labels, the architecture, the objective function, and
the desired outcome of the algorithm. Three scenarios are commonly used
to initialize the weights.
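As a minimal illustration, the classical Gaussian scheme mentioned above can be written in a few lines of PyTorch; the layer shape below is arbitrary and chosen only for the example.

import torch.nn as nn

# An arbitrary convolutional layer: 16 filters, 3 input channels, 5x5 kernels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

# Classical scheme: Gaussian noise with zero mean and standard deviation 0.01.
nn.init.normal_(conv.weight, mean=0.0, std=0.01)
nn.init.zeros_(conv.bias)

print(conv.weight.std())  # roughly 0.01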
2.1 | Random initialization
In random initialization, the weights are drawn at random from values
near 0, usually following a normal or uniform distribution. The main
issue with random initialization is inconsistency: depending on the
scale of the initial values, training can evolve very differently over
the epochs and produce different outcomes. Values that are too small
lead to slow learning, a tendency to get stuck in local minima, and a
possible vanishing gradient problem. On the other hand, values that are
too large can saturate the neurons' outputs and cause exploding
gradients, which result in oscillation around the optimum or outright
instability.
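The effect of the weight scale can be seen in a short simulated forward pass. The sketch below is only illustrative: the depth, width, and the two standard deviations are arbitrary choices, and a tanh non-linearity stands in for the sigmoidal activation.

import torch

def output_std(std, depth=20, width=256):
    # Push a random batch through `depth` tanh layers whose weights are
    # drawn from N(0, std^2) and report the spread of the final activations.
    x = torch.randn(1024, width)
    for _ in range(depth):
        w = torch.randn(width, width) * std
        x = torch.tanh(x @ w)
    return x.std().item()

print(output_std(0.001))  # activations shrink toward zero (vanishing signal)
print(output_std(1.0))    # activations pile up near +/-1 (saturation)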
For deep networks, several refinements have been proposed. LeCun
proposed a uniform distribution between -2.4/Fi and 2.4/Fi, where Fi is
the number of input nodes (fan-in) of the unit. The aim is for the
initial standard deviation of the weighted sum to lie in the same range
for all nodes and to fall within the operating region of the sigmoidal
activation. On the other hand, this ties the initialization to a
specific activation function, and it is only applicable when all
connections sharing the same weights belong to nodes with identical Fi.
The variance can be expressed in the general form k/n, where k is a
constant that depends on the activation function and n is either the
number of input nodes of the weight tensor or the combined number of
input and output nodes of the weight tensor. Other widely used examples
of random initialization are Xavier/Glorot and He initialization, which
use a normal distribution with mean zero and variance 1/n and 2/n,
respectively; in some cases, uniform distributions are used instead.
Xavier initialization is simple and keeps the variance of the
activations the same across every layer, but it is not applicable when
the activation function is non-differentiable. He initialization
overcomes this limitation and is widely used with the ReLU activation
function, which is non-differentiable at zero. LeCun, Xavier, and He
initialization do not eliminate vanishing or exploding gradients, but
they mitigate the problem to a large extent.
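For concreteness, the sketch below applies these schemes with PyTorch's built-in initializers; the layer shapes are arbitrary, and the LeCun-style bound is computed by hand from the layer's fan-in.

import torch.nn as nn

# Arbitrary convolutional layers, used only to illustrate the schemes.
conv_tanh = nn.Conv2d(3, 16, kernel_size=3)   # to be followed by a sigmoidal unit
conv_relu = nn.Conv2d(3, 16, kernel_size=3)   # to be followed by ReLU
conv_lecun = nn.Conv2d(3, 16, kernel_size=3)  # LeCun-style uniform

# Xavier/Glorot: zero-mean normal with variance based on fan-in and fan-out.
nn.init.xavier_normal_(conv_tanh.weight)

# He/Kaiming: zero-mean normal with variance 2/n (n = fan-in), suited to ReLU.
nn.init.kaiming_normal_(conv_relu.weight, nonlinearity='relu')

# LeCun-style uniform in [-2.4/Fi, 2.4/Fi], where Fi is the layer's fan-in.
fan_in = conv_lecun.in_channels * conv_lecun.kernel_size[0] * conv_lecun.kernel_size[1]
nn.init.uniform_(conv_lecun.weight, -2.4 / fan_in, 2.4 / fan_in)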
2.2 | Zero or constant initialization
In this type of initialization, all the weights are set either to zero
or to a constant value (usually 1). Because all the weights are
identical, every node produces the same activation, which makes the
hidden layers symmetric. In the supervised setting, the derivative of
the loss function is then the same for all the nodes in a filter, so the
nodes learn identical features. Distance-based clustering methods also
do not benefit from this initialization, since a constant value would
merely mirror the input values.
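The symmetry problem can be demonstrated with a toy layer: with constant weights, every unit receives exactly the same gradient, so the units never differentiate. The two-unit linear layer and the scalar loss below are made up for the example.

import torch
import torch.nn as nn

# Toy layer with two output units, all weights set to the same constant.
layer = nn.Linear(4, 2, bias=False)
nn.init.constant_(layer.weight, 1.0)

x = torch.randn(8, 4)   # arbitrary input batch
loss = layer(x).sum()   # arbitrary scalar loss
loss.backward()

# Both rows of the gradient are identical, so both units get the same update
# and keep computing the same function: the symmetry is never broken.
print(layer.weight.grad)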
2.3 | Pre-trained initialization
Compared with the above initializations, pre-trained initialization is a
relatively new approach. Pre-trained weights are used in two ways. The
first is transfer learning, in which trained weights borrowed from a
pre-trained model are used as the initial state before training begins
on the current task; this knowledge transfer accelerates learning and
generalization. Erhan et al. supported this claim with comprehensive
experiments comparing existing algorithms initialized with pre-trained
weights against the traditional approach. The second is the
student-teacher method, in which a large network is trained extensively
and its optimized weights are transferred to a comparable lightweight
architecture for lighter applications.
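As a rough sketch of the transfer-learning variant, the snippet below borrows ImageNet-trained weights from torchvision's ResNet-18 and re-uses them as the initial state for a new task; the 10-class target task is invented for the example, and the weights argument assumes a recent torchvision release.

import torch.nn as nn
from torchvision import models

# Borrow ImageNet-trained weights as the initial state (transfer learning).
model = models.resnet18(weights="DEFAULT")

# Optionally freeze the transferred filters and train only the new head.
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier for the (hypothetical) 10-class target task;
# only this newly added layer starts from random initialization.
model.fc = nn.Linear(model.fc.in_features, 10)

# The model is now ready to be fine-tuned on the new dataset.

The student-teacher variant proceeds in a similar spirit, except that the copied weights come from a larger network trained extensively in-house rather than from a publicly available pre-trained model.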