I was reading Tom Mitchell's machine learning book and he mentioned that the formula for the perceptron training rule is

$$w_i \leftarrow w_i + \Delta w_i$$

where

$$\Delta w_i = \eta\,(t - o)\,x_i$$

This implies that if $x_i$ is very large then so is $\Delta w_i$, but I don't understand the purpose of a large update when $x_i$ is large.

On the contrary, I feel like if there is a large $x_i$ then the update should be small, since a small fluctuation in $w_i$ will result in a big change in the final output (due to $w_i x_i$).
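To make my concern concrete, here is a rough numeric sketch (my own toy numbers, not from the book) of how the size of the update scales with $x_i$:

```python
# Toy numbers (mine, not from the book) illustrating the concern:
# the update Delta w_i = eta * (t - o) * x_i grows with |x_i|.
eta = 0.1          # learning rate
t, o = 1, 0        # target is 1, the perceptron currently outputs 0

for x_i in (0.5, 5.0, 500.0):
    delta_w_i = eta * (t - o) * x_i
    print(f"x_i = {x_i:7.1f}  ->  delta w_i = {delta_w_i:6.1f}")
```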
The adjustments are vector additions and subtractions, which can be thought of as rotating a hyperplane so that class $0$ falls on one side and class $1$ falls on the other side.
Consider a $1 \times d$ weight vector $\mathbf{w}$ holding the weights of the perceptron model. Also, consider a $1 \times d$ datapoint $\mathbf{x}$. Then the predicted value of the perceptron model, considering a linear threshold at $0$ without loss of generality, will be

$$o = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 1)}$$

Here '$\cdot$' is a dot product, i.e. $\mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{d} w_i x_i$.

The hyperplane of the above equation is $\mathbf{w} \cdot \mathbf{x} = 0$.
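As a minimal sketch (made-up vectors, assuming plain NumPy), Eq. 1 is just a thresholded dot product:

```python
import numpy as np

def predict(w: np.ndarray, x: np.ndarray) -> int:
    """Eq. 1: output 1 if w . x > 0, otherwise 0."""
    return 1 if np.dot(w, x) > 0 else 0

# Example with made-up 1 x d vectors (d = 3 here)
w = np.array([0.2, -0.5, 0.1])
x = np.array([1.0, 0.3, 2.0])
print(predict(w, x))  # 1 if x lies on the positive side of the hyperplane w . x = 0
```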
(Ignoring the iteration indices for the weight updates for simplicity)
Let us consider that we have two classes, $0$ and $1$. Again without loss of generality, the datapoints labelled $0$ fall on the side of the hyperplane where $\mathbf{w} \cdot \mathbf{x} \le 0$ (Eq. 1), and the datapoints labelled $1$ fall on the other side, where $\mathbf{w} \cdot \mathbf{x} > 0$.
The vector normal to this hyperplane is $\mathbf{w}$. The angle between $\mathbf{w}$ and the datapoints with label $0$ should be more than $90$ degrees, and the angle between $\mathbf{w}$ and the datapoints with label $1$ should be less than $90$ degrees.
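A small sketch (made-up vectors) showing that the sign of $\mathbf{w} \cdot \mathbf{x}$ corresponds exactly to this angle condition:

```python
import numpy as np

def angle_deg(w: np.ndarray, x: np.ndarray) -> float:
    """Angle between the hyperplane's normal w and a datapoint x, in degrees."""
    cos = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

w = np.array([1.0, 1.0])        # normal of the hyperplane w . x = 0
x_pos = np.array([2.0, 0.5])    # should be labelled 1: w . x > 0, angle < 90
x_neg = np.array([-1.0, -3.0])  # should be labelled 0: w . x <= 0, angle > 90

print(angle_deg(w, x_pos), np.dot(w, x_pos))  # ~31 degrees, positive dot product
print(angle_deg(w, x_neg), np.dot(w, x_neg))  # ~153 degrees, negative dot product
```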
There are three possibilities for the update $\Delta\mathbf{w} = (t - o)\,\mathbf{x}$ (ignoring the training rate):

1. $t = o$: the datapoint is already classified correctly, so $(t - o) = 0$ and the weights are not changed.
2. The true label is $1$, but the present set of weights classified it as $0$. Then $\mathbf{w} \cdot \mathbf{x} \le 0$, i.e. the angle between $\mathbf{w}$ and $\mathbf{x}$ is more than $90$ degrees, which should have been less. The update rule is $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}$, which pulls $\mathbf{w}$ towards $\mathbf{x}$ and reduces the angle towards less than $90$ degrees.
3. The true label is $0$, but the present set of weights classified it as $1$. Then $\mathbf{w} \cdot \mathbf{x} > 0$, i.e. the angle between $\mathbf{w}$ and $\mathbf{x}$ is less than $90$ degrees, which should have been greater. The update rule is $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{x}$, which pushes $\mathbf{w}$ away from $\mathbf{x}$ and increases the angle beyond $90$ degrees.

This is iterated over and over, and the hyperplane is rotated and adjusted so that the angle of the hyperplane's normal is less than $90$ degrees with the datapoints of class labelled $1$ and greater than $90$ degrees with the datapoints of class labelled $0$.
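Putting the three cases together, here is a minimal sketch of the training loop with toy data of my own (not from Mitchell's book); the single update line covers all three possibilities because $(t - o)$ is $0$, $+1$ or $-1$:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, epochs=20):
    """Perceptron training rule: w <- w + eta * (t - o) * x for each datapoint.

    (t - o) is 0 for correct predictions (no update), +1 when a point labelled 1
    was predicted as 0 (w is pulled towards x), and -1 when a point labelled 0
    was predicted as 1 (w is pushed away from x)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, y):
            o = 1 if np.dot(w, x) > 0 else 0
            w = w + eta * (t - o) * x  # rotates the hyperplane's normal
    return w

# Made-up linearly separable toy data; the last column of 1s acts as the bias term.
X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0], [-1.0, -1.5, 1.0], [-2.0, -0.5, 1.0]])
y = np.array([1, 1, 0, 0])
w = train_perceptron(X, y)
print(w, [1 if np.dot(w, x) > 0 else 0 for x in X])  # learned weights and predictions
```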
If the magnitude of $\mathbf{x}$ is huge there will be big changes in the weights, which can cause problems in the process and may mean more iterations to converge, depending on the magnitude of the initial weights. Therefore it is a good idea to normalise or standardise the datapoints. From this perspective it is easy to visualise what exactly the update rules are doing (consider the bias as a part of the hyperplane in Eq. 1). Now extend this to more complicated networks and/or other thresholds.
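For example, a small sketch of both preprocessing options on made-up data:

```python
import numpy as np

# Made-up toy datapoints, one row per datapoint.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])

# Standardise each feature to zero mean and unit variance, so no single
# large-magnitude feature dominates the updates.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Alternative: scale each datapoint to unit length (normalisation), which
# leaves the angle between each datapoint and the hyperplane's normal unchanged.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
```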
Recommended reading and reference: Neural Networks: A Systematic Introduction by Raul Rojas, Chapter 4.