I was reading Tom Mitchell's machine learning book and he mentioned that the formula for the perceptron training rule is

$$w_i \leftarrow w_i + \Delta w_i$$

where

$$\Delta w_i = \eta\,(t - o)\,x_i$$

This implies that if $x_i$ is very large then so is $\Delta w_i$, but I don't understand the purpose of a large update when $x_i$ is large.

On the contrary, I feel like if there is a large $x_i$ then the update should be small, since a small fluctuation in $w_i$ will result in a big change in the final output (due to $w_i x_i$).
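To make my concern concrete, here is a rough numeric sketch (my own toy numbers, not from the book) of how the size of the update scales with $x_i$:

```python
# Toy numbers (mine, not from the book) illustrating the concern:
# the update Delta w_i = eta * (t - o) * x_i grows with |x_i|.
eta = 0.1          # learning rate
t, o = 1, 0        # target is 1, the perceptron currently outputs 0

for x_i in (0.5, 5.0, 500.0):
    delta_w_i = eta * (t - o) * x_i
    print(f"x_i = {x_i:7.1f}  ->  delta w_i = {delta_w_i:6.1f}")
```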
The adjustments are vector additions and subtractions, which can be thought of as rotating a hyperplane so that class $0$ falls on one side and class $1$ falls on the other side.
Consider a $1 \times d$ weight vector $\mathbf{w}$ holding the weights of the perceptron model. Also, consider a $1 \times d$ datapoint $\mathbf{x}$. Then the predicted value of the perceptron model, considering a linear threshold at $0$ without loss of generality, will be

$$o = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 1)}$$

Here '$\cdot$' is a dot product, i.e. $\mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{d} w_i x_i$.

The hyperplane of the above equation is $\mathbf{w} \cdot \mathbf{x} = 0$.
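As a minimal sketch (made-up vectors, assuming plain NumPy), Eq. 1 is just a thresholded dot product:

```python
import numpy as np

def predict(w: np.ndarray, x: np.ndarray) -> int:
    """Eq. 1: output 1 if w . x > 0, otherwise 0."""
    return 1 if np.dot(w, x) > 0 else 0

# Example with made-up 1 x d vectors (d = 3 here)
w = np.array([0.2, -0.5, 0.1])
x = np.array([1.0, 0.3, 2.0])
print(predict(w, x))  # 1 if x lies on the positive side of the hyperplane w . x = 0
```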
(Ignoring the iteration indices for the weight updates for simplicity)
Let us consider that we have two classes, $0$ and $1$. Again without loss of generality, the datapoints labelled $0$ fall on the side of the hyperplane where $\mathbf{w} \cdot \mathbf{x} \le 0$ (Eq. 1), and the datapoints labelled $1$ fall on the other side, where $\mathbf{w} \cdot \mathbf{x} > 0$.
The vector normal to this hyperplane is $\mathbf{w}$. The angle between $\mathbf{w}$ and the datapoints with label $0$ should be more than $90$ degrees, and the angle between $\mathbf{w}$ and the datapoints with label $1$ should be less than $90$ degrees.
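A small sketch (made-up vectors) showing that the sign of $\mathbf{w} \cdot \mathbf{x}$ corresponds exactly to this angle condition:

```python
import numpy as np

def angle_deg(w: np.ndarray, x: np.ndarray) -> float:
    """Angle between the hyperplane's normal w and a datapoint x, in degrees."""
    cos = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

w = np.array([1.0, 1.0])        # normal of the hyperplane w . x = 0
x_pos = np.array([2.0, 0.5])    # should be labelled 1: w . x > 0, angle < 90
x_neg = np.array([-1.0, -3.0])  # should be labelled 0: w . x <= 0, angle > 90

print(angle_deg(w, x_pos), np.dot(w, x_pos))  # ~31 degrees, positive dot product
print(angle_deg(w, x_neg), np.dot(w, x_neg))  # ~153 degrees, negative dot product
```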
There are three possibilities for the update $\Delta\mathbf{w} = (t - o)\,\mathbf{x}$ (ignoring the training rate):

1. $t = o$: the datapoint is already classified correctly, so $(t - o) = 0$ and the weights are not changed.
2. The true label is $1$, but the present set of weights classified it as $0$. Then $\mathbf{w} \cdot \mathbf{x} \le 0$, i.e. the angle between $\mathbf{w}$ and $\mathbf{x}$ is more than $90$ degrees, which should have been less. The update rule is $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}$, which pulls $\mathbf{w}$ towards $\mathbf{x}$ and reduces the angle towards less than $90$ degrees.
3. The true label is $0$, but the present set of weights classified it as $1$. Then $\mathbf{w} \cdot \mathbf{x} > 0$, i.e. the angle between $\mathbf{w}$ and $\mathbf{x}$ is less than $90$ degrees, which should have been greater. The update rule is $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{x}$, which pushes $\mathbf{w}$ away from $\mathbf{x}$ and increases the angle beyond $90$ degrees.

This is iterated over and over, and the hyperplane is rotated and adjusted so that the angle of the hyperplane's normal is less than $90$ degrees with the datapoints of class labelled $1$ and greater than $90$ degrees with the datapoints of class labelled $0$.
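Putting the three cases together, here is a minimal sketch of the training loop with toy data of my own (not from Mitchell's book); the single update line covers all three possibilities because $(t - o)$ is $0$, $+1$ or $-1$:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, epochs=20):
    """Perceptron training rule: w <- w + eta * (t - o) * x for each datapoint.

    (t - o) is 0 for correct predictions (no update), +1 when a point labelled 1
    was predicted as 0 (w is pulled towards x), and -1 when a point labelled 0
    was predicted as 1 (w is pushed away from x)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, y):
            o = 1 if np.dot(w, x) > 0 else 0
            w = w + eta * (t - o) * x  # rotates the hyperplane's normal
    return w

# Made-up linearly separable toy data; the last column of 1s acts as the bias term.
X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0], [-1.0, -1.5, 1.0], [-2.0, -0.5, 1.0]])
y = np.array([1, 1, 0, 0])
w = train_perceptron(X, y)
print(w, [1 if np.dot(w, x) > 0 else 0 for x in X])  # learned weights and predictions
```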
If the magnitude of $\mathbf{x}$ is huge there will be big changes in the weights, which can cause problems in the process and may mean more iterations to converge, depending on the magnitude of the initial weights. Therefore it is a good idea to normalise or standardise the datapoints. From this perspective it is easy to visualise what exactly the update rules are doing (consider the bias as a part of the hyperplane in Eq. 1). Now extend this to more complicated networks and/or other thresholds.
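For example, a small sketch of both preprocessing options on made-up data:

```python
import numpy as np

# Made-up toy datapoints, one row per datapoint.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])

# Standardise each feature to zero mean and unit variance, so no single
# large-magnitude feature dominates the updates.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Alternative: scale each datapoint to unit length (normalisation), which
# leaves the angle between each datapoint and the hyperplane's normal unchanged.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
```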
Recommended reading and reference: Neural Networks: A Systematic Introduction by Raul Rojas, Chapter 4.