machine-learning · neural-network · perceptron

Perceptron training rule, why multiply by x


I was reading Tom Mitchell's machine learning book, and he gives the perceptron training rule as

w_i ← w_i + Δw_i

where

Δw_i = η (t - o) x_i

  • η : training rate
  • t : expected (target) output
  • o : actual output
  • x_i : ith input

This implies that if x_i is very large then so is Δw_i, but I don't understand the purpose of a large update when x_i is large.

On the contrary, I feel that if x_i is large then the update should be small, since a small fluctuation in w_i will already result in a big change in the final output (due to the large x_i).
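
For concreteness, here is a small numeric sketch of the rule (the learning rate and the input values are made up): a large x_i produces a proportionally large update.

```python
# Sketch of Δw_i = η (t - o) x_i with made-up numbers.
eta = 0.1          # training rate
t, o = 1, 0        # target 1, actual output 0 -> misclassified example

for w_i, x_i in [(0.2, 0.5), (0.2, 50.0)]:   # same weight, small vs. large input
    delta_w_i = eta * (t - o) * x_i
    print(f"x_i={x_i:5.1f}  delta_w_i={delta_w_i:5.2f}  new w_i={w_i + delta_w_i:5.2f}")
# the second update is 100x larger only because x_i is 100x larger
```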


Solution

  • The adjustments are vector additions and subtractions, which can be thought of as rotating a hyperplane so that class 0 falls on one side and class 1 falls on the other side.

    Consider a 1xd weight vector w holding the weights of the perceptron model, and a 1xd datapoint x. Then the predicted value of the perceptron model, using a linear threshold at zero without loss of generality, will be

    o = 1 if w · x > 0, else 0 -- Eq. 1

    Here '·' is the dot product, i.e.

    w · x = w_1 x_1 + w_2 x_2 + … + w_d x_d

    The hyperplane corresponding to the above equation is

    w · x = 0

    (Ignoring the iteration indices for the weight updates for simplicity)
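
    As a minimal sketch of Eq. 1 (the weights and the datapoint below are made up, and the bias is left out):

    ```python
    import numpy as np

    # Eq. 1: predict class 1 when w . x > 0 and class 0 otherwise.
    w = np.array([0.4, -0.3, 0.1])   # 1xd weight vector (made up)
    x = np.array([1.0,  2.0, 0.5])   # 1xd datapoint (made up)

    score = np.dot(w, x)             # w . x = w_1*x_1 + w_2*x_2 + ... + w_d*x_d
    prediction = 1 if score > 0 else 0
    print(score, prediction)         # score is about -0.15, so the predicted class is 0
    ```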

    Let us consider that we have two classes, 0 and 1. Again without loss of generality, datapoints labelled 0 fall on the side of the hyperplane where w · x ≤ 0, and datapoints labelled 1 fall on the other side, where w · x > 0.

    The vector normal to this hyperplane is w. The angle between w and the datapoints with label 0 should be more than 90 degrees, and the angle between w and the datapoints with label 1 should be less than 90 degrees.
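
    Since w · x = |w| |x| cos(angle between w and x), the sign of w · x and the angle test above carry the same information. A small sketch with made-up vectors:

    ```python
    import numpy as np

    # The sign of w . x matches the angle test: w . x > 0 exactly when the angle
    # between w and x is below 90 degrees (w . x = |w| |x| cos(angle)).
    w = np.array([1.0, 2.0])
    for x in (np.array([2.0, 1.0]), np.array([1.0, -3.0])):
        cos = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
        angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
        print(f"w.x = {np.dot(w, x):5.1f}   angle = {angle:5.1f} deg")
    ```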

    There are three possibilities for (t - o) (ignoring the training rate):

    • (t - o) = 0: this example is classified correctly by the present set of weights, so no change is needed for this datapoint.
    • (t - o) = 1: the target was 1, but the present set of weights classified it as 0. That means w · x ≤ 0 when it was supposed to be > 0, which indicates that the angle between w and x is greater than 90 degrees, when it should have been less. The update rule is w ← w + x. If you imagine a vector addition in 2d, this rotates the hyperplane so that the angle between w and x becomes smaller than before and eventually less than 90 degrees (see the sketch after this list).
    • (t - o) = -1: the target was 0, but the present set of weights classified it as 1. That means w · x > 0 when it was supposed to be ≤ 0, which indicates that the angle between w and x is less than 90 degrees, when it should have been greater. The update rule is w ← w - x. Similarly, this rotates the hyperplane so that the angle between w and x becomes greater than 90 degrees.
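
    Here is a small sketch of the second case (made-up weights and datapoint): x has target 1 but is misclassified, and adding x to w brings the angle between them below 90 degrees.

    ```python
    import numpy as np

    # Sketch of the update w ← w + (t - o) x for a misclassified datapoint with target 1.
    def angle_deg(u, v):
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    w = np.array([-1.0, 0.5])                 # current weights (made up)
    x = np.array([1.0, 1.0])                  # datapoint with target label t = 1
    t, o = 1, int(np.dot(w, x) > 0)           # o = 0 here, so the point is misclassified

    print("before:", round(angle_deg(w, x), 1), "deg, predicted", o)
    w = w + (t - o) * x                       # vector addition rotates w toward x
    print("after: ", round(angle_deg(w, x), 1), "deg, predicted", int(np.dot(w, x) > 0))
    ```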

    This is iterated over and over, and the hyperplane is rotated and adjusted so that its normal makes an angle of less than 90 degrees with the datapoints labelled 1 and an angle of greater than 90 degrees with the datapoints labelled 0.

    If the magnitude of x is huge there will be big changes, which can disturb the process and take more iterations to converge, depending on the magnitude of the initial weights. Therefore it is a good idea to normalise or standardise the datapoints. From this perspective it is easy to visualise exactly what the update rules are doing (consider the bias as part of the hyperplane by folding it into Eq. 1). Now extend this to more complicated networks and/or networks with other thresholds.
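
    Putting the pieces together, here is a compact training loop following the rule above; the toy dataset and the learning rate are made up, and the inputs are standardised as suggested:

    ```python
    import numpy as np

    # A compact perceptron training loop following the rule above.
    # The toy dataset (linearly separable) and the learning rate are made up.
    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
    targets = np.array([1, 1, 0, 0])

    # Standardising the inputs keeps the size of each update comparable.
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    eta = 0.1
    w = np.zeros(X.shape[1])
    b = 0.0                                   # bias, kept separate here for clarity

    for epoch in range(100):
        errors = 0
        for x, t in zip(X, targets):
            o = 1 if np.dot(w, x) + b > 0 else 0
            w += eta * (t - o) * x            # the training rule from the question
            b += eta * (t - o)
            errors += int(t != o)
        if errors == 0:                       # every point classified correctly
            break

    print("weights:", w, "bias:", b, "epochs used:", epoch + 1)
    ```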

    Recommended reading and reference: Neural Networks: A Systematic Introduction by Raúl Rojas, Chapter 4.