python, python-3.x, neural-network, backpropagation

Creating a neural network


I am creating a very simple neural network in Python. What I am asking for, though, is not any specific code but the general idea of how it works. I understand the inputs, weights and so on, and everything in forward propagation. What I do not understand is the back propagation. It compares the output to the desired output and calculates the error (the difference), but how does it change all of the weights to be correct? In particular, how can the weights end up different from each other (rather than all being the same)?

Secondly, when you are changing the weights, how do you make it work for multiple inputs rather than just one input or the other?

Finally, what does the bias do and how do you decide what it is? I have heard it is added to the node it is connected to, but in the scenario of 1 input, 1 output and 1 bias connected to the output:

Input is 0
Weight between input and output is -17.2
Bias is -1.79
Output is 0.9999999692839459

But how? 0 x -17.2 - 1.79 = -1.79??? That is not 1?

Thanks for your help everyone :)

Edit: Please do not give me links to other sources (e.g. not on stack overflow) because a good answer would help me and anyone reading this more. Thanks!


Solution

  • Take a look at linear regression trained by gradient descent. The goal of linear regression is to find a line (for the case of R^1), i.e. a linear function f(x), that minimizes the least-squares difference between a given sample A of pairs {(x_1, y_1), ..., (x_n, y_n)} and the linear function f(x).

    By definition, the function of a line is given by f(x) = m*x + b, where m is the slope and b the intercept with the y-axis. The cost function, which represents the squared difference between the function and the sample, is c(X,Y) = 1/(2n) * Sum_{i=1}^{n} (f(x_i) - y_i)^2, where X and Y are the vectors of the x_i and y_i from the sample A.
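    To make this concrete, here is a minimal sketch of that cost function in plain Python (the names f and cost and the use of plain lists are just my illustration, not something from the question or answer):

    # Sketch: least-squares cost for a line f(x) = m*x + b
    def f(x, m, b):
        return m * x + b

    def cost(xs, ys, m, b):
        # c(X, Y) = 1/(2n) * Sum_i (f(x_i) - y_i)^2
        n = len(xs)
        return sum((f(x, m, b) - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)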

    So how do we achieve this?

    Well, it's an unconstrained optimization problem because we want to minimize c(X,Y) over all entries in the sample A. Oh, that's actually the same as with a neural network but the function f(x) is more complex with neural nets.

    The algorithm we use to solve this optimization problem is gradient descent which is defined as

    x_{t+1} = x_t - alpha * c'(x_t)
    

    So the value of a parameter x at time t+1 is the value of this parameter at time t minus some value alpha > 0, which is often called the step size (or learning rate), times the partial derivative of c(X,Y) with respect to x.
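    As a toy illustration of that update rule (my own example, not part of the original answer), consider minimizing g(x) = (x - 3)^2, whose derivative is g'(x) = 2*(x - 3):

    # Toy example: gradient descent on g(x) = (x - 3)^2
    x = 0.0        # starting value of the parameter
    alpha = 0.1    # step size

    for t in range(100):
        grad = 2 * (x - 3)      # g'(x_t)
        x = x - alpha * grad    # x_{t+1} = x_t - alpha * g'(x_t)
        if abs(grad) < 1e-8:    # stop once the gradient is (almost) zero
            break

    print(x)  # approaches the minimum at x = 3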

    In case of linear regression, our parameters are m and b. In case of a neural net, the parameters are the weights. That's because we want to learn a function that satisfies our goal of minimizing the squared difference between our function's output and the targets in the training data.

    For the intuition: the gradient, the vector of partial derivatives of a function, always points in the direction of the steepest ascent of the function's surface. But since we want to minimize the function, we go in the negative direction of the gradient. That way we step downward, scaled by the value of alpha, updating the values of our parameters with every step until we reach a minimum. This is reached when the gradient c'(X,Y) is equal, or almost equal, to zero.

    So in our example, we derive the partial derivatives of c(X,Y) with respect to m and b and write a few lines of code to get it running, for example as sketched below.
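    A minimal sketch of what those few lines might look like, assuming plain Python without any libraries (my own illustration); the two partial derivatives follow directly from c(X,Y) = 1/(2n) * Sum_{i=1}^{n} (m*x_i + b - y_i)^2:

    # Sketch: linear regression trained by gradient descent
    def gradient_descent(xs, ys, alpha=0.01, steps=10000):
        m, b = 0.0, 0.0                                       # initial parameters
        n = len(xs)
        for _ in range(steps):
            errors = [m * x + b - y for x, y in zip(xs, ys)]
            dm = sum(e * x for e, x in zip(errors, xs)) / n   # dc/dm
            db = sum(errors) / n                              # dc/db
            m -= alpha * dm                                   # gradient descent update
            b -= alpha * db
        return m, b

    # Usage with made-up data that roughly follows y = 2x + 1:
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.1, 2.9, 5.2, 6.8, 9.1]
    print(gradient_descent(xs, ys))  # roughly (2.0, 1.0)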

    As I said before, it's the same with the training/learning of a neural net. But with neural nets, we have a cascade of dependent parameters. For example, the gradient for the weights of a hidden layer depends on the weights of the output layer during gradient descent. So you'll always have a cascade of partial derivatives. And that's where the chain rule is very useful.
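    To make that cascade concrete, here is a minimal sketch (my own illustration, with arbitrarily chosen weights and a single training example) of one gradient step for a tiny network with one hidden neuron and one output neuron, both with sigmoid activations. Note how the gradient of the hidden weight w1 reuses the output weight w2 via the chain rule:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Tiny network: input -> hidden neuron -> output neuron
    w1, w2 = 0.5, -0.3      # input->hidden weight, hidden->output weight (arbitrary)
    alpha = 0.1             # step size
    x, y = 1.0, 0.0         # one training example (input, target)

    # Forward pass
    h = sigmoid(w1 * x)     # hidden activation
    o = sigmoid(w2 * h)     # output
    c = 0.5 * (o - y) ** 2  # squared-error cost

    # Backward pass (chain rule)
    dc_do = o - y                    # dc/do
    do_dz2 = o * (1 - o)             # sigmoid derivative at the output
    dc_dw2 = dc_do * do_dz2 * h      # dc/dw2
    # The hidden weight's gradient depends on w2: the error is propagated back.
    dc_dw1 = dc_do * do_dz2 * w2 * h * (1 - h) * x

    # Gradient descent update
    w2 -= alpha * dc_dw2
    w1 -= alpha * dc_dw1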

    Another difference between linear regression and neural nets is that the function c(X,Y) is non-convex with neural nets, whereas it is convex with linear regression. That's because of the properties of the underlying functions f(x). When a function is convex, a local minimum is always a global one. That's why, with a neural net, you can never tell whether you have found an optimal solution.