Tags: neural-network, backpropagation

Confused by the notation (a and z) and usage of the backpropagation equations used in neural network gradient descent training


I’m writing a neural network, but I have trouble training it using backpropagation, so I suspect there is a bug/mathematical mistake somewhere in my code. I’ve spent hours reading different literature on how the equations of backpropagation should look, but I’m a bit confused by it, since different books say different things, or at least use wildly confusing and contradictory notation. So, I was hoping that someone who knows with 100% certainty how it works could clear it up for me.

There are two steps in backpropagation that confuse me. Let’s assume for simplicity that I only have a three-layer feed-forward net, so we have connections between input-hidden and hidden-output. I call the weighted sum that reaches a node z, and the same value after it has passed through the node’s activation function a.
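
To make the notation concrete, this is roughly what my forward pass does for a single node (a minimal NumPy sketch; the sigmoid and the example values are just placeholders):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # w: incoming weights of the node, inputs: the values feeding into it (example values)
    w = np.array([0.2, -0.5, 0.1])
    inputs = np.array([1.0, 0.3, -0.7])

    z = w @ inputs    # the weighted sum that reaches the node
    a = sigmoid(z)    # the same value after the activation function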

Apparently I’m not allowed to embed an image with the equations that my question concerns, so I will have to link it like this: https://i.sstatic.net/CvyyK.gif

Now, during backpropagation, when calculating the error in the nodes of the output layer, is it (writing f' for the derivative of the activation function):

[Eq. 1] Delta_output = (output - target) * f'(a_output)

Or is it

[Eq. 2] Delta_output = (output - target) * f'(z_output)

And during the error calculation of the nodes in the hidden layer, same thing, is it:

[Eq. 3] Delta_hidden = f'(a_h) * sum(w_h * Delta_output)

Or is it

[Eq. 4] Delta_hidden = f'(z_h) * sum(w_h * Delta_output)

So the question is basically: when running a node's value through the derivative of the activation function during backpropagation, should that value be the one from before or after it has passed through the activation function (z or a)?

Is the first or the second equation in the image correct, and similarly, is the third or the fourth equation correct?

Thanks.


Solution

  • You have to compute the derivatives with the values from before they have passed through the activation function. So the answer is "z".

    Some activation functions simplify the computation of the derivative, like tanh:

    a = tanh(z)
    
    d/dz tanh(z) = 1.0 - tanh(z) * tanh(z) = 1.0 - a * a
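
    In code, the identity means you can recover the derivative from the cached activation instead of re-evaluating tanh (a small NumPy sketch; the array values are just examples):

        import numpy as np

        z = np.array([0.5, -1.2, 2.0])   # pre-activation values (examples)
        a = np.tanh(z)                   # activations, cached during the forward pass

        d_from_z = 1.0 - np.tanh(z) ** 2   # derivative evaluated on z
        d_from_a = 1.0 - a ** 2            # same result, recovered from a
        assert np.allclose(d_from_z, d_from_a)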
    

    This simplification can lead to the confusion you were talking about, but here is another activation function where there is no possible confusion:

    a = sin(z)
    d/dz sin(z) = cos(z)
    

    You can find a list of activation functions and their derivatives on Wikipedia: activation function.

    Some networks don't have an activation function on the output nodes. In that case the derivative is 1.0, and delta_output = output - target or delta_output = target - output, depending on whether you add or subtract the weight change.
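
    In code, the linear-output case just drops the derivative factor (a tiny sketch; the output and target values are made up):

        import numpy as np

        a_output = np.array([0.3, -0.1])   # network output (illustrative values)
        target = np.array([1.0, 0.0])      # corresponding targets

        # No activation on the output nodes: the derivative is 1.0,
        # so the delta is a plain difference
        delta_output = a_output - target   # or target - a_output, if you add the weight change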

    If you are using an activation function on the output nodes, then you'll have to give targets that are in the range of the activation function, e.g. [-1, 1] for tanh(z).
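
    Putting it all together, here is a minimal NumPy sketch of the delta computations for a three-layer net, with the derivative always evaluated on z (the weight names, dimensions, and the choice of tanh are illustrative assumptions, not your code):

        import numpy as np

        rng = np.random.default_rng(0)

        # Illustrative dimensions and random weights
        n_in, n_hidden, n_out = 3, 4, 2
        w_input_hidden = rng.normal(size=(n_in, n_hidden))
        w_hidden_output = rng.normal(size=(n_hidden, n_out))
        x = rng.normal(size=n_in)     # one input sample
        target = np.zeros(n_out)      # its target, inside tanh's range [-1, 1]

        def tanh_prime(z):
            # Derivative of tanh, evaluated on the pre-activation value z
            return 1.0 - np.tanh(z) ** 2

        # Forward pass: z is the weighted sum, a is z after the activation
        z_hidden = w_input_hidden.T @ x
        a_hidden = np.tanh(z_hidden)
        z_output = w_hidden_output.T @ a_hidden
        a_output = np.tanh(z_output)

        # Backward pass: the derivative takes z, not a
        delta_output = (a_output - target) * tanh_prime(z_output)
        delta_hidden = tanh_prime(z_hidden) * (w_hidden_output @ delta_output)

        # Weight gradients, to be subtracted (times a learning rate) from the weights
        grad_hidden_output = np.outer(a_hidden, delta_output)
        grad_input_hidden = np.outer(x, delta_hidden)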