So I know that backpropagation uses the gradients and passes them back through the neural network to update the weights. But how exactly are the weights updated for the layers in the middle? Do the non-output layers use the same gradients that the output layer uses, or do they have different gradients to update their weights?
In each layer, your neural network produces the following intermediary calculations over and over again:
z[l] = a[l] * w[l] + b[l]
a[l+1] = f[l](z[l])
where a[0] is your original input to the neural network. z[l] is the weighted sum of the previous activation a[l] (remember, a[0] is the input) and the weights of the l-th layer, w[l], plus the bias vector b[l]. The activation of the next layer, a[l+1], is then computed by applying that layer's activation function f[l] (remember, each layer can have a different activation function) to the previously computed z[l]. This is your forward pass. You repeat these steps for as many layers as you have.
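As a minimal sketch of that loop (assuming `w`, `b` and `f` are simply Python lists holding the weight matrices, bias vectors and activation functions, an arrangement chosen here just for illustration):

```python
import numpy as np

def forward(x, w, b, f):
    """Forward pass: returns all activations a[0..L] and pre-activations z[0..L-1]."""
    a, z = [x], []
    for l in range(len(w)):
        z.append(a[l] @ w[l] + b[l])   # z[l] = a[l] * w[l] + b[l]
        a.append(f[l](z[l]))           # a[l+1] = f[l](z[l])
    return a, z
```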
For propagating the errors backwards, you have to differentiate the cost function with respect to each of the weight matrices starting from the last towards the first:
dw[l], dw[l-1], ..., dw[1], dw[0]
Let's take a toy example. You have a neural network with a single hidden layer and an output layer, thus you have a[0] for your inputs, z[0] as the weighted sum on the hidden layer, a[1] as the activation on the hidden layer, z[1] as the weighted sum on the output layer, and a[2] as the network's output, i.e. its guess for the input. You also have w[0] for the weights on the hidden layer, and w[1] for the weights on the output layer. Finally, of course, there is b[0] for the hidden and b[1] for the output biases.
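For concreteness, this toy network could be set up like so (layer sizes and the sigmoid activation are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 3                          # arbitrary example sizes

w = [rng.standard_normal((n_in, n_hid)) * 0.1,        # w[0]: hidden-layer weights
     rng.standard_normal((n_hid, n_out)) * 0.1]       # w[1]: output-layer weights
b = [np.zeros(n_hid), np.zeros(n_out)]                # b[0], b[1]: biases

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
f = [sigmoid, sigmoid]                                # f[0], f[1]: one activation per layer
df = [lambda x: sigmoid(x) * (1 - sigmoid(x))] * 2    # df[l]: derivative of f[l], used below
```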
Now, to update your weights you have to find:
dE/dw[1]
dE/dw[0]
The first being the gradient for the weights on the output layer, the second for the weights on the hidden layer.
dE/dw[1] := dE/da[2] * da[2]/dz[1] * dz[1]/dw[1]
If E := np.mean(np.sum(.5 * (a[2] - Y)**2, axis=1), axis=0)
then:
dE/da[2] = (a[2] - Y)
For da[2]/dz[1], remember that a[2] = f[1](z[1]), so the derivative is that of the activation function itself, written df[1]:
da[2]/dz[1] = df[1](z[1])
Finally:
dz[1]/dw[1] = a[1]
dE/dw[1] = (a[2] - Y) * df[1](z[1]) @ a[1]
Where * is an element-wise multiplication, and @ is the standard, well-known matrix multiplication. Now, there are different approaches to initialize the weight matrices and organize the input matrix for mini-batch gradient descent, so the above still requires some work. Usually you have to transpose a[1] and/or multiply a[1].T with the rest, but that is the gist of the calculation.
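As a sketch in the conventions of the earlier snippets (batch along the first axis, so the transpose mentioned above shows up explicitly; `a` and `z` come from the forward pass, `Y` holds the targets):

```python
d1 = (a[2] - Y) * df[1](z[1])   # output-layer delta: dE/da[2] * da[2]/dz[1]
dE_dw1 = a[1].T @ d1            # dz[1]/dw[1] = a[1]; result has the same shape as w[1]
```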
Now, for the hidden layer, things just go on:
dE/dw[0] := dE/da[2] * da[2]/dz[1] * dz[1]/da[1] * da[1]/dz[0] * dz[0]/dw[0]
Where dE/da[2] * da[2]/dz[1] is the same term as before; it is called d[1], a.k.a. the delta for the output layer.
dz[1]/da[1] = w[1]
da[1]/dz[0] = df[0](z[0])
dz[0]/dw[0] = a[0]
dE/dw[0] = d[1] @ w[1] * df[0](z[0]) * a[0]
Yet again, w[1] and a[0] might need to be transposed and such, depending on how you design your network, but the calculation is there.
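Continuing the same sketch, the hidden-layer gradient would be:

```python
d0 = (d1 @ w[1].T) * df[0](z[0])   # hidden-layer delta: propagate d1 back through w[1]
dE_dw0 = a[0].T @ d0               # dz[0]/dw[0] = a[0]; result has the same shape as w[0]
```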
Long story short, it's just the application of the chain rule over and over again. Given a layer i, you'll have some delta d[i+1] coming from the next layer; from it you compute dw[i] for the current layer and the delta d[i] that you pass on to the previous, (i-1)-th layer:
d[i] = d[i+1] @ w[i+1] * df[i](z[i])
dw[i] = d[i] @ a[i]
And repeat this for all layers in your network from last to first. I hope that clears it up.
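To tie it all together, here's a minimal, self-contained sketch of that backward loop (squared-error loss and sigmoid activations chosen purely for illustration; bias gradients are left out since only the weight updates were derived above):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1 - sigmoid(x))

def backprop(X, Y, w, b, f, df):
    """Returns dE/dw for every layer, ordered first to last."""
    # forward pass, keeping every a[l] and z[l]
    a, z = [X], []
    for l in range(len(w)):
        z.append(a[l] @ w[l] + b[l])
        a.append(f[l](z[l]))

    # output-layer delta: dE/da[-1] * da[-1]/dz[-1]
    d = (a[-1] - Y) * df[-1](z[-1])
    dw = [None] * len(w)
    for i in reversed(range(len(w))):
        dw[i] = a[i].T @ d                          # dw[i] = d[i] @ a[i] (transposed, batch-first)
        if i > 0:
            d = (d @ w[i].T) * df[i - 1](z[i - 1])  # d[i-1] = d[i] @ w[i] * df[i-1](z[i-1])
    return dw
```

Each weight matrix is then updated with plain gradient descent, e.g. w[l] -= learning_rate * dw[l].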