I'm trying to understand/run the code in Michael Nielsen's Neural Networks and Deep Learning chapter 2, on backpropagation: http://neuralnetworksanddeeplearning.com/chap2.html#the_code_for_backpropagation.
At the start of the backward pass, it has:
delta = self.cost_derivative(activations[-1], y) * \
    sigmoid_prime(zs[-1])
nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())
The forward pass creates the activations list, where activations[i] contains a vector of the activations of the neurons in layer i. So activations[-1] is the last layer, and y is the desired output. cost_derivative is defined as:
def cost_derivative(self, output_activations, y):
    """Return the vector of partial derivatives \partial C_x /
    \partial a for the output activations."""
    return (output_activations-y)
So that first line outputs a vector with the same shape as our output layer. My question is: how is the np.dot on the fourth line supposed to work? My understanding is that activations[-2] is a vector of the activations of the neurons in the second-to-last layer, which can have any number of neurons, so I'm not sure how we can take its dot product (or that of its transpose) with delta, which has the shape of the output layer.
I ran the code (https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network.py) with some added debug lines to try to understand this, and it doesn't seem to work:
>>> from network import *; net = Network([2,1,2])
>>> net.backprop([1,2], [3,4])
Activations[0]
[1, 2]
Activations[1]
[[ 0.33579893]]
Activations[2]
[[ 0.37944698]
[ 0.45005939]]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<snip>/neural-networks-and-deep-learning/src/network.py", line 117, in backprop
nabla_w[-1] = np.dot(delta, activations[-2].transpose())
ValueError: shapes (2,2) and (1,1) not aligned: 2 (dim 1) != 1 (dim 0)
activations looks exactly as I'd expect - 2 activations, then 1, then 2. The failure is on the line I'm unclear about, and it fails as I'd expect. But presumably the code in this book is tested (the book is excellent), and I must be doing something wrong. I was writing an independent implementation and hit the same issue, so I was expecting to be able to take this code apart to figure it out - but I can't figure out how this is supposed to work, or why it works for the author.
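One extra data point: the (2,2) in that error seems to come from numpy broadcasting inside cost_derivative, where my y (shape (2,)) gets subtracted from activations[-1] (shape (2,1)). A quick check outside network.py (my own snippet, not from the book):
>>> import numpy as np
>>> a_out = np.array([[0.37944698], [0.45005939]])  # activations[-1], shape (2, 1)
>>> y = np.array([3, 4])                            # the label I passed in, shape (2,)
>>> (a_out - y).shape                               # broadcasts to 2x2 instead of staying (2, 1)
(2, 2)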
I'd appreciate any insight on what I'm missing here. Thanks! :)
Suppose the network architecture is [...,N,M], that is, the last layer outputs a vector of size M and the one before it a vector of size N (let's focus on the last two layers and ignore the rest). N and M can be arbitrary. Also, let's ignore batching, as in your question: we are feeding exactly one input and one label.
In this case, the last weight matrix, i.e. self.weights[-1], will have shape [M,N], and so must nabla_w[-1] in order to perform the update correctly. Now:
delta will have shape [M,1] (it corresponds to the output).
activations[-2] will have shape [N,1], hence its transpose has shape [1,N].
Their dot product has shape [M,1] * [1,N] -> [M,N], which is exactly what we need.
Why does your test fail, then? Because in numpy the shape (2,) is not the same as [1,2] or [2,1]:
>>> np.array([1, 2]).shape
(2,)
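By contrast, with proper column vectors the dot product produces exactly the [M,N] shape described above. A quick illustration with made-up numbers, taking M = 2 and N = 3:
>>> delta = np.array([[0.1], [0.2]])          # shape (2, 1), i.e. [M, 1]
>>> a_prev = np.array([[0.5], [0.6], [0.7]])  # shape (3, 1), i.e. [N, 1]
>>> np.dot(delta, a_prev.transpose()).shape   # [M, 1] * [1, N] -> [M, N]
(2, 3)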
The network architecture distinguishes the rows and columns of both x and y, and you have to provide both in the correct shape for it to work. Otherwise, you'll get unexpected broadcasting and shape mismatches. Try this example to see it in action:
net = Network([2,1,2])
x = np.array([1, 2]).reshape([2, 1]) # one example of size 2
y = np.array([3, 4]).reshape([2, 1]) # one example of size 2
net.backprop(x, y)
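Continuing from the snippet above (and assuming backprop returns the (nabla_b, nabla_w) pair, as in the linked network.py), the gradient for the last layer now comes out with the same [M,N] = [2,1] shape as self.weights[-1]:
nabla_b, nabla_w = net.backprop(x, y)
print(net.weights[-1].shape)  # (2, 1)
print(nabla_w[-1].shape)      # (2, 1), i.e. [M, N] with M = 2, N = 1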