Tags: python, numpy, neural-network, backpropagation

Implementing general back-propagation


I'm trying to implement a back-propagation method for a fully connected layer with an arbitrary activation function. I understand the general idea and the math behind the algorithm, but I'm having difficulties understanding the vectorized form...

I need help understanding the expected dimensions of the elements listed below.

known sizes:

  • Input - self.X is size (N,128)
  • Weights - self.W is size (128,10)
  • Biases - self.b is size (128,10)
  • Output - self.y is size (N,10)
  • Linear output (before activation) - self.z is size (N,10)

unknown sizes, for N = 1 (the number of examples):

  • dy - The gradient of the next layer - what size should it be?
  • dz - The derivative of the activation function - what size should it be?
  • self.d - The gradient of current layer - what size should it be?

Here is my code:

def backward(self, dy):
    if self.activator == 'relu':
        dz = np.zeros((self.z.shape[0], self.z.shape[1]))
        dz[self.z>0] = 1
    elif self.activator == 'sigmoid':
        dz = self.z * (1 - self.z)
    elif self.activator == 'soft-max':
        s = self.z.reshape(-1, 1)
        dz = np.diagflat(s) - np.dot(s, s.T)
    elif self.activator == 'none':
        dz = 1

    self.d = np.dot((dz * dy), self.W.T) # the error of the layer
    self.W_grad = np.dot(self.X.T, dy) # The weight gradient of the layer
    self.b_grad = np.sum(dy, axis=0).reshape(1, -1) # The bias gradient of the layer

Solution

  • A couple of errors:

    • self.b should have size (10,), not (128, 10), as the bias is per-unit, not per-unit-pair.
    • self.W_grad should be np.dot(self.X.T, (dz * dy)), not np.dot(self.X.T, dy). The same goes for self.b_grad - it should be np.sum(dz * dy, axis=0).

    As for the rest:

    dy := dL/dy should be (N, 10), as it contains the gradient of the loss with respect to each element in y.

    dz := df(z)/d(z) should be (N, 10) for an elementwise activation function, since dz[i] contains df(z[i])/dz[i].

    self.d := dL/dX should be (N, 128) because it contains the gradient of the loss with respect to each element in X.
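    Putting the two corrections and the shapes above together, here is a minimal sketch of a corrected backward pass for the elementwise activations (relu, sigmoid, 'none'); the soft-max case is left out because its Jacobian is not elementwise and cannot be folded into a simple dz * dy product. The sketch assumes self.y holds the post-activation output, so the sigmoid derivative is written as self.y * (1 - self.y), i.e. sigmoid(z) * (1 - sigmoid(z)).

    import numpy as np

    def backward(self, dy):                         # dy: (N, 10)
        if self.activator == 'relu':
            # ReLU derivative: 1 where the pre-activation is positive, 0 elsewhere
            dz = (self.z > 0).astype(self.z.dtype)  # (N, 10)
        elif self.activator == 'sigmoid':
            # elementwise sigmoid derivative: y * (1 - y), with y = sigmoid(z)
            dz = self.y * (1 - self.y)              # (N, 10)
        elif self.activator == 'none':
            dz = np.ones_like(self.z)               # (N, 10)

        delta = dz * dy                             # (N, 10), the local error signal
        self.d = np.dot(delta, self.W.T)            # (N, 128), dL/dX for the previous layer
        self.W_grad = np.dot(self.X.T, delta)       # (128, 10), dL/dW
        self.b_grad = np.sum(delta, axis=0)         # (10,), dL/db
        return self.d

    For N = 1 every (N, ...) shape above collapses to a single row: dy and dz are (1, 10), and self.d is (1, 128).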