I'm trying to implement a back-propagation method for a fully connected layer with an arbitrary activation function. I understand the general idea and the math behind the algorithm, but I'm having difficulty understanding the vectorized form.

I need help understanding the expected dimensions of each element.

Known sizes: X is (N, 128), W is (128, 10), b is (128, 10).
Unknown sizes (for N = 1, the number of examples): dy, dz, self.d, self.W_grad, and self.b_grad.
Here is my code:
```python
def backward(self, dy):
    if self.activator == 'relu':
        dz = np.zeros((self.z.shape[0], self.z.shape[1]))
        dz[self.z > 0] = 1
    elif self.activator == 'sigmoid':
        dz = self.z * (1 - self.z)
    elif self.activator == 'soft-max':
        s = self.z.reshape(-1, 1)
        dz = np.diagflat(s) - np.dot(s, s.T)
    elif self.activator == 'none':
        dz = 1

    self.d = np.dot((dz * dy), self.W.T)             # the error of the layer
    self.W_grad = np.dot(self.X.T, dy)               # the weight gradient of the layer
    self.b_grad = np.sum(dy, axis=0).reshape(1, -1)  # the bias gradient of the layer
```
A couple of errors:

`self.b` should have size `(10,)`, not `(128, 10)`, as the bias is per unit, not per unit pair.

`self.W_grad` should be `np.dot(self.X.T, (dz * dy))`, not `np.dot(self.X.T, dy)`. The same goes for `self.b_grad`: it should be `np.sum(dz * dy, axis=0)`.
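Here is a minimal, self-contained sketch with those fixes applied. The `Dense` class name, the constructor, and the forward pass are assumptions added for illustration, not your actual code, and the soft-max branch is omitted to keep the example short:

```python
import numpy as np

class Dense:
    """Hypothetical fully connected layer, used only to illustrate the fixes."""

    def __init__(self, in_dim=128, out_dim=10, activator='relu'):
        self.W = np.random.randn(in_dim, out_dim) * 0.01   # (128, 10)
        self.b = np.zeros(out_dim)                          # (10,), one bias per unit
        self.activator = activator

    def forward(self, X):
        self.X = X                            # (N, 128)
        a = X @ self.W + self.b               # pre-activation, (N, 10)
        if self.activator == 'relu':
            self.z = np.maximum(0, a)
        elif self.activator == 'sigmoid':
            self.z = 1.0 / (1.0 + np.exp(-a))
        else:                                 # 'none'
            self.z = a
        return self.z                         # (N, 10)

    def backward(self, dy):                   # dy = dL/dy, (N, 10)
        if self.activator == 'relu':
            dz = (self.z > 0).astype(dy.dtype)   # (N, 10)
        elif self.activator == 'sigmoid':
            dz = self.z * (1 - self.z)           # (N, 10)
        else:                                    # 'none'
            dz = np.ones_like(dy)

        delta = dz * dy                          # dL/d(pre-activation), (N, 10)
        self.d = delta @ self.W.T                # dL/dX, (N, 128)
        self.W_grad = self.X.T @ delta           # dL/dW, (128, 10)
        self.b_grad = np.sum(delta, axis=0)      # dL/db, (10,)
        return self.d
```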
As for the rest:

`dy := dL/dy` should be `(N, 10)`, as it contains the gradient of the loss with respect to each element of `y`.

`dz := df(z)/dz` should be `(N, 10)` for an elementwise activation function, since `dz[i]` contains `df(z[i])/dz[i]`.

`self.d := dL/dX` should be `(N, 128)`, because it contains the gradient of the loss with respect to each element of `X`.
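To sanity-check those shapes, you can push a dummy batch through the sketch above and print everything (`Dense` is the hypothetical class from the previous snippet, and N = 4 is arbitrary):

```python
X = np.random.randn(4, 128)            # N = 4 examples, 128 input features
layer = Dense(128, 10, activator='relu')

y = layer.forward(X)                   # (4, 10)
dy = np.random.randn(*y.shape)         # stand-in for dL/dy from the loss / next layer

layer.backward(dy)
print(y.shape)                         # (4, 10)
print(layer.d.shape)                   # (4, 128)  -> dL/dX
print(layer.W_grad.shape)              # (128, 10) -> dL/dW
print(layer.b_grad.shape)              # (10,)     -> dL/db
```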