I'm trying to implement backpropagation with ReLU as the activation function. If I am not mistaken, the derivative of that function is 1 for x > 0 and 0 for x < 0. Using this derivative, the network does not learn at all. Searching for other examples, I found that most ignore the "1 for x > 0" part and just leave the incoming value unchanged there, which leads to much better results. I wonder why that is the case.
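To be concrete, this is the definition I am working from (a minimal sketch; the helper names are just for illustration):

def relu(x):
    # ReLU forward pass: max(0, x)
    return x if x > 0 else 0.0

def relu_derivative(x):
    # derivative of ReLU: 1 for x > 0, 0 for x < 0
    # (the value at exactly 0 is conventionally taken as 0)
    return 1.0 if x > 0 else 0.0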
To make sure there are no other mistakes, here is the code for training a network with 1 input neuron, 1 output neuron, and no hidden layer. I use the mean squared error as the error function:
import random

# a single weight mapping 1 input to 1 output, no hidden layer
x = random.uniform(0, 1)
y = random.uniform(0, 1)
w = random.uniform(0, 1)
lr = 0.1

for i in range(500):
    # forward pass: linear layer followed by ReLU
    z = x * w
    yP = z
    if yP < 0:
        yP = 0

    # mean squared error (a single sample, so just the squared error)
    loss = (yP - y) ** 2
    print(i, loss)

    # backward pass
    grad_y = 2.0 * (yP - y)      # dLoss/dyP
    if z < 0:
        grad_z = 0               # ReLU derivative is 0 for z < 0
    else:
        grad_z = grad_y          # ReLU derivative is 1 for z > 0
    grad_w = grad_z * x          # dLoss/dw
    w -= lr * grad_w
Please note that it is unlikely to have anything to do with the size of the network; I also tested a network with 1000 input neurons, 1 hidden layer with 100 neurons, and 10 output neurons, using a batch size of 64 and 500 epochs. It had the same problem.
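For reference, here is roughly what that larger test looked like, written as a self-contained NumPy sketch (the layer sizes match the description above, but the random data, initialisation, and learning rate are placeholders, not my exact code):

import numpy as np

rng = np.random.default_rng(0)

# illustrative sizes: 1000 inputs, 100 hidden neurons, 10 outputs, batch size 64
n_in, n_hidden, n_out, batch = 1000, 100, 10, 64
lr = 0.01

W1 = rng.normal(0, np.sqrt(2.0 / n_in), (n_in, n_hidden))
W2 = rng.normal(0, np.sqrt(2.0 / n_hidden), (n_hidden, n_out))

# a single random batch reused every epoch, just as a stand-in for real data
X = rng.uniform(0, 1, (batch, n_in))
Y = rng.uniform(0, 1, (batch, n_out))

for epoch in range(500):
    # forward pass
    z1 = X @ W1             # pre-activation of the hidden layer
    h1 = np.maximum(z1, 0)  # ReLU
    yP = h1 @ W2            # linear output layer
    loss = np.mean((yP - Y) ** 2)

    # backward pass
    grad_yP = 2.0 * (yP - Y) / Y.size   # dLoss/dyP for the mean squared error
    grad_W2 = h1.T @ grad_yP            # dLoss/dW2
    grad_h1 = grad_yP @ W2.T            # dLoss/dh1
    grad_z1 = grad_h1 * (z1 > 0)        # chain rule: multiply by the ReLU derivative (0 or 1)
    grad_W1 = X.T @ grad_z1             # dLoss/dW1

    W1 -= lr * grad_W1
    W2 -= lr * grad_W2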
I just realised what a silly mistake I made. According to the chain rule, grad_y should be multiplied by the derivative of ReLU at the pre-activation (z in the code above), which is 0 or 1. This is of course equivalent to just setting the gradient to 0 where the derivative is 0 and leaving it unchanged where it is 1.
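In other words, the backward step in the loop above can equivalently be written with the multiplication spelled out (a sketch of just that step, reusing the variables from the small example):

relu_deriv = 1.0 if z > 0 else 0.0   # derivative of ReLU at the pre-activation z
grad_z = grad_y * relu_deriv         # chain rule; same effect as zeroing grad_z when z < 0
grad_w = grad_z * x
w -= lr * grad_w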