I'm trying to implement backpropagation with ReLU as the activation function. If I am not mistaken, the derivative of that function is 1 for x > 0 and 0 for x < 0. Using this derivative, the network does not learn at all. Searching for other examples, I found that most ignore the "1 for x > 0" part and just leave the incoming value unchanged there, which leads to much better results. I wonder why that is the case.
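To be concrete, this is the definition I am working from (a minimal sketch; the helper names are just for illustration):

def relu(x):
    # ReLU forward pass: max(0, x)
    return x if x > 0 else 0.0

def relu_derivative(x):
    # derivative of ReLU: 1 for x > 0, 0 for x < 0
    # (the value at exactly 0 is conventionally taken as 0)
    return 1.0 if x > 0 else 0.0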
To make sure there are no other mistakes, here is the code for training a network with 1 input neuron, 1 output neuron, and no hidden layer. I use the mean squared error as the error function:
import random

# a single weight mapping 1 input to 1 output, no hidden layer
x = random.uniform(0, 1)
y = random.uniform(0, 1)
w = random.uniform(0, 1)
lr = 0.1

for i in range(500):
    # forward pass: linear layer followed by ReLU
    z = x * w
    yP = z
    if yP < 0:
        yP = 0

    # mean squared error (a single sample, so just the squared error)
    loss = (yP - y) ** 2
    print(i, loss)

    # backward pass
    grad_y = 2.0 * (yP - y)      # dLoss/dyP
    if z < 0:
        grad_z = 0               # ReLU derivative is 0 for z < 0
    else:
        grad_z = grad_y          # ReLU derivative is 1 for z > 0
    grad_w = grad_z * x          # dLoss/dw
    w -= lr * grad_w
Please note that it is unlikely to have anything to do with the size of the network; I also tested a network with 1000 input neurons, 1 hidden layer with 100 neurons, and 10 output neurons, using a batch size of 64 and 500 epochs. It had the same problem.
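For reference, here is roughly what that larger test looked like, written as a self-contained NumPy sketch (the layer sizes match the description above, but the random data, initialisation, and learning rate are placeholders, not my exact code):

import numpy as np

rng = np.random.default_rng(0)

# illustrative sizes: 1000 inputs, 100 hidden neurons, 10 outputs, batch size 64
n_in, n_hidden, n_out, batch = 1000, 100, 10, 64
lr = 0.01

W1 = rng.normal(0, np.sqrt(2.0 / n_in), (n_in, n_hidden))
W2 = rng.normal(0, np.sqrt(2.0 / n_hidden), (n_hidden, n_out))

# a single random batch reused every epoch, just as a stand-in for real data
X = rng.uniform(0, 1, (batch, n_in))
Y = rng.uniform(0, 1, (batch, n_out))

for epoch in range(500):
    # forward pass
    z1 = X @ W1             # pre-activation of the hidden layer
    h1 = np.maximum(z1, 0)  # ReLU
    yP = h1 @ W2            # linear output layer
    loss = np.mean((yP - Y) ** 2)

    # backward pass
    grad_yP = 2.0 * (yP - Y) / Y.size   # dLoss/dyP for the mean squared error
    grad_W2 = h1.T @ grad_yP            # dLoss/dW2
    grad_h1 = grad_yP @ W2.T            # dLoss/dh1
    grad_z1 = grad_h1 * (z1 > 0)        # chain rule: multiply by the ReLU derivative (0 or 1)
    grad_W1 = X.T @ grad_z1             # dLoss/dW1

    W1 -= lr * grad_W1
    W2 -= lr * grad_W2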
I just realised what a silly mistake I made. According to the chain rule, grad_y should be multiplied by the derivative of ReLU at the pre-activation (z in the code above), which is 0 or 1. This is of course equivalent to just setting the gradient to 0 where the derivative is 0 and leaving it unchanged where it is 1.
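In other words, the backward step in the loop above can equivalently be written with the multiplication spelled out (a sketch of just that step, reusing the variables from the small example):

relu_deriv = 1.0 if z > 0 else 0.0   # derivative of ReLU at the pre-activation z
grad_z = grad_y * relu_deriv         # chain rule; same effect as zeroing grad_z when z < 0
grad_w = grad_z * x
w -= lr * grad_w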