I'm studying backpropagation, but I don't understand why we need to divide `dW_curr` by `m`. Every implementation I see does this, but why? Is the division only needed when using cross-entropy as the loss function, or for every loss function?
The following code is from https://towardsdatascience.com/lets-code-a-neural-network-in-plain-numpy-ae7e74410795.
```python
import numpy as np

def single_layer_backward_propagation(dA_curr, W_curr, b_curr, Z_curr, A_prev, activation="relu"):
    # m = number of samples (A_prev has one column per sample)
    m = A_prev.shape[1]

    # use ==, not `is`, for string comparison
    if activation == "relu":
        backward_activation_func = relu_backward
    elif activation == "sigmoid":
        backward_activation_func = sigmoid_backward
    else:
        raise Exception('Non-supported activation function')

    dZ_curr = backward_activation_func(dA_curr, Z_curr)   # dL/dZ for this layer
    dW_curr = np.dot(dZ_curr, A_prev.T) / m               # dL/dW, averaged over samples
    db_curr = np.sum(dZ_curr, axis=1, keepdims=True) / m  # dL/db, averaged over samples
    dA_prev = np.dot(W_curr.T, dZ_curr)                   # gradient passed back to the previous layer

    return dA_prev, dW_curr, db_curr
```
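The tutorial defines `relu_backward` and `sigmoid_backward` elsewhere; a plausible sketch of what they compute (the upstream gradient `dA` times the activation's derivative at `Z`) might look like this:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def relu_backward(dA, Z):
    # ReLU derivative: 1 where Z > 0, else 0
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0.0
    return dZ

def sigmoid_backward(dA, Z):
    # Sigmoid derivative: s * (1 - s)
    s = sigmoid(Z)
    return dA * s * (1.0 - s)
```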
If your goal is to understand, I would highly recommend not following engineering tutorials but rather a mathematical derivation, e.g. the one in Simon Haykin's book "Neural Networks and Learning Machines".
To your specific question: the loss is defined as an expectation over your samples, and the empirical estimate of an expectation is just an average. Differentiation is a linear operator, so the derivative of a mean is the mean of the derivatives. Your "divide by m" is nothing but that mean of the corresponding per-sample derivatives.
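In symbols: the empirical loss over $m$ samples with per-sample losses $\ell_i$ is

$$
L(W) = \frac{1}{m}\sum_{i=1}^{m} \ell_i(W),
\qquad
\frac{\partial L}{\partial W} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial \ell_i}{\partial W},
$$

so the $1/m$ in `dW_curr` is exactly the $\frac{1}{m}$ carried through from the loss by linearity of the derivative.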
It does not matter whether your loss is cross-entropy or L2; what matters is the aggregation. If you try to minimise an expected value, you will almost surely end up taking averages (i.e. dividing by the number of samples) everywhere, unless you use something other than a Monte Carlo estimator, which is rare.
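You can verify this numerically: the vectorised `np.dot(dZ, A_prev.T) / m` from the tutorial code is exactly the mean of the per-sample gradients, each of which is an outer product. A small check (shapes and seed chosen arbitrarily for illustration):

```python
import numpy as np

# Toy shapes: 3 output units, 4 input units, m = 5 samples.
rng = np.random.default_rng(0)
m = 5
dZ = rng.standard_normal((3, m))      # upstream gradient, one column per sample
A_prev = rng.standard_normal((4, m))  # activations from the previous layer

# Vectorised form used in the tutorial: one matmul, then divide by m.
dW_vectorised = np.dot(dZ, A_prev.T) / m

# Explicit form: per-sample gradient is an outer product; average them.
dW_mean = np.mean(
    [np.outer(dZ[:, i], A_prev[:, i]) for i in range(m)], axis=0
)

print(np.allclose(dW_vectorised, dW_mean))  # True
```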