The code for the partial derivatives of the mean squared error is:
w_grad = -(2 / n_samples) * X.T.dot(y_true - y_pred)
b_grad = -(2 / n_samples) * np.sum(y_true - y_pred)
where n_samples is n (the number of samples), y_true is the vector of observations, and y_pred is the vector of predictions.
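To make the two lines concrete, here is a small self-contained check (the toy data, shapes, and random values are made up purely for illustration) showing that they compute the same thing as the explicit per-sample sums:

```python
import numpy as np

# Hypothetical toy data, only so the two lines from the question can run.
rng = np.random.default_rng(0)
n_samples, n_features = 5, 3
X = rng.normal(size=(n_samples, n_features))
y_true = rng.normal(size=n_samples)
y_pred = rng.normal(size=n_samples)

# The vectorized gradients from the question.
w_grad = -(2 / n_samples) * X.T.dot(y_true - y_pred)
b_grad = -(2 / n_samples) * np.sum(y_true - y_pred)

# The same gradients written as explicit sums over the samples.
w_grad_loop = -(2 / n_samples) * sum(X[i] * (y_true[i] - y_pred[i]) for i in range(n_samples))
b_grad_loop = -(2 / n_samples) * sum(y_true[i] - y_pred[i] for i in range(n_samples))

print(np.allclose(w_grad, w_grad_loop))  # True: X.T.dot(...) already sums over the samples
print(np.isclose(b_grad, b_grad_loop))   # True
```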
My question is: why is np.sum used in the code for the gradient of b (b_grad), but not in the code for w_grad?
The original equations are:
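That is, with $n$ for n_samples, $y_i$ for the observations, $\hat{y}_i$ for the predictions, and $x_i$ for the $i$-th row of X, the code corresponds to the mean squared error and its partial derivatives:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

$$\frac{\partial\,\mathrm{MSE}}{\partial W} = -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - \hat{y}_i\right),
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial B} = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)$$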
If you have ten features, then you have ten Ws and ten Bs, and the total number of variables is twenty. But we can sum all of the B_i into a single variable, so the total number of variables becomes 10 + 1 = 11. This is done by adding one more dimension and fixing the last x to be 1. The calculation then becomes:
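In code, the same trick looks like this (a minimal numpy sketch; the array names, shapes, and random data are only illustrative):

```python
import numpy as np

# Toy data: 10 features, so w has 10 entries and b is a single scalar.
rng = np.random.default_rng(0)
n_samples, n_features = 5, 10
X = rng.normal(size=(n_samples, n_features))
y_true = rng.normal(size=n_samples)
w = rng.normal(size=n_features)
b = 0.5

# Add one more dimension whose value is fixed to 1, and fold b into the weight vector.
X_aug = np.hstack([X, np.ones((n_samples, 1))])   # shape (n_samples, 11)
w_aug = np.append(w, b)                           # shape (11,)

y_pred = X_aug.dot(w_aug)                         # identical to X.dot(w) + b

# A single gradient expression now covers all 11 variables.
grad = -(2 / n_samples) * X_aug.T.dot(y_true - y_pred)
w_grad, b_grad = grad[:-1], grad[-1]

# The last component is the dot product of the residuals with the column of ones,
# which is exactly the np.sum(y_true - y_pred) in the question.
print(np.isclose(b_grad, -(2 / n_samples) * np.sum(y_true - y_pred)))  # True
```

So the np.sum in b_grad is just the w_grad formula applied to a "feature" column that is constantly 1.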