The code for the partial derivatives of the mean squared error is:
w_grad = -(2 / n_samples) * X.T.dot(y_true - y_pred)
b_grad = -(2 / n_samples) * np.sum(y_true - y_pred)
where n_samples is n (the number of samples), y_true is the vector of observations, and y_pred is the vector of predictions.
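To make the two lines concrete, here is a small self-contained check (the toy data, shapes, and random values are made up purely for illustration) showing that they compute the same thing as the explicit per-sample sums:

```python
import numpy as np

# Hypothetical toy data, only so the two lines from the question can run.
rng = np.random.default_rng(0)
n_samples, n_features = 5, 3
X = rng.normal(size=(n_samples, n_features))
y_true = rng.normal(size=n_samples)
y_pred = rng.normal(size=n_samples)

# The vectorized gradients from the question.
w_grad = -(2 / n_samples) * X.T.dot(y_true - y_pred)
b_grad = -(2 / n_samples) * np.sum(y_true - y_pred)

# The same gradients written as explicit sums over the samples.
w_grad_loop = -(2 / n_samples) * sum(X[i] * (y_true[i] - y_pred[i]) for i in range(n_samples))
b_grad_loop = -(2 / n_samples) * sum(y_true[i] - y_pred[i] for i in range(n_samples))

print(np.allclose(w_grad, w_grad_loop))  # True: X.T.dot(...) already sums over the samples
print(np.isclose(b_grad, b_grad_loop))   # True
```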
My question is: why is np.sum used in the code for the gradient of b (b_grad), but not in the code for w_grad?
The original equations are:
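That is, with $n$ for n_samples, $y_i$ for the observations, $\hat{y}_i$ for the predictions, and $x_i$ for the $i$-th row of X, the code corresponds to the mean squared error and its partial derivatives:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

$$\frac{\partial\,\mathrm{MSE}}{\partial W} = -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - \hat{y}_i\right),
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial B} = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)$$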
If you have ten features, then you have ten Ws and ten Bs, and the total number of variables is twenty. But we can sum all of the B_i into a single variable, so the total number of variables becomes 10 + 1 = 11. This is done by adding one more dimension and fixing the last x to be 1. The calculation then becomes:
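In code, the same trick looks like this (a minimal numpy sketch; the array names, shapes, and random data are only illustrative):

```python
import numpy as np

# Toy data: 10 features, so w has 10 entries and b is a single scalar.
rng = np.random.default_rng(0)
n_samples, n_features = 5, 10
X = rng.normal(size=(n_samples, n_features))
y_true = rng.normal(size=n_samples)
w = rng.normal(size=n_features)
b = 0.5

# Add one more dimension whose value is fixed to 1, and fold b into the weight vector.
X_aug = np.hstack([X, np.ones((n_samples, 1))])   # shape (n_samples, 11)
w_aug = np.append(w, b)                           # shape (11,)

y_pred = X_aug.dot(w_aug)                         # identical to X.dot(w) + b

# A single gradient expression now covers all 11 variables.
grad = -(2 / n_samples) * X_aug.T.dot(y_true - y_pred)
w_grad, b_grad = grad[:-1], grad[-1]

# The last component is the dot product of the residuals with the column of ones,
# which is exactly the np.sum(y_true - y_pred) in the question.
print(np.isclose(b_grad, -(2 / n_samples) * np.sum(y_true - y_pred)))  # True
```

So the np.sum in b_grad is just the w_grad formula applied to a "feature" column that is constantly 1.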