In other words, what is the main reason for switching the bias to a $b_j$ or to an additional $w_{ij} x_i$ in the neuron summation formula before the sigmoid? Performance? Which method is the best and why?
Note: $j$ is a neuron of the current layer and $i$ a neuron of a lower layer.
Note: it makes little sense to ask for the best
method here. Those are two different mathematical notations for exactly the same thing.
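To see that, write the pre-activation of neuron $j$ both ways (a sketch using the usual convention of a constant dummy input $x_0 = 1$):

$$z_j \;=\; \sum_{i=1}^{n} w_{ij}\,x_i + b_j \;=\; \sum_{i=0}^{n} w_{ij}\,x_i \qquad \text{with } x_0 = 1,\ w_{0j} = b_j.$$

Both expressions feed the same value into the sigmoid, so the network computes exactly the same function.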
However, fitting the bias as just another weight allows you to rewrite the sum as a scalar product of an observed feature vector $x_d$ with the weight vector $w$.
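As a rough illustration of that trick (the names `x_d`, `w`, and `b` below are placeholders, not from any particular library), prepending a constant 1 to the input absorbs the bias into the weight vector and turns "sum plus bias" into a single dot product:

```python
import numpy as np

# Separate bias: z = w . x + b
x_d = np.array([0.5, -1.2, 3.0])      # observed feature vector
w   = np.array([0.1,  0.4, -0.2])     # weights for those features
b   = 0.7                             # separate bias term
z_separate = np.dot(w, x_d) + b

# Bias absorbed as an extra weight: prepend a constant 1 to the input
x_aug = np.concatenate(([1.0], x_d))  # x_0 = 1
w_aug = np.concatenate(([b], w))      # w_0 = b
z_absorbed = np.dot(w_aug, x_aug)

assert np.isclose(z_separate, z_absorbed)  # identical pre-activation
```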
Have you tried to calculate the derivative w.r.t. $w$ in order to get the optimal $w$ according to least squares? You will notice that this calculation becomes much cleaner in vectorized notation.
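As a sketch (standard least-squares algebra, with the bias folded into $w$ via a column of ones in the design matrix $X$):

$$\frac{\partial}{\partial w}\,\lVert Xw - y\rVert^2 \;=\; 2\,X^\top (Xw - y) \;\overset{!}{=}\; 0 \quad\Longrightarrow\quad w \;=\; (X^\top X)^{-1} X^\top y.$$

The optimal bias is simply the component of $w$ that multiplies the constant column; with a separate $b$, the same derivation splits into two coupled conditions and gets messier.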
Apart from that: in many high-level programming languages, vectorized calculations are significantly more efficient than the non-vectorized equivalent. So performance is also a point in favour of the vectorized form, at least in some languages.
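A small sketch of that performance point in NumPy (timings vary by machine; the function names here are just illustrative):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100_000)
w = rng.random(100_000)

def loop_dot():
    # explicit Python loop over the summation
    total = 0.0
    for wi, xi in zip(w, x):
        total += wi * xi
    return total

def vectorized_dot():
    # the same sum expressed as a single vectorized call
    return np.dot(w, x)

print("loop:      ", timeit.timeit(loop_dot, number=10))
print("vectorized:", timeit.timeit(vectorized_dot, number=10))
```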