The gradient descent algorithm is given as:

repeat until convergence {
$\quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\quad \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
}
(taken from Andrew Ng's Coursera course). How should this algorithm be implemented if there are more than 2 theta parameters (feature weights)?
Should an extra theta value be included,
$\quad \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)},$
and the updates repeated until convergence, in other words, until $\theta_0$, $\theta_1$, $\theta_2$ no longer change?
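For concreteness, here is a minimal Python/NumPy sketch of that element-wise update generalized to any number of thetas, assuming linear regression and a design matrix `X` whose first column is all ones (the function name and the `alpha`/`num_iters` defaults are illustrative, not from the course):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """X: (m, n+1) design matrix with a leading column of ones; y: (m,) targets."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        errors = X @ theta - y  # h_theta(x^(i)) - y^(i) for every example
        # Compute every partial derivative first, then update all theta_j together
        # (the simultaneous update the course implements with temp variables).
        gradient = np.array([(errors * X[:, j]).mean() for j in range(n)])
        theta = theta - alpha * gradient
    return theta
```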
Maybe convert theta to matrix notation; then the whole update is
$$\theta := \theta - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)},$$
or, fully vectorized, $\theta := \theta - \frac{\alpha}{m} X^{T} \left( h_\theta(X) - y \right)$.
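A rough NumPy sketch of that vectorized update, assuming linear regression so that $h_\theta(X) = X\theta$ (again, the name and defaults are illustrative):

```python
import numpy as np

def gradient_descent_vectorized(X, y, alpha=0.01, num_iters=1000):
    """Repeat theta := theta - (alpha / m) * X^T (X @ theta - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # One matrix expression updates every theta_j simultaneously.
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta
```

Each iteration updates the whole theta vector in a single matrix expression, so no per-parameter bookkeeping is needed.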
Andrew Ng's notation is meant to make things clear to those less comfortable with matrix notation, which I doubt includes yourself.
The matrix formulation, a single equation instead of many separate ones, may be clearer than the individual per-parameter equations in the OP. The single matrix formulation shows that the update is effectively an atomic operation across all vectors in the design matrix; it is the responsibility of the underlying linear algebra library to make that "happen".
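For example, reusing the vectorized sketch from the question above, here is a fit with three features plus an intercept (four thetas) in which the library applies the whole update at once (the data and numbers are made up for illustration):

```python
# Continues the gradient_descent_vectorized sketch above; the data here is synthetic.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])  # intercept column + 3 features
y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=100)
print(gradient_descent_vectorized(X, y, alpha=0.1, num_iters=5000))
# Every theta_j moves in each iteration through the single X.T @ (...) expression.
```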