Tags: machine-learning, octave, gradient-descent

Implementing gradient descent for multiple variables in Octave using "sum"


I'm doing Andrew Ng's course on Machine Learning, and I'm trying to wrap my head around the vectorised implementation of gradient descent for multiple variables, which is an optional exercise in the course.

This is the algorithm in question (taken from here):

theta_j := theta_j - alpha/m * SUM_i (h_theta(x^(i)) - y^(i)) * x_j^(i)    (simultaneously for all j)
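
For reference, here is my sketch of a direct, unvectorised translation of this update into Octave (assuming X already includes the leading column of ones), updating all parameters simultaneously via a temporary:

% Unvectorised reference: one explicit sum per parameter,
% using a temporary so that all theta_j are updated simultaneously
temp = theta;
for j = 1:length(theta)
    temp(j) = theta(j) - alpha/m * sum((X * theta - y) .* X(:, j));
end
theta = temp;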

I just cannot collapse this into a single vectorised statement in Octave using sum, though; I'm not sure how to multiply the sum of the hypothesis of x(i) - y(i) by all the variables xj(i). I tried different variations of the following code, but to no avail (either the dimensions are not right or the answer is wrong):

theta = theta - alpha/m * sum(X * theta - y) * X;

The correct answer, however, is entirely non-obvious (to a linear algebra beginner like me, anyway; taken from here):

theta = theta - (alpha/m * (X * theta - y)' * X)';

Is there a rule of thumb for cases where sum is involved that governs transformations like the above?

And if so, does the rule work in the opposite direction (i.e. going from a sum-based solution to a general multiplication one)? I ask because I was able to come up with a correct implementation using sum for gradient descent for a single variable (albeit not a very elegant one):

% One update step, computed into temporaries so that both
% parameters are updated simultaneously
temp0 = theta(1) - (alpha/m * sum(X * theta - y));
temp1 = theta(2) - (alpha/m * sum((X * theta - y)' * X(:, 2)));

theta(1) = temp0;
theta(2) = temp1;

Please note that this question only concerns vectorised implementations, and although there are several questions on SO about how this is done, mine is primarily concerned with implementing the algorithm in Octave using sum.


Solution

  • The general "rule of thumb" is as follows: if you encounter something of the form

    SUM_i f(x_i, y_i, ...) g(a_i, b_i, ...)
    

    then you can easily vectorize it (and this is what is done above) through

    f(x, y, ...)' * g(a, b, ...)
    

    This is just a typical dot product, which in mathematics (in a finite-dimensional Euclidean space) looks like

    <A, B> = SUM_i A_i B_i = A'B
    

    thus

    (X * theta - y)' * X
    

    is just

    <X * theta - y, X> = <H_theta(X) - y, X> = SUM_i (H_theta(X_i) - y_i) X_i
    

    As you can see, this works both ways, since this is just the mathematical definition of the dot product (both directions are checked numerically in the sketches below).
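
    As a quick numerical check of the identity <A, B> = A'B in Octave (with made-up example vectors A and B):

    A = [1; 2; 3];
    B = [4; 5; 6];
    sum(A .* B)   % SUM_i A_i B_i   -> 32
    A' * B        % dot product A'B -> 32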
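
    And here is a sketch checking that one step of the sum-based update from the question and the fully vectorised one-liner agree; the data (X, y, alpha) is hypothetical toy data, with X carrying a leading column of ones:

    % Hypothetical toy data: m = 4 examples, bias column plus one feature
    X = [1 1; 1 2; 1 3; 1 4];
    y = [2; 3; 4; 5];
    alpha = 0.1;
    m = length(y);

    % Sum-based update (as in the question)
    theta_s = [0; 0];
    temp0 = theta_s(1) - alpha/m * sum(X * theta_s - y);
    temp1 = theta_s(2) - alpha/m * sum((X * theta_s - y)' * X(:, 2));
    theta_s = [temp0; temp1];

    % Vectorised update
    theta_v = [0; 0];
    theta_v = theta_v - (alpha/m * (X * theta_v - y)' * X)';

    disp([theta_s theta_v])   % both columns print 0.3500 and 1.0000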