I'm new to MATLAB and machine learning, and I tried to write a gradient descent function without using matrix operations.
The function gradientDescentMulti takes five arguments: X, y, theta, alpha, and num_iters.
I already have a working solution that uses matrix multiplication:
function theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
for iter = 1:num_iters
    gradJ = (1/m) * (X'*X*theta - X'*y); % gradient of the cost function
    theta = theta - alpha * gradJ;       % update all parameters at once
end
end
The result after the iterations:
theta =
1.0e+05 *
3.3430
1.0009
0.0367
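For context, a minimal call might look like the sketch below (the data is a placeholder, not my real dataset; X is assumed to already include a leading column of ones for the intercept):
m = 47;                       % illustrative number of examples
X = [ones(m, 1), rand(m, 2)]; % placeholder feature matrix with bias column
y = rand(m, 1);               % placeholder target vector
theta = zeros(3, 1);          % parameters start as a column vector of zeros
theta = gradientDescentMulti(X, y, theta, 0.01, 400);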
But now I tried to do the same thing without matrix multiplication. This is the function:
function theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
m = length(y);  % number of training examples
n = size(X, 2); % number of features
for iter = 1:num_iters
    new_theta = zeros(1, n);
    %// for each feature, find the new theta
    for t = 1:n
        S = 0;
        for example = 1:m
            h = 0;
            %// compute the hypothesis for this example
            for example_feature = 1:n
                h = h + (theta(example_feature) * X(example, example_feature));
            end
            S = S + ((h - y(example)) * X(example, n)); %// accumulate the error term over the examples
        end
        new_theta(t) = theta(t) - alpha * (1/m) * S; %// calculate the new theta for this feature
    end
    %// only at the end of the iteration, update all thetas simultaneously
    theta = new_theta'; %// transpose new_theta (row vector) to theta (column vector)
end
end
The result: all the thetas are the same :/
theta =
1.0e+04 *
3.5374
3.5374
3.5374
The actual culprit is this line in your code: S = S + ((h - y(example)) * X(example, n)); — for every parameter t you multiply by X(example, n), the last feature of the example, instead of X(example, t). Because every gradient component then uses the same feature column, all of your thetas converge to the same value; changing n to t fixes it. Beyond that, if you look at the gradient update rule, it is more efficient to compute the hypothesis of all of your training examples first, then subtract the ground truth value of each training example from it and store these differences in an array or vector. Once you do this, computing the update rule is very easy, and it doesn't appear that your code does this.
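For reference, this is the update rule being implemented, in standard notation for multivariate linear regression (matching the code's m examples and n features):

$$\theta_t \leftarrow \theta_t - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}_t, \qquad h_\theta(x^{(i)}) = \sum_{k=1}^{n}\theta_k\,x^{(i)}_k$$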
As such, I rewrote the code, using a separate array that stores the difference between the hypothesis of each training example and its ground truth value. Once I have these differences, I compute the update rule for each feature separately:
for iter = 1 : num_iters
    %// Compute hypothesis differences with ground truth first
    h = zeros(1, m);
    for t = 1 : m
        %// Compute hypothesis
        for tt = 1 : n
            h(t) = h(t) + theta(tt)*X(t,tt);
        end
        %// Compute difference between hypothesis and ground truth
        h(t) = h(t) - y(t);
    end
    %// Now update parameters
    new_theta = zeros(1, n);
    %// for each feature, find the new theta
    for tt = 1 : n
        S = 0;
        %// For each sample, compute products of hypothesis difference
        %// and the right feature of the sample and accumulate
        for t = 1 : m
            S = S + h(t)*X(t,tt);
        end
        %// Compute gradient descent step
        new_theta(tt) = theta(tt) - (alpha/m)*S;
    end
    theta = new_theta'; %// Transpose new_theta (row vector) to theta (column vector)
end
When I do this, I get the same answers as using the matrix formulation.
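To sanity-check this yourself, you can run both implementations on the same data and compare the outputs (the two function names below are hypothetical, one per implementation, and the data is again a placeholder):
m = 50;
X = [ones(m, 1), rand(m, 2)];   % placeholder design matrix with bias column
y = rand(m, 1);                 % placeholder targets
theta0 = zeros(3, 1);
t_loops  = gradientDescentMultiLoops(X, y, theta0, 0.1, 400);   % loop version above
t_matrix = gradientDescentMultiMatrix(X, y, theta0, 0.1, 400);  % matrix version above
fprintf('max abs difference: %g\n', max(abs(t_loops - t_matrix)));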