machine-learning octave linear-regression gradient-descent feature-scaling

Linear Regression - Implementing Feature Scaling

I am trying to implement linear regression in Octave 5.1.0 on a data set relating GRE score to the probability of admission. The data set looks like this:

    337 0.92
    324 0.76
    316 0.72
    322 0.8
    ...

My main Program.m file looks like this:

    % read the data
    data = load('Admission_Predict.txt');

    % initiate variables
    x = data(:,1);
    y = data(:,2);
    m = length(y);
    theta = zeros(2,1);
    alpha = 0.01;
    iters = 1500;
    J_hist = zeros(iters,1);

    % plot data
    subplot(1,2,1);
    plot(x,y,'rx','MarkerSize', 10);
    title('training data');

    % compute cost function
    x = [ones(m,1), (data(:,1) ./ 300)]; % feature scaling
    J = computeCost(x,y,theta);

    % run gradient descent
    [theta, J_hist] = gradientDescent(x,y,theta,alpha,iters);

    hold on;
    subplot(1,2,1);
    plot((x(:,2) .* 300), (x*theta),'-');
    xlabel('GRE score');
    ylabel('Probability');
    hold off;

    subplot(1,2,2);
    plot(1:iters, J_hist, '-b');
    xlabel('no: of iteration');
    ylabel('Cost function');

computeCost.m looks like this:

    function J = computeCost(x,y,theta)
      m = length(y);
      h = x * theta;
      J = (1/(2*m))*sum((h-y) .^ 2);
    endfunction

and gradientDescent.m looks like this:

    function [theta, J_hist] = gradientDescent(x,y,theta,alpha,iters)
      m = length(y);
      J_hist = zeros(iters,1);

      for i=1:iters
        diff = (x*theta - y);
        theta = theta - (alpha * (1/(m))) * (x' * diff);
        J_hist(i) = computeCost(x,y,theta);
      endfor

    endfunction

The resulting plots look like this:

Graphs

As you can see, the fitted line doesn't look right, even though my cost function seems to be minimized.

Can someone please tell me if this is right? If not, what am I doing wrong?

Solution

The easiest way to check whether your implementation is correct is to compare it against a validated implementation of linear regression. I suggest using an alternative implementation approach like the one suggested here, and then comparing the results. If the two fits match, then this is the best linear fit to your data. If they don't match, there may be something wrong in your implementation.
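As a rough sketch of such a comparison, the snippet below fits the same kind of design matrix twice: once with Octave's backslash operator (a least-squares solve, used here as the reference) and once with the gradient descent update and settings from the question. The data are a made-up, noiseless stand-in and not the values from Admission_Predict.txt, and the names `gre`, `X`, `theta_ref`, and `cost` are only illustrative.

```octave
% Synthetic stand-in data (an assumption, NOT the real admission data):
% GRE-like scores with an exactly linear response, so the true fit is known.
gre = (290:5:340)';
y   = 0.01 * gre - 2.5;
m   = length(y);

% Same design matrix as in the question: intercept column plus score / 300
X = [ones(m,1), gre ./ 300];

% Reference fit: least-squares solution via the backslash operator
theta_ref = X \ y;

% Gradient descent with the question's update rule and settings
theta = zeros(2,1);
alpha = 0.01;
iters = 1500;
for i = 1:iters
  theta = theta - (alpha / m) * (X' * (X * theta - y));
endfor

% Same cost as computeCost; the reference fit gives the lowest achievable value
cost = @(t) sum((X * t - y) .^ 2) / (2 * m);
printf('reference:        theta = [%g %g], cost = %g\n', theta_ref, cost(theta_ref));
printf('gradient descent: theta = [%g %g], cost = %g\n', theta, cost(theta));
```

If the two parameter vectors and costs agree closely, the gradient descent code is doing its job. If the gradient descent cost is clearly higher than the reference cost, it is worth checking whether the loop has simply not converged yet (more iterations, a different `alpha`, or a different scaling of the feature) before suspecting a bug.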