I am currently taking the Machine Learning course on the Coursera platform and I am trying to implement logistic regression. To minimize the cost function I am using gradient descent, and I have to write a function called costFunctionReg.m that returns both the cost and the gradient of each parameter, evaluated at the current set of parameters. The problem is described in more detail below: my cost computation works, but my gradient computation does not. Please note that I would prefer to implement this with explicit loops rather than with vectorized, element-by-element operations.
I am computing theta[0] (in MATLAB, theta(1)) separately, as it is not regularized, i.e. its gradient does not include the regularization term (the one with lambda).
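For reference, these are the quantities I am trying to compute, writing h(i) for sigmoid(X(i,:) * theta) and using MATLAB's 1-based indexing, so theta(1) is the unregularized bias term:

    J = (1/m) * sum_{i=1..m} [ -y(i)*log(h(i)) - (1 - y(i))*log(1 - h(i)) ]
        + (lambda/(2*m)) * sum_{j=2..n} theta(j)^2

    grad(1) = (1/m) * sum_{i=1..m} ( h(i) - y(i) ) * X(i,1)
    grad(j) = (1/m) * sum_{i=1..m} ( h(i) - y(i) ) * X(i,j) + (lambda/m) * theta(j),   for j = 2..n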
function [J, grad] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
%   J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
%   theta as the parameter for regularized logistic regression and the
%   gradient of the cost w.r.t. to the parameters.

% Initialize some useful values
m = length(y);     % number of training examples
n = length(theta); % number of parameters (features)

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t. each parameter in theta

% ---------------------- 1. Compute the cost ----------------------

% hypothesis
h = sigmoid(X * theta);

for i = 1 : m
    % The cost for the ith term before regularization
    J = J - ( y(i) * log(h(i)) ) - ( (1 - y(i)) * log(1 - h(i)) );

    % Adding regularization term
    for j = 2 : n
        J = J + (lambda / (2*m) ) * ( theta(j) )^2;
    end
end
J = J/m;

% ---------------------- 2. Compute the gradients ----------------------

% not regularizing theta[0], i.e. theta(1) in MATLAB
j = 1;
for i = 1 : m
    grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j);
end

for j = 2 : n
    for i = 1 : m
        grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j) + lambda * theta(j);
    end
end

grad = (1/m) * grad;

% =============================================================
end
What am I doing wrong?
The way you are applying regularization is incorrect. The regularization term should be added once, after you have summed over all training examples; instead, you are adding it inside that loop, once per example, so it ends up m times larger than intended. If you leave the code as it is, this inflated term makes every gradient step larger than it should be, gradient descent overshoots the solution, and the overshooting accumulates until you eventually end up with a gradient vector of Inf or -Inf for all components (except for the bias term).
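To see the effect concretely, compare what your code produces for j >= 2 (after the final grad = (1/m) * grad line runs) with the gradient it should produce; this is just the algebra of your loops written out.

What your loops compute (lambda * theta(j) is added m times, then divided by m):

    grad(j) = (1/m) * sum_{i=1..m} ( h(i) - y(i) ) * X(i,j) + lambda * theta(j)

What the regularized gradient should be:

    grad(j) = (1/m) * sum_{i=1..m} ( h(i) - y(i) ) * X(i,j) + (lambda/m) * theta(j)

so the regularization portion of every component is m times stronger than intended.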
Simply put, move your lambda * theta(j) statement so that it is executed after the inner for loop over the training examples terminates (but still inside the outer for j loop):
for j = 2 : n
    for i = 1 : m
        grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j); % Change
    end
    grad(j) = grad(j) + lambda * theta(j); % Change
end
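With this change, the later grad = (1/m) * grad line scales both the data term and the regularization term correctly, giving the (lambda/m) * theta(j) contribution you want. If you would like to convince yourself the corrected gradient is right, here is a minimal numerical check against a central-difference approximation; it assumes sigmoid.m and the corrected costFunctionReg.m are on your path, and the toy X, y, theta and lambda values are made up purely for this check:

% Toy data, chosen arbitrarily for the check
X      = [ones(5,1), (1:5)'/10, (5:-1:1)'/10];   % 5 examples, 3 features (including the bias column)
y      = [1; 0; 1; 0; 1];
theta  = [-2; 1; 2];
lambda = 3;

% Analytic gradient from the function under test
[J, grad] = costFunctionReg(theta, X, y, lambda);

% Central-difference approximation of the same gradient
numgrad = zeros(size(theta));
step    = 1e-4;
for j = 1 : length(theta)
    e = zeros(size(theta));
    e(j) = step;
    numgrad(j) = ( costFunctionReg(theta + e, X, y, lambda) ...
                 - costFunctionReg(theta - e, X, y, lambda) ) / (2*step);
end

disp([grad numgrad]);   % the two columns should agree to several decimal places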