I'm trying to do Logistic Regression from Coursera in Julia, but it doesn't work.
The Julia code to calculate the Gradient:
sigmoid(z) = 1 / (1 + e ^ -z)
hypotesis(theta, x) = sigmoid(scalar(theta' * x))
function gradient(theta, x, y)
(m, n) = size(x)
h = [hypotesis(theta, x[i,:]') for i in 1:m]
g = Array(Float64, n, 1)
for j in 1:n
g[j] = sum([(h[i] - y[i]) * x[i, j] for i in 1:m])
end
g
end
If this gradient used it produces the wrong results. Can't figure out why, the code seems like the right one.
The full Julia script. In this script the optimal Theta calculated using my Gradient Descent implementation and using the built-in Optim package, and the results are different.
I tried comparing gradient()
in the OP's code with numerical derivative of cost_j()
(which is the objective function of minimization) using the following routine
function grad_num( theta, x, y )
g = zeros( 3 )
eps = 1.0e-6
disp = zeros( 3 )
for k = 1:3
disp[:] = theta[:]
disp[ k ]= theta[ k ] + eps
plus = cost_j( disp, x, y )
disp[ k ]= theta[ k ] - eps
minus = cost_j( disp, x, y )
g[ k ] = ( plus - minus ) / ( 2.0 * eps )
end
return g
end
But the gradient values obtained from the two routines do no seem to agree very well (at least for the initial stage of minimization)... So I manually derived the gradient of cost_j( theta, x, y )
, from which it seems that the division by m
is missing:
#/ OP's code
# g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] )
#/ modified code
g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] ) / m
Because I am not very sure if the above code and expression are really correct, could you check them by yourself...?
But in fact, regardless of whether I use the original or corrected gradients, the program converges to the same minimum value (0.2034977016, almost the same as obtained from Optim), because the two gradients differ only by a multiplicative factor! Because the convergence was very slow, I also modified the stepsize alpha
adaptively following the suggestion by Vincent (here I used more moderate values for acceleration/deceleration):
function gradient_descent(x, y, theta, alpha, n_iterations)
...
c = cost_j( theta, x, y )
for i = 1:n_iterations
c_prev = c
c = cost_j( theta, x, y )
if c - c_prev < 0.0
alpha *= 1.01
else
alpha /= 1.05
end
theta[:] = theta - alpha * gradient(theta, x, y)
end
...
end
and called this routine as
optimal_theta = gradient_descent( x, y, [0 0 0]', 1.5e-3, 10^7 )[ 1 ]
The variation of cost_j
versus iteration steps is plotted below.