Tags: python, numpy, machine-learning, logistic-regression, gradient-descent

Logistic regression cost change becomes constant


After just a few iterations of gradient descent, the change in the cost function becomes constant, which is definitely not how it should behave:

[Plot: cost over time]

The initial result of the gradient descent function seems correct, as do the results of the cost function and, by extension, the hypothesis function, so I presume the problem doesn't lie there. Sorry if the question is too vague, but I couldn't narrow the problem down any further myself. If you could explain what is wrong in my program, I'd be grateful.

This is how the data looks that I'm using:

34.62365962451697,78.0246928153624,0

30.28671076822607,43.89499752400101,0

35.84740876993872,72.90219802708364,0

60.18259938620976,86.30855209546826,1

79.0327360507101,75.3443764369103,1

45.08327747668339,56.3163717815305,0

And this is the code:

import numpy as np
from matplotlib import pyplot as plt


# Load the data: two feature columns and a 0/1 label in the last column.
data = np.genfromtxt("ex2data1.txt", delimiter=",")

# Features, with a column of ones prepended for the intercept term.
X = data[:,0:-1]
X = np.array(X)
m = len(X)
ones = np.ones((m,1))
X = np.hstack((ones,X))

# Labels as an (m,1) column vector.
Y = data[:,-1]
Y = np.array(Y)
Y = Y.reshape((m,1))

# Cost_History[0] collects the cost values, Cost_History[1] the iteration numbers.
Cost_History = [[],[]]

def Sigmoid(z):
    # Logistic function: 1/(1 + e^(-z)).
    G = float(1/float(1+np.exp(-1.0*z)))
    return G

def Hypothesis(theta, x):
    # Model prediction h_theta(x) = Sigmoid(x . theta).
    z = np.dot(x,theta)
    return Sigmoid(z)

def Cost_Function(X, Y, theta, m):
    # Cross-entropy cost J(theta), accumulated one training example at a time.
    sumOfErrors = 0
    for i in range(m):
        xi = X[i]
        yi = Y[i]
        hi = Hypothesis(theta, xi)
        sumOfErrors += yi*np.log(hi) + (1-yi)*np.log(1-hi)
    const = -(1/m)
    J = const * sumOfErrors
    return J

def Cost_Function_Derivative(X, Y, theta, feature, alpha, m):
    # Partial derivative of the cost with respect to theta[feature],
    # already multiplied by the learning rate alpha.
    sumErrors = 0
    for i in range(m):
        xi = X[i]
        yi = Y[i]
        hi = Hypothesis(theta, xi)
        error = (hi - yi)*xi[feature]
        sumErrors += error
    constant = float(alpha)/float(m)
    J = constant * sumErrors
    return J

def Gradient_Descent(X, Y, theta, alpha, m):
    # Update every component of theta simultaneously.
    new_theta = np.zeros((len(theta),1))
    for feature in range(len(theta)):
        CFDerivative = Cost_Function_Derivative(X, Y, theta, feature, alpha, m)
        new_theta[feature] = theta[feature] - CFDerivative
    return new_theta


def Logistic_Regression(X,Y,alpha, theta, iterations, m):
    # Run gradient descent for a fixed number of iterations,
    # recording the cost after every update.
    for iter in range(iterations):
        theta = Gradient_Descent(X, Y, theta, alpha, m)
        cost = Cost_Function(X, Y, theta, m)
        Cost_History[0].append(cost)
        Cost_History[1].append(iter)
        if iter % 100 == 0:
            print(theta, cost, iter)
    return theta

alpha = 0.001
iterations = 1500

theta = np.zeros((len(X[0]),1))
theta = Logistic_Regression(X, Y, alpha, theta, iterations, m)
print(theta)
test = np.array((1,85,45))  # intercept term plus two exam scores
print(Hypothesis(theta, test))

# Fraction of training examples the learned model misclassifies.
wrong = 0
for i in range(m):
    xi = X[i]
    yi = Y[i]
    hi = Hypothesis(theta, xi)
    if yi != round(hi):
        wrong+=1
print(wrong/m)

plt.plot(Cost_History[1], Cost_History[0], "b")
plt.show()

Solution

  • From the given plot, it is apparent that the cost is in fact still decreasing. A quick bit of searching reveals that your data can currently be found here, and by running your code for 500000 iterations, what I get is the following, which looks just as you would expect:

    [Plot: cost over 500000 iterations of gradient descent, decreasing as expected]
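
    For reference, a plot like this can be reproduced with the question's own loop-based code. The sketch below is mine, not part of the original answer; it assumes the functions and globals from the question (X, Y, m, alpha, Cost_History, Logistic_Regression) have already been defined, and it will take a while since the per-example loops are slow:

    # Discard the earlier 1500-iteration history, then rerun for 500000 iterations
    # (progress is still printed every 100 iterations by Logistic_Regression).
    Cost_History[0].clear(); Cost_History[1].clear()
    theta = Logistic_Regression(X, Y, alpha, np.zeros((len(X[0]), 1)), 500000, m)
    plt.plot(Cost_History[1], Cost_History[0], "b")
    plt.xlabel("iteration")
    plt.ylabel("cost")
    plt.show()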

    After about 20000000 steps, the value of theta becomes [-25.15510086, 0.20618186, 0.20142117]. To ensure that this is what we would expect, we can compare the value with the parameters obtained from scikit-learn's implementation with a large value of C (i.e. with essentially no regularization):

    from sklearn.linear_model import LogisticRegression
    # X already contains a column of ones, so the intercept ends up split between
    # model.intercept_ and the first coefficient; summing them recovers theta_0.
    model = LogisticRegression(C=1e10, tol=1e-6).fit(X, Y.ravel())
    model.coef_[0] + np.array([model.intercept_[0], 0, 0])
    # array([-25.16127356,   0.20623123,   0.20147112])
    

    Or the same thing in statsmodels:

    from statsmodels.discrete.discrete_model import Logit
    # No add_constant needed here, since X already includes the intercept column.
    model = Logit(Y.ravel(), X)
    model.fit().params
    # array([-25.16133357,   0.20623171,   0.2014716 ])
    

    Now of course, running the algorithm for that many steps is bound to take a while. In practice, you would typically turn to second-order optimization routines or stochastic gradient descent instead (see the sketch after the rewritten functions below), but as it turns out, most of your code can be phrased in terms of vectorized operations such as matrix multiplications, which speeds it up enough to converge reasonably fast. In particular, your methods can be rewritten as follows, the only difference being that Y no longer has to be reshaped:

    def hypothesis(theta, X):
        # Vectorized sigmoid(X @ theta) for all training examples at once.
        return 1/(1+np.exp(-np.dot(X, theta)))

    def cost_function(X, Y, theta):
        h = hypothesis(theta, X)
        return -np.mean(Y*np.log(h) + (1-Y)*np.log(1-h))

    def gradient_descent(X, Y, theta, alpha):
        # One vectorized gradient step: theta - (alpha/m) * X^T (h - Y).
        h = hypothesis(theta, X)
        return theta - alpha*np.dot(X.T, h - Y)/len(X)

    cost_history = [[], []]  # cost values and iteration numbers

    def logistic_regression(X, Y, alpha, theta, iterations):
        for iter in range(iterations):
            theta = gradient_descent(X, Y, theta, alpha)
            if iter % 100000 == 0:
                cost = cost_function(X, Y, theta)
                cost_history[0].append(cost)
                cost_history[1].append(iter)
                print(theta, cost, iter)
        return theta
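
    As a rough illustration of the second-order routines mentioned above (this is a sketch of mine, not part of the original code, and the cost_and_grad helper is just an assumed name), the same vectorized cost and gradient can be handed to a quasi-Newton optimizer such as SciPy's BFGS:

    from scipy.optimize import minimize

    def cost_and_grad(theta, X, Y):
        # Vectorized cost and gradient returned together so that
        # minimize() can use the analytic gradient (jac=True).
        h = 1/(1 + np.exp(-np.dot(X, theta)))
        cost = -np.mean(Y*np.log(h) + (1 - Y)*np.log(1 - h))
        grad = np.dot(X.T, h - Y)/len(X)
        return cost, grad

    result = minimize(cost_and_grad, np.zeros(X.shape[1]), args=(X, Y.ravel()),
                      jac=True, method="BFGS")
    print(result.x)  # roughly [-25.16, 0.206, 0.201], matching the values above

    On this unscaled data the optimizer may emit a few overflow warnings from np.exp during line searches, but it reaches essentially the same theta far faster than plain gradient descent.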