
Improper cost function outputs for Vectorized Logistic Regression


I'm trying to implement vectorized logistic regression on the Iris dataset, following the implementation from Andrew Ng's YouTube series on deep learning. My best predictions using this method have been 81% accuracy, while sklearn's implementation achieves 100% with completely different values for the coefficients and bias. I also can't seem to get proper outputs from my cost function. I suspect the issue is in how I compute the gradients of the weights and bias with respect to the cost function, though he provides all of the necessary equations in the course (unless something in the actual exercise, which I don't have access to, is being left out). My code is as follows.

n = 4
m = 150

y = y.reshape(1, 150)
X = X.reshape(4, 150)
W = np.zeros((4, 1))
b = np.zeros((1,1))

for epoch in range(1000):
    Z = np.dot(W.T, X) + b
    A = sigmoid(Z)                                           # 1/(1 + e **(-Z))
    J = -1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A)))   #cost function
    dz = A - y
    dw = 1/m * np.dot(X, dz.T)
    db = np.sum(dz)
    W = W - 0.01 * dw 
    b = b - 0.01 * db

    if epoch % 100 == 0:
        print(J)

My output looks something like this.

   -1.6126604413879289
   -1.6185960074767125
   -1.6242504226045396
   -1.6296400635926438
   -1.6347800862216104
   -1.6396845400653066
   -1.6443664703028427
   -1.648838008214648
   -1.653110451818512
   -1.6571943378913891

W and b values are:

  array([[-0.68262679, -1.56816916,  0.12043066,  1.13296948]])
  array([[0.53087131]])

Whereas sklearn outputs:

  (array([[ 0.41498833,  1.46129739, -2.26214118, -1.0290951 ]]),
   array([0.26560617]))

I understand sklearn uses L2 regularization, but even with it turned off the values are still far from matching mine. Any help would be appreciated. Thanks
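
For reference, this is roughly how I turned the regularization off (sklearn's `C` is the inverse of the regularization strength, so a very large `C` effectively disables the L2 penalty):

from sklearn.linear_model import LogisticRegression

# C is the inverse regularization strength; a huge value
# effectively switches the L2 penalty off
clf = LogisticRegression(C=1e10)
clf.fit(X.T, y.ravel())    # sklearn wants (n_samples, n_features)
print(clf.coef_, clf.intercept_)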


Solution

  • You are likely getting strange results because you are trying to use logistic regression where y is not a binary choice. Categorizing the iris data is a multiclass problem; y can be one of three values:

    >>> np.unique(iris.target)
    array([0, 1, 2])
    

    The cross-entropy cost function expects y to be either one or zero. One way to handle this is the one-vs-all method.

    You can check each class by making y a boolean indicating whether the iris is in that class or not. For example, here you can make y a data set of whether each sample is class 1 or not:

    y = (iris.target == 1).astype(int)
    

    With that, your cost function and gradient descent should work, but you'll need to run it once per class and pick the highest score for each example, as sketched below. Andrew Ng's class talks about this method.
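
    A minimal sketch of that one-vs-all loop (here `train_binary` is just the gradient-descent code from the end of this answer wrapped in a helper; the name is mine, not from the course):

    import numpy as np
    from sklearn import datasets

    def sigmoid(Z):
        return 1/(1 + np.exp(-Z))

    def train_binary(X, Y, lr=0.1, epochs=1000):
        # Gradient descent for one binary question: is the sample
        # in this class (Y = 1) or not (Y = 0)?
        W = np.zeros((X.shape[1], 1))
        b = 0.0
        m = Y.shape[0]
        for _ in range(epochs):
            A = sigmoid(np.dot(X, W) + b)
            dz = A - Y
            W -= lr * np.dot(X.T, dz) / m
            b -= lr * np.mean(dz)
        return W, b

    iris = datasets.load_iris()
    X = iris.data    # shape (150, 4), one row per sample

    # One classifier per class, each trained on "this class vs. the rest"
    models = [train_binary(X, (iris.target == c).astype(int).reshape(-1, 1))
              for c in np.unique(iris.target)]

    # Score every sample under each classifier and predict the class
    # whose classifier assigns the highest probability
    scores = np.hstack([sigmoid(np.dot(X, W) + b) for W, b in models])
    predictions = np.argmax(scores, axis=1)
    print(np.mean(predictions == iris.target))    # training accuracy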

    EDIT:

    It's not clear what you are starting with for data. When I do this, I don't reshape the inputs, so you should double-check that all of your multiplications are delivering the shapes you want. One thing I notice that's a little odd is the last term in your cost function. I generally do this:

    cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
    

    not:

    -1/m * np.sum(y * np.log(A) + (1-y) * (1 - np.log(A)))
    
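    That last term also explains the negative numbers your cost prints: with the correct formula both log terms are non-positive, so the cost is always >= 0, whereas (1 - np.log(A)) is positive for every y = 0 sample and drags the negated sum below zero. A quick check with made-up probabilities:

    import numpy as np

    A = np.array([0.8, 0.3, 0.9])    # example predicted probabilities
    y = np.array([1, 0, 1])          # example labels
    m = len(y)

    correct = -1/m * np.sum(y * np.log(A) + (1 - y) * np.log(1 - A))
    typo    = -1/m * np.sum(y * np.log(A) + (1 - y) * (1 - np.log(A)))

    print(correct)    # ~0.228: non-negative, as cross entropy should be
    print(typo)       # ~-0.625: negative, like the values in the question
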

    Here's code that converges for me using the dataset from sklearn:

    import numpy as np
    from sklearn import datasets
    iris = datasets.load_iris()
    X = iris.data
    # Iris is a multiclass problem. Here, just calculate the probability that
    # the class is `iris_class`
    iris_class = 0
    Y = np.expand_dims((iris.target == iris_class).astype(int), axis=1)
    # Y is now a data set of booleans indicating whether the sample is or isn't a member of iris_class
    
    # initialize w and b
    W = np.random.randn(4, 1)
    b = np.random.randn(1, 1)
    
    a = 0.1              # learning rate
    m = Y.shape[0]        # number of samples
    
    def sigmoid(Z):
        return 1/(1 + np.exp(-Z))
    
    for i in range(1000):
        Z = np.dot(X, W) + b
        A = sigmoid(Z)
        dz = A - Y
        dw = 1/m * np.dot(X.T, dz)
        db = np.mean(dz)
        W -= a * dw
        b -= a * db
        cost = -1/m * np.sum(Y*np.log(A) + (1-Y) * np.log(1-A))
    
        if i % 100 == 0:
            print(cost)
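
    Once that converges, you can sanity-check the fit by thresholding the probabilities, reusing the W, b, X, Y, and sigmoid defined above:

    # Probability that each sample belongs to iris_class, thresholded
    # at 0.5 for a hard yes/no prediction
    probs = sigmoid(np.dot(X, W) + b)
    preds = (probs > 0.5).astype(int)
    print(np.mean(preds == Y))    # fraction classified correctly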