Tags: python-3.x, pandas, numpy, nan, logistic-regression

Logistic regression cost function returning nan


I learnt logistic regression recently, and I wanted to practice it. I am currently using this dataset from Kaggle. I tried to define a cost function in this manner (I made all the necessary imports):

import numpy as np

# Defining the hypothesis
sigmoid = lambda x: 1 / (1 + np.exp(-x))
predict = lambda trainset, parameters: sigmoid(trainset @ parameters)

# Defining the cost
def cost(theta):
    #print(X.shape, y.shape, theta.shape)
    preds = predict(X, theta.T)
    errors = (-y * np.log(preds)) - ((1-y)*np.log(1-preds))
    return np.mean(errors)

theta = np.ones((1, 13))  # initial guess: one weight per feature, shape (1, 13)
cost(theta)

and when I run this cell I get:

/opt/venv/lib/python3.7/site-packages/ipykernel_launcher.py:9: RuntimeWarning: divide by zero encountered in log
  if __name__ == '__main__':
/opt/venv/lib/python3.7/site-packages/ipykernel_launcher.py:9: RuntimeWarning: invalid value encountered in multiply
  if __name__ == '__main__':
nan

When I searched online, the advice I found was to normalise the data and try again. This is how I did it:

import numpy as np
import pandas as pd

df = pd.read_csv("/home/jovyan/work/heart.csv")
df.head()

# The dataset is 303x14 in size (using df.shape)
length = df.shape[0]

# Output vector
y = df['target'].values
y = np.array([y]).T

# We name trainingset as X for convenience
trainingset = df.drop(['target'], axis = 1)
#trainingset = df.insert(0, 'bias', 1)

minmax_normal_trainset = (trainingset - trainingset.min())/(trainingset.max() - trainingset.min())
X = trainingset.values

I really don't know where the division by zero error is occurring or how to fix it. If I have made any mistakes in this implementation, please correct me. I am sorry if this has been asked before, but all I could find was the tip to normalise the data. Thanks in advance!


Solution

  • np.log(0) triggers the "divide by zero" warning and evaluates to -inf. Whenever that -inf is then multiplied by a zero coefficient (for example, y = 0 in the first term), the product is nan (the "invalid value encountered in multiply" warning), and the nan propagates through np.mean. So it's this part that's causing the problems:

    errors = (-y * np.log(preds)) - ((1 - y) * np.log(1 - preds))
                   ##############              #################
    

    preds saturates to exactly 1.0 once x exceeds roughly 37 (np.exp(-x) falls below machine epsilon) and to exactly 0.0 once x drops below about -709 (np.exp(-x) overflows to inf) because of floating point math, at least on my machine. Either way one of the two log terms receives 0, which is why normalizing the features so that x stays small solves the problem. Note, however, that the code you posted never actually uses the normalized frame: X is assigned trainingset.values, so it should be X = minmax_normal_trainset.values. A defensive alternative is to clip preds away from 0 and 1 before taking the log, as in the first sketch below.

    EDIT:

    You may want to normalize to a wider range than (0, 1): your sigmoid is almost linear for inputs that small. Maybe standardize instead (z-score rather than min-max):

     standardized_trainset = c * (trainingset - trainingset.mean()) / trainingset.std()
    

    And tune c for better convergence; the second sketch below shows this end to end.
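
For illustration, here is a minimal sketch of the clipping approach mentioned above. It mirrors the cost from the question but passes X and y explicitly instead of relying on globals; the safe_cost name and the eps value are illustrative choices, not from the original post:

import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))
predict = lambda trainset, parameters: sigmoid(trainset @ parameters)

def safe_cost(theta, X, y, eps=1e-15):
    preds = predict(X, theta.T)
    # Clip into (eps, 1 - eps) so neither log term can receive an exact 0,
    # even when the sigmoid has saturated to 0.0 or 1.0.
    preds = np.clip(preds, eps, 1 - eps)
    errors = (-y * np.log(preds)) - ((1 - y) * np.log(1 - preds))
    return np.mean(errors)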
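
And a hypothetical end-to-end check, reusing safe_cost from the sketch above and assuming the same heart.csv layout as the question; c = 1.0 is just a starting value to tune:

import numpy as np
import pandas as pd

df = pd.read_csv("/home/jovyan/work/heart.csv")
y = df['target'].values.reshape(-1, 1)     # column vector, shape (303, 1)
trainingset = df.drop(['target'], axis=1)

c = 1.0  # scaling constant to tune for convergence
standardized_trainset = c * (trainingset - trainingset.mean()) / trainingset.std()
X = standardized_trainset.values           # use the standardized values, not the raw ones

theta = np.ones((1, X.shape[1]))
print(safe_cost(theta, X, y))              # a finite number now, rather than nan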