I recently learnt logistic regression and wanted to practice it. I am currently using this dataset from Kaggle. I tried to define a cost function in this manner (I made all the necessary imports):
# Defining the hypothesis
sigmoid = lambda x: 1 / (1 + np.exp(-x))
predict = lambda trainset, parameters: sigmoid(trainset @ parameters)
# Defining the cost
def cost(theta):
    #print(X.shape, y.shape, theta.shape)
    preds = predict(X, theta.T)
    errors = (-y * np.log(preds)) - ((1-y)*np.log(1-preds))
    return np.mean(errors)
theta = []
for i in range(13):
    theta.append(1)
theta = np.array([theta])
cost(theta)
and when I run this cell I get:
/opt/venv/lib/python3.7/site-packages/ipykernel_launcher.py:9: RuntimeWarning: divide by zero encountered in log
if __name__ == '__main__':
/opt/venv/lib/python3.7/site-packages/ipykernel_launcher.py:9: RuntimeWarning: invalid value encountered in multiply
if __name__ == '__main__':
nan
When I searched online, I got the advice to normalise the data and then try it. So this is how I did it:
df = pd.read_csv("/home/jovyan/work/heart.csv")
df.head()
# The dataset is 303x14 in size (using df.shape)
length = df.shape[0]
# Output vector
y = df['target'].values
y = np.array([y]).T
# We name trainingset as X for convenience
trainingset = df.drop(['target'], axis = 1)
#trainingset = df.insert(0, 'bias', 1)
minmax_normal_trainset = (trainingset - trainingset.min())/(trainingset.max() - trainingset.min())
X = trainingset.values
I really don't know where the division by zero error is occurring and how to fix it. If I made any mistakes in this implementation please correct me. I am sorry if this has been asked before, but all I could find was the tip to normalise the data. Thanks in advance!
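np.log(0) emits a "divide by zero" RuntimeWarning and returns -inf; multiplying that -inf by 0 then gives nan, which is the "invalid value encountered in multiply" warning. So it's this part that's causing the problem: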
errors = (-y * np.log(preds)) - ((1 - y) * np.log(1 - preds))
               #############                 #################
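preds can be exactly 0 or 1 in floating point when x (the input to the sigmoid) is large in magnitude. With float64, the sigmoid rounds to exactly 1 once x exceeds roughly 37, and np.exp(-x) overflows, giving exactly 0, once x drops below about -709 (at least on my machine). With unscaled features and all-ones parameters, x easily gets that large, which is why normalizing the data to be between 0 and 1 solves the problem. Note that the posted code sets X = trainingset.values, so the normalized frame minmax_normal_trainset is never actually used; X = minmax_normal_trainset.values would apply it.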
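As a rough illustration of the saturation described above, here is a minimal sketch (not the asker's exact code) that shows the sigmoid rounding to exactly 1.0 and one common workaround: clipping the predictions away from 0 and 1 before taking the log. The name safe_cost, the parameter eps, its value 1e-12, and the toy data are all assumptions made up for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Saturation: exp(-40) is far below float64 epsilon, so 1 + exp(-40) rounds to 1.
print(sigmoid(40.0) == 1.0)  # True

def safe_cost(X, y, theta, eps=1e-12):
    """Mean cross-entropy with predictions clipped to [eps, 1 - eps].

    eps is a hypothetical tolerance chosen only for illustration.
    """
    preds = sigmoid(X @ theta)
    preds = np.clip(preds, eps, 1 - eps)  # keeps both logs finite
    errors = -y * np.log(preds) - (1 - y) * np.log(1 - preds)
    return np.mean(errors)

# Toy data with large, unnormalized feature values (invented for the example).
X = np.array([[200.0, 150.0], [180.0, 120.0]])
y = np.array([[1.0], [0.0]])
theta = np.ones((2, 1))

print(np.isfinite(safe_cost(X, y, theta)))  # True
```

Clipping only hides the symptom: the cost stays finite, but saturated predictions still produce very small gradients, so scaling the features remains worthwhile.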
EDIT:
You may want to normalize to a larger range than (0, 1): your sigmoid function as currently set is pretty much linear in that range. Maybe use:
minmax_normal_trainset = c * (trainingset - trainingset.mean())/(trainingset.std())
and tune c for better convergence.
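A minimal sketch of that kind of scaling on a plain NumPy array, assuming a hypothetical helper named standardize and made-up sample values. One difference to be aware of: NumPy's std defaults to the population standard deviation (ddof=0), whereas pandas' DataFrame.std defaults to the sample standard deviation (ddof=1), so the two give slightly different numbers.

```python
import numpy as np

def standardize(features, c=1.0):
    """Scale each column to zero mean and standard deviation c.

    `standardize` and `c` are illustrative names, not part of any library.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by zero
    return c * (features - mean) / std

# Made-up values standing in for two feature columns.
raw = np.array([[63.0, 233.0],
                [37.0, 250.0],
                [41.0, 204.0],
                [56.0, 236.0]])

scaled = standardize(raw, c=2.0)
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 2 for each column
```

A larger c spreads the sigmoid inputs further from 0, so it is worth trying a few values and watching how the cost behaves.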