I am doing a logistic regression using sklearn, but the fit I get is very steep, going almost straight from 0 to 1 (see image). Can anyone tell me which parameter I should work on so that the fit comes out smoother?
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

X = curve_data[0][0]
y = curve_data[0][1]
clf = LogisticRegression(C=1e5, fit_intercept=True)
clf.fit(X.reshape(-1, 1), y)
X_test = np.linspace(0, 1000, 5000)
a_voir = clf.predict(X_test.reshape(-1, 1)) # test, to delete
loss = expit(X_test * clf.coef_ + clf.intercept_).ravel() # smooth probability curve (not plotted)
# midpoint[i] = (logit(0.5) - clf.intercept_) / clf.coef_
plt.figure()
plt.scatter(data_1a_0V_VCASN_48_all[5][0], data_1a_0V_VCASN_48_all[5][1])
plt.plot(X_test, a_voir)
plt.show()
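One thing to check first: clf.predict returns hard 0/1 class labels, so a plot of it will always look like a step no matter how the model is regularized, while clf.predict_proba (or expit of the decision function, as in your loss line) gives the smooth sigmoid you are after. A minimal sketch of the difference, reusing clf and X_test from your snippet:

hard_labels = clf.predict(X_test.reshape(-1, 1)) # always exactly 0 or 1
probs = clf.predict_proba(X_test.reshape(-1, 1))[:, 1] # smooth class-1 probabilities
plt.figure()
plt.plot(X_test, hard_labels, label="predict (hard labels)")
plt.plot(X_test, probs, label="predict_proba (class-1 probability)")
plt.legend()
plt.show()

Beyond that, the steepness of the fitted curve is governed by C (the inverse regularization strength) together with the scale of X, which is what the example below demonstrates.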
Here is a small example so you can see what happens when the data is not scaled:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
%matplotlib notebook
import matplotlib.pyplot as plt
np.random.seed(42) # for reproducibility
X = np.random.rand(100, 1) * 1000 # generate a random vector that ranges from 0 to 1000
X_test = np.linspace(0, 1000, 5000).reshape(-1, 1) # generate testing data
y = (X > 500) # generate binary classification labels
y_int = y.astype(int).flatten() # convert to 0 and 1
scaler_X = MinMaxScaler() # scaler
# scaled
X_scaled = scaler_X.fit_transform(X) # scale X
clf_scaled = LogisticRegression(C=1e4, fit_intercept=True) # logistic with scaled
clf_scaled.fit(X_scaled, y_int) # fit logistic with scaled
X_test_scaled = scaler_X.transform(X_test) # scale test data
probabilities_scaled = clf_scaled.predict_proba(X_test_scaled)[:, 1] # get probabilities of test data
a_voir_scaled = probabilities_scaled * (np.max(y_int) - np.min(y_int)) + np.min(y_int) # map back to the label range (a no-op here, since the labels are already 0/1)
midpoint_scaled = -clf_scaled.intercept_[0] / clf_scaled.coef_[0, 0] # x where p = 0.5 in scaled space (logit(0.5) == 0)
midpoint_scaled = scaler_X.inverse_transform([[midpoint_scaled]])[0, 0] # map the midpoint back onto the original, unscaled axis
# unscaled
clf_unscaled = LogisticRegression(C=1e4, fit_intercept=True) # logistic with unscaled
clf_unscaled.fit(X, y_int) # fit logistic with unscaled data
probabilities_unscaled = clf_unscaled.predict_proba(X_test)[:, 1] # get probabilities of test data, unscaled
a_voir_unscaled = probabilities_unscaled * (np.max(y_int) - np.min(y_int)) + np.min(y_int) # map back to the label range (again a no-op for 0/1 labels)
midpoint_unscaled = -clf_unscaled.intercept_[0] / clf_unscaled.coef_[0, 0] # x where p = 0.5, already on the original axis (logit(0.5) == 0)
# Plot the original data and the logistic regression curve
plt.scatter(X, y_int, label='Original')
plt.plot(X_test, a_voir_scaled, label='scaled')
plt.plot(X_test, a_voir_unscaled, label='unscaled')
plt.axvline(midpoint_scaled, linestyle = "--", label = "midpoint from scaled")
plt.axvline(midpoint_unscaled, linestyle = "-.", label = "midpoint from unscaled")
plt.grid()
plt.xlabel('X')
plt.ylabel('y')
plt.legend(loc = "upper left", ncols = 1)
plt.show()
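A side note on the midpoint lines: solving expit(coef * x + intercept) = 0.5 gives coef * x + intercept = logit(0.5) = 0, so the midpoint is simply -intercept / coef (the commented-out line in your snippet with logit(0.5) is the same formula, since logit(0.5) = 0). The labels above were generated as X > 500, so both estimates can be checked against that true threshold:

print("midpoint, scaled model (mapped back to the original axis):", midpoint_scaled)
print("midpoint, unscaled model:", midpoint_unscaled)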
It is imperative to scale your data unless you have a reason to preserve the original variance of your variables. You can see from the orange (unscaled) curve how going without any scaling can damage the predictions. Think of scaling as trying to hear a pin drop in a silent room as opposed to at a heavy-metal concert: without it, the fit is dominated by the large raw magnitudes in the data and misses the finer details. Play around with C in both examples and see the difference it makes; you will find that going as low as 100 is also fine for the scaled case. A quick sweep over C on the scaled data is sketched below.
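This is only a sketch, reusing X, y_int, X_scaled, X_test and X_test_scaled from the example above; smaller C means stronger regularization, hence smaller coefficients and a flatter, smoother sigmoid:

# effect of C on the steepness of the fitted sigmoid (scaled data)
plt.figure()
for C in [0.1, 1, 100, 1e4]:
    clf_c = LogisticRegression(C=C, fit_intercept=True)
    clf_c.fit(X_scaled, y_int)
    plt.plot(X_test, clf_c.predict_proba(X_test_scaled)[:, 1], label=f"C = {C:g}")
plt.scatter(X, y_int, s=10, label="data")
plt.legend()
plt.show()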
The results: