I am building a logistic regression model with statsmodels (statsmodels.api) and would like to understand how to get predictions for the test dataset. This is what I have so far:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

x_train_data, x_test_data, y_train_data, y_test_data = train_test_split(X, df[target_var], test_size=0.3)
logit = sm.Logit(
    y_train_data,
    x_train_data
)
result = logit.fit()
print(result.summary())
What is the best way to get and print predictions for the training and test sets, as below? I am unsure which metrics to use or import in this case:
in_sample_pred = result.predict(x_train_data)
out_sample_pred = result.predict(x_test_data)
Also, what's the best way to calculate the ROC AUC score and plot the ROC curve for this logistic regression model (using scikit-learn)?
Thanks
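Maybe the confusion is this: statsmodels' Logit is a logistic regression model used for classification, and its predict method already returns the probability of the positive class. Those probabilities are what sklearn's roc_auc_score expects as scores, so no separate regression metric is needed.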
To predict based on your x_test_data, all you have to do is:
x_test_predicted = result.predict(x_test_data)
print(x_test_predicted)
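Note that these predictions are probabilities, not 0/1 labels. If you need hard class labels (for accuracy or a confusion matrix, say), you have to pick a cutoff yourself. A minimal sketch, assuming a cutoff of 0.5 and using made-up probabilities in place of the output of result.predict (the names predicted_probs, threshold and predicted_labels are mine, not from the question):

```python
import numpy as np

# Hypothetical probabilities, standing in for result.predict(x_test_data)
predicted_probs = np.array([0.12, 0.85, 0.47, 0.66])

# Assumed cutoff: 0.5 is a common default, not something the model dictates
threshold = 0.5

# Probabilities at or above the cutoff become class 1, the rest class 0
predicted_labels = (predicted_probs >= threshold).astype(int)
print(predicted_labels)
```

For ROC AUC, keep the raw probabilities; thresholding first throws away the ranking information the metric relies on.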
If you want a better look at the predictions, you could put them next to the true labels in a DataFrame:
import pandas as pd
df_test_predictions = pd.DataFrame({
    'x_test_predicted': x_test_predicted,
    'y_test': y_test_data
})
Then to calculate ROC-AUC, you can do:
from sklearn.metrics import roc_auc_score
score = roc_auc_score(y_test_data, x_test_predicted)
print(score)
Finally, for the plot, refer to this previously answered question. The barebones version is the following:
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
# Use the true test labels and the predicted probabilities from above
fpr, tpr, threshold = metrics.roc_curve(y_test_data, x_test_predicted)
plt.plot(fpr, tpr)
plt.show()
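If you want something a bit more readable than the bare curve, here is a sketch with axis labels, a chance diagonal, and the AUC in the legend. The labels and scores are made up so the snippet runs by itself; in your case you would pass y_test_data and the predicted probabilities instead. The output filename roc_curve.png is also just a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

# False/true positive rates at each candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)  # area under the curve via the trapezoidal rule

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")
fig.savefig("roc_curve.png")  # placeholder filename
```

The dashed diagonal is what a random classifier would produce, so the further the curve bows toward the top-left corner, the better the model separates the classes.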