python scikit-learn logistic-regression prediction roc

Logistic Regression Model - Prediction and ROC AUC

I am building a logistic regression model with statsmodels (statsmodels.api) and would like to understand how to get predictions for the test dataset. This is what I have so far:

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

x_train_data, x_test_data, y_train_data, y_test_data = train_test_split(X, df[target_var], test_size=0.3)

logit = sm.Logit(
    y_train_data,
    x_train_data
)

result = logit.fit()
result.summary()

What is the best way to get and print the predictions for the training and test data, as below? I am unsure which metrics to use or import in this case:

in_sample_pred = result.predict(x_train_data)
out_sample_pred = result.predict(x_test_data)

Also, what is the best way to calculate the ROC AUC score and plot the ROC curve for this logistic regression model using scikit-learn?

Thanks

Solution

The confusion may come from the fact that statsmodels' Logit is a logistic regression model used for classification, and its predict method already returns probabilities rather than class labels. Those probabilities are what sklearn's roc_auc_score expects as scores.

To predict based on your x_test_data, all you have to do is:

x_test_predicted = result.predict(x_test_data)

print(x_test_predicted)

If you want a better look at the predictions, you can put them next to the true labels in a DataFrame:

import pandas as pd 

df_test_predictions = pd.DataFrame({
    'x_test_predicted': x_test_predicted, 
    'y_test': y_test_data 
    })
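
On the question of which metrics to use: since the predictions are probabilities, classification metrics such as accuracy or a confusion matrix need class labels first. A small sketch, assuming a cutoff of 0.5 (a common convention, not something statsmodels applies for you) and a handful of made-up values standing in for y_test_data and x_test_predicted:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 0])
probs = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.3])

# Turn probabilities into 0/1 labels with an assumed cutoff of 0.5.
threshold = 0.5
labels = (probs >= threshold).astype(int)

acc = accuracy_score(y_true, labels)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, labels)

print(labels)
print(acc)
print(cm)
```

Note that ROC AUC is computed from the probabilities directly, so the cutoff only matters for label-based metrics like these.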

Then to calculate ROC-AUC, you can do:

from sklearn.metrics import roc_auc_score

score = roc_auc_score(y_test_data, x_test_predicted)
print(score)
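
If it helps to interpret the number: ROC AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that directly in plain Python. pairwise_auc is a hypothetical helper written for this answer, not a scikit-learn function, and the data is made up:

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.6, 0.8, 0.4, 0.9, 0.3]

auc = pairwise_auc(y_true, scores)  # 8 of the 9 pairs are ordered correctly
print(auc)
```

This is quadratic in the number of rows, so it is only meant to show what the score measures. For real data, stick with roc_auc_score.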

Finally, for the plot, refer to this previously answered question. The bare-bones version is the following:

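import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# Use the same true labels and predicted probabilities as above
fpr, tpr, threshold = metrics.roc_curve(y_test_data, x_test_predicted)

plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()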
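
A slightly more complete version of the plot, with the AUC in the legend and a diagonal reference line for a random classifier, could look like the sketch below. plot_roc is a hypothetical helper name, and the labels and scores are made up; in your case you would pass y_test_data and the predicted probabilities:

```python
import matplotlib.pyplot as plt
from sklearn import metrics


def plot_roc(y_true, scores):
    """Draw a ROC curve and return the figure, the axes and the AUC."""
    fpr, tpr, _ = metrics.roc_curve(y_true, scores)
    roc_auc = metrics.auc(fpr, tpr)  # area under the (fpr, tpr) curve

    fig, ax = plt.subplots()
    ax.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.3f})")
    ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend(loc="lower right")
    return fig, ax, roc_auc


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.6, 0.8, 0.4, 0.9, 0.3]

fig, ax, roc_auc = plot_roc(y_true, scores)
```

Call plt.show() afterwards (or fig.savefig with a file name of your choice) to actually display or save the figure.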