I have a classification problem in which I would like to identify prospective borrowers who should not be invited for a meeting at a bank. In the data, roughly 25% of the borrowers should not be invited. I have around 4,500 observations and 86 features (many of them dummy variables).
After cleaning the data, I do:
# Separate features X and target y
X = ratings_prepared[:, :-1]
y = ratings_prepared[:, -1]
##################################################################################
# Separate test and train (stratified, 20% test)
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
skfolds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in skfolds.split(X, y):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
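Note that the loop overwrites the split on every iteration, so only the last fold is kept as the final train/test split. If a single stratified 80/20 split is all that is needed, train_test_split with stratify does the same thing more directly (a minimal sketch, assuming the same X and y as above):
# Sketch: single stratified 80/20 split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)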
Then I proceed to training the models. An SGD classifier does not work very well:
import matplotlib.pyplot as plt

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center left")
    plt.ylim([0, 1])
############################# Train Models #############################
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score, confusion_matrix

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)
y_pred = sgd_clf.predict(X_train)  # predictions on the training set
# f1 score
f1_score(y_train, y_pred)
# confusion matrix
tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()
(tn, fp, fn, tp)
from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt
disp = plot_confusion_matrix(sgd_clf, X_train, y_train,
cmap=plt.cm.Blues,
normalize='true')
# Precision-Recall curve (using cross-validated decision scores)
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve
y_scores = cross_val_predict(sgd_clf, X_train, y_train, cv=3, method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()
# Plot ROC curve (reusing the cross-validated scores)
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train, y_scores)
plot_roc_curve(fpr, tpr)
plt.show()
# recall and precision
from sklearn.metrics import precision_score, recall_score
precision_score(y_train, y_pred)
### Precision score: 0.5084427767354597
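(Note that this precision is computed on in-sample predictions; out-of-fold class predictions could be obtained the same way as the decision scores above, roughly like this:)
# Sketch: out-of-fold class predictions instead of in-sample ones
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, confusion_matrix
y_pred_oof = cross_val_predict(sgd_clf, X_train, y_train, cv=3)  # method="predict" by default
precision_score(y_train, y_pred_oof)
recall_score(y_train, y_pred_oof)
confusion_matrix(y_train, y_pred_oof)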
I then move on to a random forest classifier, which should improve on the SGD classifier:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3, method='predict_proba')
y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, threshold_forest = roc_curve(y_train, y_scores_forest)
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()
Indeed the ROC curve looks better:
But the confusion matrix and the precision score are extremely weird:
forest_clf.fit(X_train, y_train)
y_pred = forest_clf.predict(X_train)  # predictions on the same data the model was fit on
# f1 score
f1_score(y_train, y_pred)
# confusion matrix
from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt
disp = plot_confusion_matrix(forest_clf, X_train, y_train,
cmap=plt.cm.Blues,
normalize='true')
The F1 score is also 1. I do not understand what is going on here. I suspect I made a mistake, but the fact that the SGD classifier seems to work OK makes me think this is not about data cleaning.
Any idea of what might be going wrong?
UPDATE:
1) Confusion Matrix in absolute terms:
2) Reducing the threshold:
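(For reference, applying a lower cutoff to the cross-validated probabilities looks roughly like this; 0.3 is only an illustrative value, not the threshold used for the plot:)
# Sketch: classify as positive whenever the cross-validated probability exceeds a lower cutoff
from sklearn.metrics import confusion_matrix
threshold = 0.3  # illustrative value only
y_pred_lower = (y_scores_forest >= threshold).astype(int)
confusion_matrix(y_train, y_pred_lower)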
The reason you get a perfect score is that you are not computing your metrics on test data.
In the first code block you do the stratified 80/20 split into training and test data, but then all the metrics (ROC curves, confusion matrices, etc.) are computed on the training data instead of the test data.
With a setup like that, your reports will simply show that you are overfitting like crazy.
What you should do is apply the trained model to your test data and look at how it performs there.
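Roughly along these lines (a sketch reusing your variable names; the exact numbers will of course differ):
# Fit on the training set, evaluate on the held-out test set
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
forest_clf.fit(X_train, y_train)
y_test_pred = forest_clf.predict(X_test)
precision_score(y_test, y_test_pred)
recall_score(y_test, y_test_pred)
f1_score(y_test, y_test_pred)
confusion_matrix(y_test, y_test_pred)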