Tags: python, random-forest, confusion-matrix

random forest - "perfect" confusion matrix


I have a classification problem in which I would like to identify prospective borrowers who should not be invited for a meeting at the bank. In the data, about 25% of the borrowers should not be invited. I have around 4,500 observations and 86 features (many of them dummies).

After cleaning the data, I do:

# Separate features X and target y

X = ratings_prepared[:, :-1]
y = ratings_prepared[:, -1]

##################################################################################

# Separate test and train (stratified, 20% test)

import numpy as np
from sklearn.model_selection import StratifiedKFold

skfolds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# NOTE: each iteration overwrites the previous split, so only the
# last fold's 80/20 split survives the loop.
for train_index, test_index in skfolds.split(X, y):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
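
As an aside, if a single stratified 80/20 split is all that is needed, train_test_split expresses that directly; a minimal sketch, assuming the same X and y as above:

# Equivalent single stratified 80/20 split, assuming the same X and y:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)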

Then I proceed to training the models. An SGD classifier does not work very well:

import matplotlib.pyplot as plt

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center left")
    plt.ylim([0, 1])

############################# Train Models #############################

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)
y_pred = sgd_clf.predict(X_train)

# f1 score

f1_score(y_train, y_pred)

# confusion matrix

from sklearn.metrics import confusion_matrix, plot_confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()
(tn, fp, fn, tp)

disp = plot_confusion_matrix(sgd_clf, X_train, y_train,
                             cmap=plt.cm.Blues,
                             normalize='true')
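
(Side note: plot_confusion_matrix was deprecated in scikit-learn 1.0 and removed in 1.2. On recent versions the equivalent call, assuming the same sgd_clf, is:

# Equivalent plot on scikit-learn >= 1.0, where plot_confusion_matrix is gone:
from sklearn.metrics import ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay.from_estimator(sgd_clf, X_train, y_train,
                                             cmap=plt.cm.Blues,
                                             normalize='true')
)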

# Precision-recall curve (using cross-validated decision scores)

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve, roc_curve

y_scores = cross_val_predict(sgd_clf, X_train, y_train, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

# Plot ROC curve

fpr, tpr, thresholds = roc_curve(y_train, y_scores)

plot_roc_curve(fpr, tpr)
plt.show()

# recall and precision

from sklearn.metrics import precision_score, recall_score
precision_score(y_train, y_pred)
### Precision score: 0.5084427767354597

[Image: results from the SGD classifier]

I then move on to a random forest classifier, which should improve on the SGD:

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3,
                                    method='predict_proba')
y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, threshold_forest = roc_curve(y_train, y_scores_forest)

plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()
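
To put a number on the comparison, the area under each ROC curve can be computed from the same cross-validated scores (y_scores for the SGD, y_scores_forest for the forest); a minimal sketch:

# Quantify the ROC comparison with AUC, reusing the cross-validated scores:
from sklearn.metrics import roc_auc_score

print("SGD AUC:          ", roc_auc_score(y_train, y_scores))
print("Random forest AUC:", roc_auc_score(y_train, y_scores_forest))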

Indeed the ROC curve looks better:

[Image: ROC curves for the SGD and random forest classifiers]

But the confusion matrix and the precision score are extremely weird:

forest_clf.fit(X_train, y_train)
y_pred = forest_clf.predict(X_train)


# f1 score

f1_score(y_train, y_pred)

# confusion matrix

disp = plot_confusion_matrix(forest_clf, X_train, y_train,
                             cmap=plt.cm.Blues,
                             normalize='true')

[Image: confusion matrix for the random forest]

The F1 score is also 1. I do not understand what is going on here. I suspect I made a mistake, but the fact that the SGD classifier seems to work okay makes me think the problem is not in the data cleaning.

Any idea what might be going wrong?


UPDATE:

1) Confusion matrix in absolute terms:

[Image: confusion matrix with absolute counts]

2) Reducing the threshold:

[Image: confusion matrix after reducing the decision threshold]
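
For reference, reducing the threshold here just means classifying a borrower as positive whenever the cross-validated probability exceeds a cutoff below 0.5; a minimal sketch, reusing y_scores_forest from above with a purely illustrative cutoff of 0.3:

# Recompute the confusion matrix at a lower decision threshold;
# the 0.3 cutoff is illustrative, not a recommendation.
from sklearn.metrics import confusion_matrix

threshold = 0.3
y_pred_low = (y_scores_forest >= threshold).astype(int)
print(confusion_matrix(y_train, y_pred_low))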


Solution

  • The reason you get a perfect score is that you are not computing your metrics on test data.

    In the first code block you do an 80/20 split into training and test data, but then all the metrics (ROC curves, confusion matrices, etc.) are computed on the training data instead of the test data.

    A random forest can easily memorize its training set, so with a setup like that a perfect score just means you are overfitting like crazy, not that the model generalizes.

    What you should do is apply the trained model to your test data and look at how it does there.
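
    A minimal sketch of what that looks like, assuming the X_test/y_test split from the question:

    # Fit on the training fold only, then score on the held-out test fold:
    from sklearn.metrics import confusion_matrix, f1_score, precision_score

    forest_clf.fit(X_train, y_train)          # train on the training data only
    y_test_pred = forest_clf.predict(X_test)  # predict on data the model never saw

    print(confusion_matrix(y_test, y_test_pred))
    print("F1:       ", f1_score(y_test, y_test_pred))
    print("Precision:", precision_score(y_test, y_test_pred))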