Tags: python, scikit-learn, random-forest, cross-validation, imbalanced-data

F-Score difference between cross_val_score and StratifiedKFold


I want to use a RandomForestClassifier on imbalanced data, where X is a np.array of features and y is a np.array of labels (90% 0-values, 10% 1-values). As I was not sure how to stratify within cross-validation, and whether it makes a difference, I also cross-validated manually with StratifiedKFold. I would not expect identical results, but I would expect them to be similar. Since that is not the case, I suspect I am using one of the two methods incorrectly, but I don't understand which one. Here is the code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import f1_score

rfc = RandomForestClassifier(n_estimators=200,
                             criterion="gini",
                             max_depth=None,
                             min_samples_leaf=1,
                             max_features="sqrt",  # "auto" (equivalent to "sqrt" for classifiers) was removed in scikit-learn 1.3
                             random_state=42,
                             class_weight="balanced")

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42, stratify=y)
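To make the snippets runnable end to end, X and y (which are not shown here) can be replaced with a synthetic imbalanced dataset; the sizes and parameters below are illustrative assumptions:

# Illustrative stand-in for the real X and y (not shown in the question):
# a binary problem with roughly 90% 0-labels and 10% 1-labels.
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000,
                           n_features=20,
                           weights=[0.9, 0.1],  # ~90% class 0, ~10% class 1
                           random_state=42)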

I also tried the classifier without the class_weight argument. From here I proceed to compare both methods using the f1-score:

cv = cross_val_score(estimator=rfc,
                     X=X_train_val,
                     y=y_train_val,
                     cv=10,
                     scoring="f1")
print(cv)

The 10 f1-scores from cross-validation are all around 65%. Now the manual StratifiedKFold:

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42) 
for train_index, test_index in skf.split(X_train_val, y_train_val):
    X_train, X_val = X_train_val[train_index], X_train_val[test_index]
    y_train, y_val = y_train_val[train_index], y_train_val[test_index]
    rfc.fit(X_train, y_train)
    rfc_predictions = rfc.predict(X_val)
    print("F1-Score: ", round(f1_score(y_val, rfc_predictions),3))

The 10 f1-scores from StratifiedKFold are all around 90%. This is where I get confused, as I don't understand the large deviation between the two methods. If I simply fit the classifier to the training data and apply it to the test data, I also get f1-scores of around 90%, which leads me to believe that my use of cross_val_score is not correct.


Solution

  • One possible reason for the difference is that when cv is an integer, cross_val_score uses StratifiedKFold with the default shuffle=False, whereas in your manual cross-validation you created StratifiedKFold with shuffle=True. It could therefore just be an artifact of the way your data is ordered that cross-validating without shuffling produces worse F1 scores.

    Try passing shuffle=False when creating the skf instance to see whether the scores then match cross_val_score. If they do, and you want shuffling with cross_val_score, either shuffle the training data manually beforehand or pass the shuffled splitter itself as the cv argument, as in the sketch below.
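
    As a sketch (assuming scikit-learn's standard API, with rfc, X_train_val and y_train_val as defined above), the cleanest way to make both pipelines use identical splits is to pass the splitter directly to cross_val_score:

    from sklearn.model_selection import StratifiedKFold, check_cv, cross_val_score

    # check_cv shows which splitter an integer cv resolves to for a classifier:
    # here StratifiedKFold(n_splits=10), with the default shuffle=False.
    print(check_cv(cv=10, y=y_train_val, classifier=True))

    # Passing the shuffled splitter makes cross_val_score use exactly the
    # same folds as the manual loop, so the scores should now agree.
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    cv_scores = cross_val_score(estimator=rfc,
                                X=X_train_val,
                                y=y_train_val,
                                cv=skf,
                                scoring="f1")
    print(cv_scores)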