Tags: python, machine-learning, classification, cross-validation, shuffle

Why does shuffling training data for cross validation increase performance?


I am working on an unbalanced dataset, and I noticed that, strangely, if I shuffle the data during cross-validation I get a high f1 score, while if I do not shuffle it the f1 is low. Here is the function I use for cross-validation:

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix


def train_cross_v(md, df_train, n_folds=5, shuffl=False):
    # `variable` holds the name of the target column and is defined elsewhere
    X, y = df_train.drop([variable], axis=1), df_train[variable]

    cv = StratifiedKFold(n_splits=n_folds, shuffle=shuffl)

    scores = cross_val_score(md, X, y, scoring='f1', cv=cv, n_jobs=-1)

    y_pred = cross_val_predict(md, X, y, cv=cv, n_jobs=-1)
    print(' f1: ', scores, np.mean(scores))
    # note: confusion_matrix expects (y_true, y_pred), so this prints the transpose
    print(confusion_matrix(y_pred, y))
    return np.mean(scores)

With shuffling I get an f1 of around 0.82:

nfolds=5
train_cross_v(XGBClassifier(),df_train,n_folds=nfolds,shuffl=True)
f1:  [0.81469793 0.82076749 0.82726257 0.82379249 0.82484862] 0.8222738195197493
[[23677  2452]
 [ 1520  9126]]
0.8222738195197493

While not shuffling leads to:

nfolds=5
train_cross_v(XGBClassifier(),df_train,n_folds=nfolds,shuffl=False) 

f1:  [0.67447073 0.55084022 0.4166443  0.52759421 0.64819164] 0.5635482198057791
[[21621  5624]
 [ 3576  5954]]
0.5635482198057791

As I understand it, shuffling is preferred when assessing the real performance of a model, since it removes any dependence on the ordering of the data, and usually the value of the performance metric after shuffling is lower than the one without shuffling. In my case, however, the behavior is the exact opposite: I get a high value if I shuffle, while the predictions on the test set remain unchanged either way. What could be the problem here?
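(As a quick sketch of how to check whether the row order matters, reusing df_train and variable from above: train only on the first half of the rows and score on the second half. A large gap relative to the shuffled CV scores suggests the feature/target relationship changes along the data.)

from xgboost import XGBClassifier
from sklearn.metrics import f1_score

# Sketch of an order-dependence check (not part of the pipeline above):
# train on the first half of the rows, evaluate on the second half
X, y = df_train.drop([variable], axis=1), df_train[variable]
half = len(X) // 2
md = XGBClassifier().fit(X.iloc[:half], y.iloc[:half])
print('first half -> second half f1:',
      f1_score(y.iloc[half:], md.predict(X.iloc[half:])))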


Solution

  • Because the order of your data is important. Let's consider the following example:

    1. Suppose we have completely balanced labels:

    [0, 1, 0, 1, 0, 1, 0, 1, 0, ...]

    2. And a feature matrix that matches the labels exactly, one feature per example, i.e.:
    [[0],
     [1],
     [0],
     [1],
     ...]
    
    3. Suppose the first 25% of the data is noisy: the feature is inverted for those examples, so their labels are effectively incorrect:
    n_noisy = int(n_examples * 0.25)
    X[:n_noisy] = 1 - X[:n_noisy]
    

    So we have: [25% noisy, 25% normal, 25% normal, 25% normal]

    4. Now we are using 2-fold cross-validation (2 folds for simplicity).

    4.1 Without shuffling we will have the following metrics:

     f1:  [0.5 0. ] 0.25  # the metric for the second fold is zero
    

    The first fold is trained on the second half of the data ([25% normal, 25% normal]), which contains no noise, and tested on the first half ([25% noisy, 25% normal]), which is 50% noise; this yields f1 = 0.5.

    The second fold is trained on the first half of the data, half of which was inverted, so the learned mapping is wrong and f1 = 0.
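
    To see how the unshuffled folds line up with the row order, here is a minimal sketch (same alternating labels as in the source code below) that prints the index range each fold trains and tests on:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    y = np.tile([0, 1], 500)     # alternating labels, as in the example below
    X = np.zeros((len(y), 1))    # dummy features; only the split matters here
    cv = StratifiedKFold(n_splits=2, shuffle=False)
    for i, (tr, te) in enumerate(cv.split(X, y)):
        # without shuffling each fold is a contiguous block of rows:
        # fold 0 tests on the first half, fold 1 on the second half
        print(f"fold {i}: train [{tr.min()}..{tr.max()}], test [{te.min()}..{te.max()}]")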

    4.2 With shuffling:

     f1:  [0.74903475 0.75103734] 0.7500360467165447
    

    As expected, we get f1 ≈ 0.75, because 25% of the data is noise.

    Source code:

    from xgboost import XGBClassifier
    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_val_predict
    from sklearn.metrics import confusion_matrix
    
    
    def train_cross_v(md, X, y, n_folds=5, shuffl=False):
    
        cv = StratifiedKFold(n_splits=n_folds, shuffle=shuffl)
    
        scores = cross_val_score(md, X, y, scoring="f1", cv=cv, n_jobs=-1)
    
        y_pred = cross_val_predict(md, X, y, cv=cv, n_jobs=-1)
        print(" f1: ", scores, np.mean(scores))
        print(confusion_matrix(y_pred, y))
        return np.mean(scores)
    
    
    nfolds = 2
    n_examples = 1000

    # Perfectly balanced, alternating labels: [0, 1, 0, 1, ...]
    y = np.tile([0, 1], n_examples // 2)
    # A single feature that matches the label exactly
    X = y.copy().reshape(-1, 1)

    # Invert the feature for the first 25% of the rows, making those
    # examples effectively mislabeled
    n_noisy = int(n_examples * 0.25)
    X[:n_noisy] = 1 - X[:n_noisy]
    
    
    train_cross_v(XGBClassifier(), X, y, n_folds=nfolds, shuffl=False)
    train_cross_v(XGBClassifier(), X, y, n_folds=nfolds, shuffl=True)
    

    So the order matters, and shuffling can either increase or decrease performance.
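
    If you do shuffle, it is also worth fixing the seed so the folds (and therefore the scores) are reproducible across runs. Below is a minimal sketch, assuming the same StratifiedKFold usage as above; recent scikit-learn versions reject random_state when shuffle=False, hence the conditional:

    from sklearn.model_selection import StratifiedKFold

    # Sketch: reproducible shuffled folds. random_state is only meaningful
    # (and only accepted) together with shuffle=True.
    def make_cv(n_folds=5, shuffl=False, seed=0):
        return StratifiedKFold(n_splits=n_folds, shuffle=shuffl,
                               random_state=seed if shuffl else None)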