python, scikit-learn, cross-validation, shuffle, k-fold

What is the difference between a "normal" k-fold cross-validation using shuffle=True and a repeated k-fold cross-validation?


Could anyone explain the difference between a "normal" k-fold cross-validation with shuffling enabled, e.g.

kf = KFold(n_splits = 5, shuffle = True)

and a repeated k-fold cross-validation? Shouldn't they return the same results?

Having a hard time understanding the difference.

Any hint is appreciated.


Solution

  • As its name says, RepeatedKFold is a repeated KFold: it runs KFold n_repeats times, reshuffling the data for each repetition. With n_repeats=1 it behaves exactly like KFold with shuffle=True. You do not get the same splits from the two because random_state=None by default and you did not set it, so each splitter draws a different seed for its (pseudo-)random shuffle.

    When both are given the same random_state and RepeatedKFold is repeated only once, they produce the same splits. For a deeper understanding, try the following:

    import pandas as pd
    from sklearn.model_selection import KFold, RepeatedKFold
                         
    data = pd.DataFrame([['red', 'strawberry'], # color, fruit
                      ['red', 'strawberry'], 
                      ['red', 'strawberry'],
                      ['red', 'strawberry'],
                      ['red', 'strawberry'],
                      ['yellow', 'banana'],
                      ['yellow', 'banana'],
                      ['yellow', 'banana'],
                      ['yellow', 'banana'],
                      ['yellow', 'banana']])
    
    X = data[0]
    
    # KFold
    for train_index, test_index in KFold(n_splits=2, shuffle=True, random_state=1).split(X):
        print("TRAIN:", train_index, "TEST:", test_index)
    
    # RepeatedKFold
    for train_index, test_index in RepeatedKFold(n_splits=2, n_repeats=1, random_state=1).split(X):
        print("TRAIN:", train_index, "TEST:", test_index)
    

    You should obtain the following:

    TRAIN: [1 3 5 7 8] TEST: [0 2 4 6 9]
    TRAIN: [0 2 4 6 9] TEST: [1 3 5 7 8]
    
    TRAIN: [1 3 5 7 8] TEST: [0 2 4 6 9]
    TRAIN: [0 2 4 6 9] TEST: [1 3 5 7 8]
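
The difference only shows up once n_repeats is greater than 1. The sketch below is a minimal illustration, not part of the original answer: the variable names (kf_splits, rkf_splits, seed) and the use of a plain NumPy array as dummy data are assumptions made for the example. It collects the splits from both splitters so that they can be compared directly.

```python
import numpy as np
from sklearn.model_selection import KFold, RepeatedKFold

# Ten dummy samples (assumption: only the number of rows matters for the splitters)
X = np.arange(10)
n_splits, n_repeats, seed = 2, 3, 1

# Shuffled KFold: one random partition of the data into n_splits folds
kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
kf_splits = [(train.tolist(), test.tolist()) for train, test in kf.split(X)]

# RepeatedKFold: n_repeats random partitions, each with n_splits folds
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
rkf_splits = [(train.tolist(), test.tolist()) for train, test in rkf.split(X)]

# Print the test indices grouped by repetition
for i, (train, test) in enumerate(rkf_splits):
    print(f"repeat {i // n_splits}, fold {i % n_splits}, TEST: {test}")

print("KFold splits:", len(kf_splits))
print("RepeatedKFold splits:", len(rkf_splits))
```

KFold yields n_splits train/test pairs, while RepeatedKFold yields n_splits * n_repeats of them. Within each repetition the test folds still cover every sample exactly once; later repetitions simply continue from the same random stream, so they usually partition the data differently. Averaging a score over all repetitions is the usual reason to prefer RepeatedKFold, since it reduces the dependence of the estimate on one particular shuffle.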