python, cross-validation, k-fold

Does StratifiedKFold produce the same splits each time the for loop is run?


I use StratifiedKFold and a form of grid search for my Logistic Regression.

skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)

I call this for loop for each combination of parameters:

for fold, (trn_idx, test_idx) in enumerate(skf.split(X, y)):

My question is, are trn_idx and test_idx the same for each fold every time I run the loop?

For example, if fold0 contains trn_idx = [1,2,5,7,8] and test_idx = [3,4,6], is fold0 going to contain the same trn_idx and test_idx the next 5 times I run the loop?


Solution

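  • Yes. With shuffle=True and random_state=SEED fixed to an integer, StratifiedKFold produces the same split every time split() is called. The shuffle only reorders the samples within each class before they are assigned to folds, and that reordering is fully determined by the seed.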

    This means that each fold always gets the same indices:

    
    x = list(range(10))
    y = [1]*5 + [2]*5
    
    from sklearn.model_selection import StratifiedKFold
    
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    
    for fold, (trn_idx, test_idx) in enumerate(skf.split(x, y)):
        print(trn_idx, test_idx)
    
    

    Output:

    [1 2 4 5 7 9] [0 3 6 8]
    [0 1 3 5 6 8 9] [2 4 7]
    [0 2 3 4 6 7 8] [1 5 9]
    
    

    The output is the same no matter how many times I run this code.
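
To check this directly for the grid-search case in the question, here is a minimal sketch that iterates over the same splitter several times and compares the folds. The helper name collect_folds and the variable names are made up for illustration; the data is the same toy data as above, and the seed is assumed to be a plain integer.

```python
from sklearn.model_selection import StratifiedKFold

# Same toy data as in the example above.
X = list(range(10))
y = [1] * 5 + [2] * 5


def collect_folds(splitter, X, y):
    """Hypothetical helper: one full pass over split(), indices as plain lists."""
    return [(trn.tolist(), tst.tolist()) for trn, tst in splitter.split(X, y)]


skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Simulate running the for loop once per parameter combination.
passes = [collect_folds(skf, X, y) for _ in range(5)]

# True if every pass produced exactly the same train/test indices per fold.
same_every_pass = all(p == passes[0] for p in passes)
print(same_every_pass)
```

This relies on random_state being an integer. If a numpy RandomState instance (or None) is passed instead, repeated calls to split() are not expected to give identical folds, so for a grid search that should compare parameter combinations on the same folds, an integer seed is the safer choice.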