Tags: python, scikit-learn, cross-validation, k-fold

Split k-fold where each fold of validation data doesn't include duplicates


Let's say I have a pandas DataFrame df with 1,000 rows, like below.

print(df)

                    id    class
0      0000799a2b2c42d       0
1      00042890562ff68       0
2      0005364cdcb8e5b       0
3      0007a5a46901c56       0
4      0009283e145448e       0
...                ...     ...
995    04309a8361c5a9e       0
996    0430bde854b470e       0
997    0431c56b712b9a5       1
998    043580af9803e8c       0
999    043733a88bfde0c       0

It has 950 rows of class 0 and 50 rows of class 1.

Now I want to add one more column, fold, like below.

                    id    class  fold
0      0000799a2b2c42d       0     0
1      00042890562ff68       0     0
2      0005364cdcb8e5b       0     0
3      0007a5a46901c56       0     0
4      0009283e145448e       0     0
...                ...     ...   ...
995    04309a8361c5a9e       0     4
996    0430bde854b470e       0     4
997    0431c56b712b9a5       1     4
998    043580af9803e8c       0     4
999    043733a88bfde0c       0     4

where the fold column contains 5 folds (0, 1, 2, 3, 4). Each fold has 200 rows, 190 of class 0 and 10 of class 1 (that is, preserving the percentage of samples for each class).

I've tried StratifiedShuffleSplit from sklearn.model_selection, like below.

sss = StratifiedShuffleSplit(n_splits=5, random_state=2021, test_size=0.2)
for _, val_index in sss.split(df['id'], df['class']):
    ....

I then treat each val_index as one fold, but the same rows end up appearing in more than one val_index.

Can someone help me?


Solution

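  • What you need is k-fold cross-validation, not a repeated train/test split. StratifiedShuffleSplit draws each split independently, so the same row can appear in several validation sets; StratifiedKFold assigns every row to exactly one fold. For example, suppose your dataset looks like this: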

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    
    np.random.seed(12345)
    df = pd.DataFrame({'id': np.random.randint(1, 100_000, 1000),
                       'class': np.random.binomial(1, 0.1, 1000)})
    df['fold'] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
    

    Then iterate over the StratifiedKFold splits, as you did, and assign the fold number to the rows in each test (validation) index:

    skf = StratifiedKFold(n_splits=5, shuffle=True)
    for fold, (train, test) in enumerate(skf.split(df, df['class'])):
        # test holds positional indices; .loc works here because df has the default RangeIndex
        df.loc[test, "fold"] = fold
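
To check that these folds do not overlap, one can count how often each row lands in a validation set, for both StratifiedKFold and the StratifiedShuffleSplit from the question. This is a minimal sketch on synthetic labels with the question's 950/50 class balance; the helper name validation_counts and the random_state value are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Synthetic labels with the question's class balance: 950 zeros, 50 ones.
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((len(y), 1))  # feature values do not matter for splitting


def validation_counts(splitter):
    """Count how many times each row is used as validation data."""
    counts = np.zeros(len(y), dtype=int)
    for _, val_index in splitter.split(X, y):
        counts[val_index] += 1
    return counts


sss_counts = validation_counts(
    StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=2021))
skf_counts = validation_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=2021))

# K-fold: min and max are both 1, so every row is validated exactly once.
print("StratifiedKFold:       ", skf_counts.min(), skf_counts.max())
# Shuffle split: independent draws, so the counts are spread unevenly.
print("StratifiedShuffleSplit:", sss_counts.min(), sss_counts.max())
```

Both splitters hand out 1,000 validation slots in total (5 splits of 200 rows), but only StratifiedKFold guarantees that each row gets exactly one of them.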
    

    End product:

    pd.crosstab(df['fold'],df['class'])
    
    class    0   1
    fold          
    0.0    182  18
    1.0    182  18
    2.0    182  18
    3.0    182  18
    4.0    181  19
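
The counts above are 182/18 rather than 190/10 only because the example data is random (91 of the 1,000 rows happen to be class 1). With the question's exact 950/50 split, the same approach should give 190 and 10 per fold. A sketch under stated assumptions: the ids are made-up placeholders, the fold column is initialised to -1 instead of NaN so it stays an integer, and random_state is fixed only for repeatability.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for the question's frame: 950 rows of class 0, 50 of class 1.
df = pd.DataFrame({
    "id": [f"id_{i:04d}" for i in range(1000)],  # placeholder ids
    "class": [0] * 950 + [1] * 50,
})

# Integer fold column; -1 marks "not assigned yet".
df["fold"] = -1

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
fold_col = df.columns.get_loc("fold")
for fold, (_, val_index) in enumerate(skf.split(df, df["class"])):
    # split() yields positional indices, so assign by position with iloc.
    df.iloc[val_index, fold_col] = fold

fold_counts = pd.crosstab(df["fold"], df["class"])
print(fold_counts)
```

Assigning with iloc rather than loc keeps the loop correct even if df does not have the default 0..n-1 index, for example after filtering rows.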