Let's say I have a pandas DataFrame df that contains 1,000 rows, like below.
print(df)
                  id  class
0    0000799a2b2c42d      0
1    00042890562ff68      0
2    0005364cdcb8e5b      0
3    0007a5a46901c56      0
4    0009283e145448e      0
..               ...    ...
995  04309a8361c5a9e      0
996  0430bde854b470e      0
997  0431c56b712b9a5      1
998  043580af9803e8c      0
999  043733a88bfde0c      0
It has 950 rows of class 0 and 50 rows of class 1.
Now I want to add one more column, fold, like below.
                  id  class  fold
0    0000799a2b2c42d      0     0
1    00042890562ff68      0     0
2    0005364cdcb8e5b      0     0
3    0007a5a46901c56      0     0
4    0009283e145448e      0     0
..               ...    ...   ...
995  04309a8361c5a9e      0     4
996  0430bde854b470e      0     4
997  0431c56b712b9a5      1     4
998  043580af9803e8c      0     4
999  043733a88bfde0c      0     4
where the fold column contains 5 folds (0, 1, 2, 3, 4). Each fold has 200 rows, 190 of class 0 and 10 of class 1 (that is, preserving the percentage of samples for each class).
I've tried StratifiedShuffleSplit from sklearn.model_selection, like below.
sss = StratifiedShuffleSplit(n_splits=5, random_state=2021, test_size=0.2)
for _, val_index in sss.split(df.id, df['class']):
    ....
Then I treat each val_index as one specific fold, but I end up with duplicate rows across the val_index lists.
Can someone help me?
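What you need is k-fold cross-validation, not repeated train/test splits. StratifiedShuffleSplit draws each split independently, so its test sets can overlap; StratifiedKFold partitions the rows so that each one lands in exactly one test fold. For example, suppose your dataset looks like this: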
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
np.random.seed(12345)
df = pd.DataFrame({'id': np.random.randint(1, 100000, 1000),
                   'class': np.random.binomial(1, 0.1, 1000)})
df['fold'] = np.nan  # placeholder, filled in below (the np.NaN alias was removed in NumPy 2.0)
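As a side note on the duplicates seen in the question, here is a minimal sketch of why StratifiedShuffleSplit repeats rows. It uses hypothetical labels with the question's 950/50 balance and a placeholder feature array (both are assumptions for illustration, not the real data), and simply counts how many distinct rows appear across the five test sets.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels matching the question: 950 rows of class 0, 50 of class 1.
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the stratification

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=2021)
test_sets = [test_idx for _, test_idx in sss.split(X, y)]

# Each test set is drawn independently, so a row can land in several of them.
all_test = np.concatenate(test_sets)
n_unique = len(np.unique(all_test))
print(len(all_test), n_unique)  # total rows drawn vs. distinct rows drawn
```

If the second number is smaller than the first, some rows were drawn more than once, which is why these test sets cannot serve as a fold assignment.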
Create the StratifiedKFold, iterate over its splits as you did, and assign the fold number to the rows of each test set:
skf = StratifiedKFold(n_splits=5, shuffle=True)
for fold, (train, test) in enumerate(skf.split(df, df['class'])):
    # test holds row positions; .loc works here because df has the default RangeIndex
    df.loc[test, "fold"] = fold
End product:
pd.crosstab(df['fold'],df['class'])
class    0   1
fold
0.0    182  18
1.0    182  18
2.0    182  18
3.0    182  18
4.0    181  19
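The random sample above has roughly 9% of class 1, so its folds do not show the 190/10 split from the question. The following sketch applies the same approach to hypothetical data with exactly 950 rows of class 0 and 50 of class 1. The frame name demo, the integer ids, the -1 placeholder and the random_state value are assumptions for illustration. It assigns by position with iloc, which should also work when the DataFrame does not have the default index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical data with the question's class balance: 950 zeros, 50 ones.
demo = pd.DataFrame({
    "id": np.arange(1000),
    "class": [0] * 950 + [1] * 50,
})

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
demo["fold"] = -1  # integer placeholder, so the column keeps an integer dtype
fold_col = demo.columns.get_loc("fold")
for fold, (_, test_idx) in enumerate(skf.split(demo, demo["class"])):
    # test_idx holds row positions, so assign by position rather than by label
    demo.iloc[test_idx, fold_col] = fold

counts = pd.crosstab(demo["fold"], demo["class"])
print(counts)
```

Because the test folds of StratifiedKFold partition the rows, every row receives exactly one fold number and no -1 placeholder should remain.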