I have a dataframe where each sample belongs to a group. For example:
df =
a  b  c  group
1  1  2  G1
1  6  1  G1
8  2  8  G3
2  8  7  G2
1  9  2  G2
1  7  2  G3
4  0  2  G4
1  5  1  G4
6  7  8  G5
3  3  7  G6
1  2  2  G6
1  0  5  G7
I want to run cross_val_predict with the data split into 4 folds, while ensuring that all rows from the same group stay together: either all of them in the test set or all of them in the train set.
So, for example, the G1 rows and the G2 rows could all be in the train set while both G3 rows are in the test set.
Is this possible? I saw the groups arg in the docs, but it is not very clear and I didn't find any examples.
Use GroupKFold as the cv parameter in cross_val_predict() and pass your group labels through the groups argument. The same works for cross_val_score():
scores = cross_val_score(model, X, y, groups=groups, cv=GroupKFold())
Note that the groups array marks which samples belong together and must stay on the same side of every train/test split. It is NOT an array of class labels.
For example:
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Toy data: 15 samples, each assigned to one of 4 groups
X, y = make_blobs(n_samples=15, random_state=0)
model = LogisticRegression()
groups = [0,0,0,1,1,1,1,2,2,2,2,3,3,3,3]

# groups must be passed by keyword; n_splits cannot exceed the number of distinct groups
scores = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(n_splits=3))
print('cross val scores: {}'.format(scores))
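The same pattern applied to cross_val_predict and the dataframe from the question might look like the sketch below. It assumes column c is the target and a, b are the features; the question doesn't say which is which, and LinearRegression is only a placeholder model. The loop over cv.split() is there just to check that no group lands on both sides of a split.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

# The dataframe from the question. Using "c" as the target is an assumption.
df = pd.DataFrame({
    "a": [1, 1, 8, 2, 1, 1, 4, 1, 6, 3, 1, 1],
    "b": [1, 6, 2, 8, 9, 7, 0, 5, 7, 3, 2, 0],
    "c": [2, 1, 8, 7, 2, 2, 2, 1, 8, 7, 2, 5],
    "group": ["G1", "G1", "G3", "G2", "G2", "G3",
              "G4", "G4", "G5", "G6", "G6", "G7"],
})
X = df[["a", "b"]]
y = df["c"]
groups = df["group"]

# 4 folds; this works because there are 7 distinct groups (>= n_splits)
cv = GroupKFold(n_splits=4)

# Sanity check: train and test never share a group in any fold
for train_idx, test_idx in cv.split(X, y, groups=groups):
    assert set(groups.iloc[train_idx]).isdisjoint(groups.iloc[test_idx])

# One out-of-fold prediction per row, in the original row order
preds = cross_val_predict(LinearRegression(), X, y, groups=groups, cv=cv)
print(preds)
```

GroupKFold assigns whole groups to folds, so the folds may not be exactly equal in size when the groups themselves differ in size.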