python python-3.x scikit-learn cross-validation train-test-split

How to ensure all samples from a specific group stay together in train/test in sklearn cross_val_predict?

I have a dataframe where each sample belongs to a group. For example:

df = a b c group
     1 1 2  G1
     1 6 1  G1
     8 2 8  G3
     2 8 7  G2
     1 9 2  G2
     1 7 2  G3
     4 0 2  G4
     1 5 1  G4
     6 7 8  G5
     3 3 7  G6
     1 2 2  G6
     1 0 5  G7

I want to run cross_val_predict with 4 folds, while ensuring that all of the samples from the same group end up together: either all in the test set or all in the training set.

So, for example, rows 0, 1 (G1) and rows 3, 4 (G2) will be in the train, but rows 2, 5 (G3) will be in the test.

Is this possible? I saw the groups argument in the docs, but it is not very clear and I didn't find any examples.

Solution

Pass GroupKFold as the cv argument and the group labels as the groups argument. This works the same way for cross_val_predict() and cross_val_score():

scores = cross_val_score(model, X, y, groups=groups, cv=GroupKFold())

Note that the groups array holds one label per sample and identifies which samples must stay together in the same training or test set. It is NOT an array of class labels.

For example:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

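X, y = make_blobs(n_samples=15, random_state=0)

model = LogisticRegression()
# One group label per sample; samples sharing a label are never split across train and test.
groups = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
# Pass groups by keyword; recent scikit-learn versions reject it as a positional argument.
scores = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(n_splits=3))

print('cross val scores: {}'.format(scores))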
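
Applied to the dataframe from the question, the same idea with cross_val_predict and 4 folds might look like the sketch below. The question does not say which column is the target, so this assumes (purely for illustration) that a and b are the features and c is a numeric target, and it uses LinearRegression as a stand-in model. The loop at the end only checks that no group appears on both sides of any split.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

# Data copied from the question.
df = pd.DataFrame({
    "a": [1, 1, 8, 2, 1, 1, 4, 1, 6, 3, 1, 1],
    "b": [1, 6, 2, 8, 9, 7, 0, 5, 7, 3, 2, 0],
    "c": [2, 1, 8, 7, 2, 2, 2, 1, 8, 7, 2, 5],
    "group": ["G1", "G1", "G3", "G2", "G2", "G3",
              "G4", "G4", "G5", "G6", "G6", "G7"],
})

# Assumption: a and b are features, c is the target (not stated in the question).
X = df[["a", "b"]]
y = df["c"]
groups = df["group"]

cv = GroupKFold(n_splits=4)

# Each row is predicted by a model that never saw any row of its own group.
preds = cross_val_predict(LinearRegression(), X, y, groups=groups, cv=cv)
print(preds)

# Sanity check: train and test groups are disjoint in every fold.
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups=groups)):
    train_groups = set(groups.iloc[train_idx])
    test_groups = set(groups.iloc[test_idx])
    assert train_groups.isdisjoint(test_groups)
    print("fold", fold, "test groups:", sorted(test_groups))
```

Keep in mind that GroupKFold needs at least as many distinct groups as folds. The data above has 7 groups, so n_splits=4 is fine.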