scikit-learn, train-test-split, pycaret

Train/test splits in PyCaret using a column to group rows that must stay in the same split


My dataset contains a column that I need to use for a group-wise split: rows belonging to the same group must not be divided between train and test, but sent as a whole to one of the splits. How can I do this with PyCaret?

10-row sample for clarification:

group_id    measure1    measure2    measure3
    1          3455        3425       345
    1          6455         825       945
    1          6444         225       145
    2            23          34       233
    2           623          22       888
    3          3455        3425       345
    3          6155         525       645
    3          6434         325       845
    4            93         345       233
    4           693         222       808

Every unique group_id should go to exactly one split in full, like this (using 80/20):

TRAIN SET:
   
 group_id    measure1    measure2    measure3
        1          3455        3425       345
        1          6455         825       945
        1          6444         225       145
        3          3455        3425       345
        3          6155         525       645
        3          6434         325       845
        4            93         345       233
        4           693         222       808

TEST SET:

 group_id    measure1    measure2    measure3
        2            23          34       233
        2           623          22       888

Solution

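  • You can try the following setup() parameters, per the documentation:

    https://pycaret.readthedocs.io/en/latest/api/classification.html

    fold_strategy = "groupkfold"
    fold_groups = "group_id"

    As far as the documentation describes, these control how the cross-validation folds are built on the training data; they do not by themselves change the initial train/test split.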
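
For the train/test split itself, one option is to build it outside PyCaret with scikit-learn's GroupShuffleSplit and hand the held-out rows to setup() through its test_data parameter. Below is a minimal sketch on the sample from the question. The PyCaret call is left commented out, and the target column name "target" in it is hypothetical, since the sample shows no label column. Which group lands in the test set depends on random_state.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# The 10-row sample from the question.
df = pd.DataFrame({
    "group_id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "measure1": [3455, 6455, 6444, 23, 623, 3455, 6155, 6434, 93, 693],
    "measure2": [3425, 825, 225, 34, 22, 3425, 525, 325, 345, 222],
    "measure3": [345, 945, 145, 233, 888, 345, 645, 845, 233, 808],
})

# test_size is a fraction of the *groups*, not of the rows,
# so 0.2 of 4 groups puts one whole group in the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["group_id"]))

# Keep the original row labels so train and test indices do not collide.
train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# Sanity check: no group_id appears on both sides of the split.
shared_groups = set(train_df["group_id"]) & set(test_df["group_id"])
print("groups in train:", sorted(train_df["group_id"].unique()))
print("groups in test: ", sorted(test_df["group_id"].unique()))
print("shared groups:  ", shared_groups)

# Assumed PyCaret usage (not executed here; "target" is a hypothetical
# label column that the sample data does not contain):
# from pycaret.classification import setup
# setup(data=train_df, test_data=test_df, target="target",
#       fold_strategy="groupkfold", fold_groups="group_id")
```

Because the split is made per group, the row-level ratio is only approximately 80/20 when the groups differ in size.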