python pandas train-test-split

How to split a dataset to train/test where some rows are dependent?


I have a data set of subjects, and each subject has several rows in my pandas dataframe (each measurement is a row, and a subject can be measured more than once). I would like to split the data into a training set and a test set, but I cannot split randomly by row because all of a subject's measurements are dependent (the same subject must not appear in both train and test). How would you resolve this? Each subject has a different number of measurements.

Edit: My data includes the subject number for each row and I would like to split as close to 0.8/0.2 as possible.


Solution

  • Consider the dataframe df with column user_id to identify users.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        np.random.randint(5, size=(100, 4)), columns=['user_id'] + list('ABC')
    )
    

    Identify the unique users, shuffle them, and pick 80% of them for training. Then split the dataframe so that all rows of a training user go to one set and all rows of a test user go to the other.

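    unique_users = df['user_id'].unique()
    # Shuffle the users, then cut the shuffled array at the 80% mark
    train_users, test_users = np.split(
        np.random.permutation(unique_users), [int(.8 * len(unique_users))]
    )

    # Every row follows its user into exactly one of the two sets
    df_train = df[df['user_id'].isin(train_users)]
    df_test = df[df['user_id'].isin(test_users)]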
    

    This splits the users 80/20. The rows are split only roughly 80/20, because each user contributes a different number of rows.
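
To see how far the row split drifts from 80/20, the user-level split above can be checked directly. The following is a minimal sketch, not part of the original answer: the sample data (about 50 users, 1000 rows), the fixed seed, and the names `overlap` and `train_row_fraction` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility

# Hypothetical data: up to 50 users, each with a randomly varying number of rows.
df = pd.DataFrame(
    {"user_id": rng.integers(0, 50, size=1000), "A": rng.normal(size=1000)}
)

# Same user-level split as above: shuffle the users, cut at the 80% mark.
unique_users = df["user_id"].unique()
train_users, test_users = np.split(
    rng.permutation(unique_users), [int(0.8 * len(unique_users))]
)

df_train = df[df["user_id"].isin(train_users)]
df_test = df[df["user_id"].isin(test_users)]

# No user may appear on both sides of the split.
overlap = set(df_train["user_id"]) & set(df_test["user_id"])

# Fraction of rows (not users) that ended up in the training set.
train_row_fraction = len(df_train) / len(df)
print(f"overlapping users: {len(overlap)}, train row fraction: {train_row_fraction:.3f}")
```

The overlap is always empty by construction. The row fraction lands near 0.8 only when users have similar numbers of measurements.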


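    However, if you want the row counts to be as close to 80/20 as possible, add users to the training set incrementally: shuffle the users, then keep adding them while the cumulative number of rows stays at or below 80% of the data.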

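    unique_users = df['user_id'].unique()
    target_n = int(.8 * len(df))  # desired number of training rows
    shuffled_users = np.random.permutation(unique_users)

    # Number of rows per user
    user_count = df['user_id'].value_counts()

    # True for users whose cumulative row count (in shuffled order) fits in target_n
    mapping = user_count.reindex(shuffled_users).cumsum() <= target_n
    # Row-level mask: each row inherits its user's train/test assignment
    mask = df['user_id'].map(mapping)

    df_train = df[mask]
    df_test = df[~mask]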
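
The incremental approach can be wrapped in a small reusable function. This is only a sketch under stated assumptions: the helper name `split_by_group`, its parameters, and the seeded generator are hypothetical and not part of the original answer. The logic is the same cumulative-sum idea as above.

```python
import numpy as np
import pandas as pd


def split_by_group(df, group_col, train_frac=0.8, seed=None):
    """Hypothetical helper: split rows so that no group spans both sets.

    Groups are shuffled and added to the training set while the cumulative
    row count stays at or below train_frac * len(df).
    """
    rng = np.random.default_rng(seed)
    counts = df[group_col].value_counts()            # rows per group
    shuffled = rng.permutation(counts.index.to_numpy())
    target_n = int(train_frac * len(df))             # desired training rows
    in_train = counts.reindex(shuffled).cumsum() <= target_n
    mask = df[group_col].map(in_train).astype(bool)  # row-level assignment
    return df[mask], df[~mask]


# Usage on made-up data (an assumption for illustration).
rng = np.random.default_rng(1)
df = pd.DataFrame(
    {"user_id": rng.integers(0, 30, size=500), "A": rng.normal(size=500)}
)
df_train, df_test = split_by_group(df, "user_id", train_frac=0.8, seed=42)
print(len(df_train) / len(df))
```

Because users are only added while the running total fits, the training set never exceeds the target and usually falls slightly short of it. If the first shuffled user alone has more rows than the target, the training set comes out empty, so it is worth checking the resulting sizes.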