Tags: python, pandas, scikit-learn, sklearn-pandas

Stratify dataset while also avoiding contamination by Index?


As a reproducible example, I have the following data set:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.randint(0,20,size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])
df = df.set_index(['ID'])

df.head()
Out: 
     A   B   C   D
ID                
12   3  14   4   7
9    5   9   8   4
12  18  17   3  14
1    0  10   1   0
9   10   5  11   9

I need to perform a 70%-30% stratified split (on y), which I know would look like this:

# Train/Test Split
X = df.iloc[:,0:-1] # Columns A, B, and C
y = df.iloc[:,-1] # Column D
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70, test_size = 0.30, stratify = y)

However, although I want the training and testing sets to have the same (or at least similar) distribution of "D", I do not want the same "ID" to appear in both the training and the testing set.
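To see the problem concretely, a quick check (re-using the setup above, with fixed seeds added for reproducibility) shows that a plain stratified split almost always places the same ID on both sides:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(0)  # fixed seed so the check is reproducible
data = np.random.randint(0, 20, size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D']).set_index(['ID'])

X = df.iloc[:, 0:-1]  # Columns A, B, and C
y = df.iloc[:, -1]    # Column D
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.70, test_size=0.30, stratify=y, random_state=0)

# IDs that leak into both sets -- with only 20 distinct IDs spread over
# 300 rows, some overlap is essentially guaranteed
shared = set(X_train.index) & set(X_test.index)
print(f'{len(shared)} IDs appear in both train and test')
```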

How could I do this?


Solution

  • EDIT: A way to do (something like) what you asked is to group the IDs by class, then for each class put 70% of its IDs into the Train set and the rest into the Test set.

    Note that this still does not guarantee identical distributions if IDs occur a different number of times. Moreover, since each ID can appear with multiple classes in D yet must not be shared between the train and test sets, seeking identical distributions becomes a complex optimisation problem: assigning an ID to either set brings along all of its rows, and with them a variable mix of classes.
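    To illustrate why this coupling matters, the following check (on a freshly seeded copy of the sample data) counts how many distinct classes each ID occurs with; moving an ID between sets moves all of those class occurrences at once:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
data = np.random.randint(0, 20, size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])

# number of distinct D classes each ID occurs with; assigning an ID to
# train or test moves all of these class occurrences together
classes_per_id = df.groupby('ID')['D'].nunique()
print(classes_per_id.describe())  # most IDs span many distinct classes
```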

    A rather simpler way to split the data while approximating a balanced distribution is to iterate over the classes in random order, consider each ID only for the first class it appears with, and assign all of that ID's rows to train or test at once, removing the ID from consideration for later classes.

    I find that keeping the ID as a regular column helps for this task, so I changed your provided code as follows:

    # Given snippet (modified)
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    
    data = np.random.randint(0,20,size=(300, 5))
    df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])
    

    The proposed solution:

    import random
    from collections import defaultdict
    
    classes = df.D.unique().tolist() # get unique classes,
    random.shuffle(classes)          # shuffle to eliminate positional biases
    ids_by_class = defaultdict(list)
    
    
    # iterate over classes
    temp_df = df.copy()
    for c in classes:
        c_rows = temp_df.loc[temp_df['D'] == c] # rows with the given class
        ids = c_rows.ID.unique().tolist()       # IDs occurring in these rows
        ids_by_class[c].extend(ids)
    
        # remove ids so they cannot be taken into account for other classes
        temp_df = temp_df[~temp_df.ID.isin(ids)]
    
    
    # now construct ids split, class by class
    train_ids, test_ids = [], []
    for c, ids in ids_by_class.items():
        random.shuffle(ids) # shuffling can eliminate positional biases
    
        # split the IDs
        split = int(len(ids)*0.7) # split at 70%
    
        train_ids.extend(ids[:split])
        test_ids.extend(ids[split:])
    
    # finally use the ids in train and test to get the
    # data split from the original df
    train = df.loc[df['ID'].isin(train_ids)]
    test = df.loc[df['ID'].isin(test_ids)]
    
    

    Let's test that the split ratio roughly conforms to 70/30, that the data is preserved, and that no ID is shared between the train and test dataframes:

    # 1) check that elements in Train are roughly 70% and Test 30% of original df
    print(f'Numbers of elements in train: {len(train)}, test: {len(test)}| Perfect split would be train: {int(len(df)*0.7)}, test: {int(len(df)*0.3)}')
    
    # 2) check that concatenating Train and Test gives back the original df
    train_test = pd.concat([train, test]).sort_values(by=['ID', 'A', 'B', 'C', 'D']) # concatenate dataframes into one, and sort
    sorted_df = df.sort_values(by=['ID', 'A', 'B', 'C', 'D']) # sort original df
    assert train_test.equals(sorted_df) # check equality
    
    # 3) check that the IDs are not shared between train/test sets
    train_id_set = set(train.ID.unique().tolist())
    test_id_set = set(test.ID.unique().tolist())
    assert len(train_id_set.intersection(test_id_set)) == 0
    

    Sample Outputs:

    Numbers of elements in train: 209, test: 91| Perfect split would be train: 210, test: 90
    Numbers of elements in train: 210, test: 90| Perfect split would be train: 210, test: 90
    Numbers of elements in train: 227, test: 73| Perfect split would be train: 210, test: 90