Search code examples
pythonpandastrain-test-split

How to split data into train and test keeping in mind the groupby column in pandas?


I would like to split the data set into test and train dataset in the ratio 20:80. However, while splitting, I do not want to split in a manner that 1 S_Id value has few data points in train and other data points in test.

I have a dataset as:

S_Id      Datetime               Item      
1         29-06-2018 03:23:00    654
1         29-06-2018 04:01:00    452
1         29-06-2018 04:25:00    101
2         30-06-2018 05:17:00    088
2         30-06-2018 05:43:00    131
3         30-06-2018 10:36:00    013
3         30-06-2018 11:19:00    092

I would like to split neatly as something like: Train:

S_Id      Datetime               Item      
1         29-06-2018 03:23:00    654
1         29-06-2018 04:01:00    452
1         29-06-2018 04:25:00    101
2         30-06-2018 05:17:00    088
2         30-06-2018 05:43:00    131

Test:

S_Id      Datetime               Item 
3         30-06-2018 10:36:00    013
3         30-06-2018 11:19:00    092

All same S_Ids must be in one set. Can it be done through simple 'groupby'?

Thank you for your help!


Solution

  • I don't believe there is a direct function that does so, so you could write a customized one :

    def sample_(we_array, train_size):
        """
         we_array : used as the weight of each unique element on your S_id column, 
         it's normalized to represent a probability
    
        """
        idx = np.arange(we_array.size) #get the index for each element on the array
        np.random.shuffle(idx) #shuffle it 
        cum = we_array[idx].cumsum()  
        train_idx = idx[cum<train_size]# we take the first elements until we have  
                                       # our desired size
        test_idx = idx[cum>=train_size]
        return train_idx, test_idx
    
    idx = df.S_Id.values
    unique, counts = np.unique(idx, return_counts = True) # we access the unique 
                                                          # elements and their cout
    probability = counts/counts.sum()
    train_idx, test_idx = sample_(probability, 0.8)
    train = df[df.S_Id.isin(unique[train_idx])] 
    test = df[df.S_Id.isin(unique[test_idx])]