I would like to split my dataset into train and test sets in an 80:20 ratio. However, I do not want the split to place some rows of a given S_Id in the train set and the rest in the test set.
I have a dataset like this:
S_Id Datetime Item
1 29-06-2018 03:23:00 654
1 29-06-2018 04:01:00 452
1 29-06-2018 04:25:00 101
2 30-06-2018 05:17:00 088
2 30-06-2018 05:43:00 131
3 30-06-2018 10:36:00 013
3 30-06-2018 11:19:00 092
I would like to split it neatly into something like:
Train:
S_Id Datetime Item
1 29-06-2018 03:23:00 654
1 29-06-2018 04:01:00 452
1 29-06-2018 04:25:00 101
2 30-06-2018 05:17:00 088
2 30-06-2018 05:43:00 131
Test:
S_Id Datetime Item
3 30-06-2018 10:36:00 013
3 30-06-2018 11:19:00 092
All rows with the same S_Id must end up in the same set. Can this be done with a simple 'groupby'?
Thank you for your help!
I don't believe there is a built-in function that does exactly this, so you can write a custom one:
import numpy as np

def sample_(we_array, train_size):
    """
    we_array : the weight of each unique element in your S_Id column,
    normalized so it represents a probability
    """
    idx = np.arange(we_array.size)  # one index per unique element
    np.random.shuffle(idx)          # shuffle the group order
    cum = we_array[idx].cumsum()    # running fraction of the data covered
    train_idx = idx[cum < train_size]   # take groups until we reach the
                                        # desired train size
    test_idx = idx[cum >= train_size]   # the remaining groups form the test set
    return train_idx, test_idx
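Then apply it to your S_Id column. Note that the group which crosses the train_size threshold is sent to the test set, so the realized train fraction is at most train_size and, with only a few groups, the split can deviate noticeably from exactly 80:20.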
idx = df.S_Id.values
unique, counts = np.unique(idx, return_counts=True)  # the unique S_Ids
                                                     # and their counts
probability = counts / counts.sum()  # each S_Id's share of the rows
train_idx, test_idx = sample_(probability, 0.8)
train = df[df.S_Id.isin(unique[train_idx])]
test = df[df.S_Id.isin(unique[test_idx])]
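For reference, here is a minimal, self-contained run on the sample data from your question; it assumes the sample_ function above is in scope, and the seed is only there to make the shuffle reproducible:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'S_Id': [1, 1, 1, 2, 2, 3, 3],
    'Datetime': pd.to_datetime(
        ['29-06-2018 03:23:00', '29-06-2018 04:01:00', '29-06-2018 04:25:00',
         '30-06-2018 05:17:00', '30-06-2018 05:43:00', '30-06-2018 10:36:00',
         '30-06-2018 11:19:00'], dayfirst=True),
    'Item': ['654', '452', '101', '088', '131', '013', '092'],
})

np.random.seed(42)  # only so this example is reproducible

idx = df.S_Id.values
unique, counts = np.unique(idx, return_counts=True)
train_idx, test_idx = sample_(counts / counts.sum(), 0.8)
train = df[df.S_Id.isin(unique[train_idx])]
test = df[df.S_Id.isin(unique[test_idx])]
print(sorted(train.S_Id.unique()), sorted(test.S_Id.unique()))

With group weights of 3/7, 2/7 and 2/7, the cumulative sum only reaches 0.8 on the last group, so train always receives two of the three S_Ids (4 or 5 of the 7 rows), and every S_Id lands wholly in one set.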