Tags: python, pandas, dataset

How to split a DataFrame into 2 separate datasets with 70%/30% of unique ids, taking all rows for each id, in Python Pandas?


I have a DataFrame in Python Pandas like below:

Input data:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id' : [999, 999, 999, 185, 185, 185, 44, 44, 44],
    'target' : [1, 1, 1, 0, 0, 0, 1, 1, 1],
    'event_date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'event1': [1, 6, 11, 16, np.nan, 22, 74, 109, 52],
    'event2': [2, 7, np.nan, 17, 22, np.nan, np.nan, 10, 5],
    'event3': [3, 8, 13, 18, 23, np.nan, 2, np.nan, 99],
    'event4': [4, 9, np.nan, np.nan, np.nan, 11, 8, np.nan, np.nan],
    'event5': [5, np.nan, 15, 20, 25, 1, 1, 3, np.nan]
})

# Fill missing values with zeros
df = df.fillna(0)
df


Requirements:

My real dataset of course contains much more data, but I need to split it into 2 separate datasets (train and test) based on the following requirements:

  1. For the train dataset I need to take 70% of the unique ids from my input dataset
  2. For the test dataset I need to take the remaining 30% of the unique ids from my input dataset
  3. For each new dataset I need to take all rows for each id; note that every id has the same number of rows in my input dataset, and it must never happen that only some of an id's rows are taken (e.g. 2 rows for one id and 3 for another); all rows of a given id must always stay together

Example of the needed result (in the real data the proportion should of course be 70%/30% of unique ids):

train dataset:

df = pd.DataFrame({
    'id' : [999, 999, 999, 185, 185, 185],
    'target' : [1, 1, 1, 0, 0, 0],
    'event_date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'event1': [1, 6, 11, 16, np.nan, 22],
    'event2': [2, 7, np.nan, 17, 22, np.nan],
    'event3': [3, 8, 13, 18, 23, np.nan],
    'event4': [4, 9, np.nan, np.nan, np.nan, 11],
    'event5': [5, np.nan, 15, 20, 25, 1]
})

df = df.fillna(0)
df

test dataset:

df = pd.DataFrame({
    'id' : [44, 44, 44],
    'target' : [1, 1, 1],
    'event_date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'event1': [74, 109, 52],
    'event2': [ np.nan, 10, 5],
    'event3': [2, np.nan, 99],
    'event4': [8, np.nan, np.nan],
    'event5': [1, 3, np.nan]
})

# Fill missing values with zeros
df = df.fillna(0)
df

Solution

  • Drop duplicate ids with drop_duplicates, sample 70% of them with sample, then use the result to build a boolean mask with isin and split the data:

    # randomly sample 70% of the unique ids
    keep = df['id'].drop_duplicates().sample(frac=0.7)

    # boolean mask: True for rows whose id was sampled
    m = df['id'].isin(keep)

    train = df[m]
    test = df[~m]
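
    For a reproducible split, pass a seed to sample (e.g. sample(frac=0.7, random_state=42)) so the same ids land in train on every run.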
    

    Variant using a set:

    import random

    # unique ids as a list (random.sample needs a sequence, not a set)
    S = list(set(df['id']))

    # randomly pick 70% of the unique ids
    keep = random.sample(S, k=round(len(S)*0.7))

    m = df['id'].isin(keep)
    train = df[m]
    test = df[~m]
    

    Example output:

    # train
        id  target  event_date  event1  event2  event3  event4  event5
    0  999       1  2023-01-01     1.0     2.0     3.0     4.0     5.0
    1  999       1  2023-01-02     6.0     7.0     8.0     9.0     0.0
    2  999       1  2023-01-03    11.0     0.0    13.0     0.0    15.0
    6   44       1  2023-01-01    74.0     0.0     2.0     8.0     1.0
    7   44       1  2023-01-02   109.0    10.0     0.0     0.0     3.0
    8   44       1  2023-01-03    52.0     5.0    99.0     0.0     0.0
    
    # test
        id  target  event_date  event1  event2  event3  event4  event5
    3  185       0  2023-01-01    16.0    17.0    18.0     0.0    20.0
    4  185       0  2023-01-02     0.0    22.0    23.0     0.0    25.0
    5  185       0  2023-01-03    22.0     0.0     0.0    11.0     1.0
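
    As a quick sanity check (a minimal sketch reusing the df, train and test names from above), you can confirm that the two parts share no id and that roughly 70% of the unique ids went to train:

    # no id appears in both parts, and no rows are lost
    train_ids = set(train['id'])
    test_ids = set(test['id'])
    assert train_ids.isdisjoint(test_ids)
    assert len(train) + len(test) == len(df)

    # fraction of unique ids that landed in train (should be close to 0.7)
    print(len(train_ids) / df['id'].nunique())

    If scikit-learn is available, the same kind of id-level split can also be sketched with GroupShuffleSplit, which guarantees that rows sharing a group (here the id) always end up in the same part:

    from sklearn.model_selection import GroupShuffleSplit

    # one 70/30 split where all rows of a given id stay together
    splitter = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=42)
    train_idx, test_idx = next(splitter.split(df, groups=df['id']))

    train = df.iloc[train_idx]
    test = df.iloc[test_idx]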