Tags: python, pandas, dataset

How to split a DataFrame into 2 separate datasets with 70%/30% of unique ids, taking all rows for each id, in Python Pandas?


I have a DataFrame in Python Pandas like below:

Input data:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id' : [999, 999, 999, 185, 185, 185, 44, 44, 44],
    'target' : [1, 1, 1, 0, 0, 0, 1, 1, 1],
    'event_date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'event1': [1, 6, 11, 16, np.nan, 22, 74, 109, 52],
    'event2': [2, 7, np.nan, 17, 22, np.nan, np.nan, 10, 5],
    'event3': [3, 8, 13, 18, 23, np.nan, 2, np.nan, 99],
    'event4': [4, 9, np.nan, np.nan, np.nan, 11, 8, np.nan, np.nan],
    'event5': [5, np.nan, 15, 20, 25, 1, 1, 3, np.nan]
})

# Fill missing values with zeros
df = df.fillna(0)
df


Requirements:

My real dataset of course contains much more data, but I need to split it into 2 separate datasets (train and test) based on the following requirements:

  1. For the train dataset I need to take 70% of the unique ids from my input dataset
  2. For the test dataset I need to take the remaining 30% of the unique ids from my input dataset
  3. For each new dataset I need to take all rows for each id; note that every id has the same number of rows in my input dataset, and it must never happen that only some of an id's rows are taken (e.g. 2 rows for one id and 3 for another); all rows of a given id must always stay together

Example of the needed result (in the real data the proportion should of course be 70%/30% of unique ids):

train dataset:

df = pd.DataFrame({
    'id' : [999, 999, 999, 185, 185, 185],
    'target' : [1, 1, 1, 0, 0, 0],
    'event_date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'event1': [1, 6, 11, 16, np.nan, 22],
    'event2': [2, 7, np.nan, 17, 22, np.nan],
    'event3': [3, 8, 13, 18, 23, np.nan],
    'event4': [4, 9, np.nan, np.nan, np.nan, 11],
    'event5': [5, np.nan, 15, 20, 25, 1]
})

df = df.fillna(0)
df

test dataset:

df = pd.DataFrame({
    'id' : [44, 44, 44],
    'target' : [1, 1, 1],
    'event_date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'event1': [74, 109, 52],
    'event2': [ np.nan, 10, 5],
    'event3': [2, np.nan, 99],
    'event4': [8, np.nan, np.nan],
    'event5': [1, 3, np.nan]
})

# Fill missing values with zeros
df = df.fillna(0)
df

Solution

  • Drop duplicate ids with drop_duplicates, sample 70% of them with sample, then use the result to build a boolean mask with isin and split the data:

    # randomly sample 70% of the unique ids
    keep = df['id'].drop_duplicates().sample(frac=0.7)

    # boolean mask: True for rows whose id was sampled
    m = df['id'].isin(keep)

    train = df[m]
    test = df[~m]
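
    For a reproducible split, pass a seed to sample (e.g. sample(frac=0.7, random_state=42)) so the same ids land in train on every run.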
    

    Variant using a set:

    import random

    # unique ids as a list (random.sample needs a sequence, not a set)
    S = list(set(df['id']))

    # randomly pick 70% of the unique ids
    keep = random.sample(S, k=round(len(S)*0.7))

    m = df['id'].isin(keep)
    train = df[m]
    test = df[~m]
    

    Example output:

    # train
        id  target  event_date  event1  event2  event3  event4  event5
    0  999       1  2023-01-01     1.0     2.0     3.0     4.0     5.0
    1  999       1  2023-01-02     6.0     7.0     8.0     9.0     0.0
    2  999       1  2023-01-03    11.0     0.0    13.0     0.0    15.0
    6   44       1  2023-01-01    74.0     0.0     2.0     8.0     1.0
    7   44       1  2023-01-02   109.0    10.0     0.0     0.0     3.0
    8   44       1  2023-01-03    52.0     5.0    99.0     0.0     0.0
    
    # test
        id  target  event_date  event1  event2  event3  event4  event5
    3  185       0  2023-01-01    16.0    17.0    18.0     0.0    20.0
    4  185       0  2023-01-02     0.0    22.0    23.0     0.0    25.0
    5  185       0  2023-01-03    22.0     0.0     0.0    11.0     1.0
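
    As a quick sanity check (a minimal sketch reusing the df, train and test names from above), you can confirm that the two parts share no id and that roughly 70% of the unique ids went to train:

    # no id appears in both parts, and no rows are lost
    train_ids = set(train['id'])
    test_ids = set(test['id'])
    assert train_ids.isdisjoint(test_ids)
    assert len(train) + len(test) == len(df)

    # fraction of unique ids that landed in train (should be close to 0.7)
    print(len(train_ids) / df['id'].nunique())

    If scikit-learn is available, the same kind of id-level split can also be sketched with GroupShuffleSplit, which guarantees that rows sharing a group (here the id) always end up in the same part:

    from sklearn.model_selection import GroupShuffleSplit

    # one 70/30 split where all rows of a given id stay together
    splitter = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=42)
    train_idx, test_idx = next(splitter.split(df, groups=df['id']))

    train = df.iloc[train_idx]
    test = df.iloc[test_idx]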