I have Data Frame in Python Pandas like below:
Input data:
df = pd.DataFrame({
'id' : [999, 999, 999, 185, 185, 185, 44, 44, 44],
'target' : [1, 1, 1, 0, 0, 0, 1, 1, 1],
'event_date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
'event1': [1, 6, 11, 16, np.nan, 22, 74, 109, 52],
'event2': [2, 7, np.nan, 17, 22, np.nan, np.nan, 10, 5],
'event3': [3, 8, 13, 18, 23, np.nan, 2, np.nan, 99],
'event4': [4, 9, np.nan, np.nan, np.nan, 11, 8, np.nan, np.nan],
'event5': [5, np.nan, 15, 20, 25, 1, 1, 3, np.nan]
})
# Wypełnienie brakujących wartości zerami
df = df.fillna(0)
df
Requirements:
My real dataset has of course many more data, but I need to divide my dataset on 2 separate datasets (train and test) based on the following requirements:
Example of needed result (of course in real data should be proportion 70% /30% of unique ids):
train dataset:
df = pd.DataFrame({
'id' : [999, 999, 999, 185, 185, 185],
'target' : [1, 1, 1, 0, 0, 0],
'event_date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
'event1': [1, 6, 11, 16, np.nan, 22],
'event2': [2, 7, np.nan, 17, 22, np.nan],
'event3': [3, 8, 13, 18, 23, np.nan],
'event4': [4, 9, np.nan, np.nan, np.nan, 11],
'event5': [5, np.nan, 15, 20, 25, 1]
})
df = df.fillna(0)
df
test dataset:
df = pd.DataFrame({
'id' : [44, 44, 44],
'target' : [1, 1, 1],
'event_date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'event1': [74, 109, 52],
'event2': [ np.nan, 10, 5],
'event3': [2, np.nan, 99],
'event4': [8, np.nan, np.nan],
'event5': [1, 3, np.nan]
})
# Wypełnienie brakujących wartości zerami
df = df.fillna(0)
df
Remove the duplicates with drop_duplicates
, sample
, then use this to build a mask with isin
to split the data:
keep = df['id'].drop_duplicates().sample(frac=0.7)
m = df['id'].isin(keep)
train = df[m]
test = df[~m]
Variant using a set
:
import random
S = list(set(df['id']))
keep = random.sample(S, k=round(len(S)*0.7))
m = df['id'].isin(keep)
train = df[m]
test = df[~m]
Example output:
# test
id target event_date event1 event2 event3 event4 event5
0 999 1 2023-01-01 1.0 2.0 3.0 4.0 5.0
1 999 1 2023-01-02 6.0 7.0 8.0 9.0 0.0
2 999 1 2023-01-03 11.0 0.0 13.0 0.0 15.0
6 44 1 2023-01-01 74.0 0.0 2.0 8.0 1.0
7 44 1 2023-01-02 109.0 10.0 0.0 0.0 3.0
8 44 1 2023-01-03 52.0 5.0 99.0 0.0 0.0
# train
id target event_date event1 event2 event3 event4 event5
3 185 0 2023-01-01 16.0 17.0 18.0 0.0 20.0
4 185 0 2023-01-02 0.0 22.0 23.0 0.0 25.0
5 185 0 2023-01-03 22.0 0.0 0.0 11.0 1.0