Below is a toy Pandas dataframe that has five columns: 'id' (group id), 't' (time), 'A' (Event A), 'B' (Event B), 'C' (Event C):
df = pd.DataFrame({'id' : [1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2],
't' : [1,2,3,4,5,6,7,8,9,10, 1,2,3,4,5,6,7,8,9,10],
'A' : [0,0,0,0,1,1,1,0,0,0, 1,1,1,1,1,0,0,0,0,0],
'B' : [0,0,1,1,1,1,0,0,0,0, 0,0,1,1,1,1,1,0,0,0],
'C' : [1,1,1,1,0,0,0,0,0,0, 0,0,0,0,1,1,1,1,1,0]})
print(df)
id t A B C
0 1 1 0 0 1
1 1 2 0 0 1
2 1 3 0 1 1
3 1 4 0 1 1
4 1 5 1 1 0
5 1 6 1 1 0
6 1 7 1 0 0
7 1 8 0 0 0
8 1 9 0 0 0
9 1 10 0 0 0
10 2 1 1 0 0
11 2 2 1 0 0
12 2 3 1 1 0
13 2 4 1 1 0
14 2 5 1 1 1
15 2 6 0 1 1
16 2 7 0 1 1
17 2 8 0 0 1
18 2 9 0 0 1
19 2 10 0 0 0
In Columns A-C, the value is 1 if the event was occurring in the given time, and 0 otherwise.
For each group, based on the start times of the three events, I want to replace 1 with 0 when there is a more recent event(s) occurring, so that there is no overlap between any of the events (i.e., there is no row where the sum of A, B, and C is greater than 1). To clarify, below is the desired output:
desired_output_df = pd.DataFrame({'id' : [1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2],
't' : [1,2,3,4,5,6,7,8,9,10, 1,2,3,4,5,6,7,8,9,10],
'A' : [0,0,0,0,1,1,1,0,0,0, 1,1,0,0,0,0,0,0,0,0],
'B' : [0,0,1,1,0,0,0,0,0,0, 0,0,1,1,0,0,0,0,0,0],
'C' : [1,1,0,0,0,0,0,0,0,0, 0,0,0,0,1,1,1,1,1,0]})
print(desired_output_df)
id t A B C
0 1 1 0 0 1
1 1 2 0 0 1
2 1 3 0 1 0
3 1 4 0 1 0
4 1 5 1 0 0
5 1 6 1 0 0
6 1 7 1 0 0
7 1 8 0 0 0
8 1 9 0 0 0
9 1 10 0 0 0
10 2 1 1 0 0
11 2 2 1 0 0
12 2 3 0 1 0
13 2 4 0 1 0
14 2 5 0 0 1
15 2 6 0 0 1
16 2 7 0 0 1
17 2 8 0 0 1
18 2 9 0 0 1
19 2 10 0 0 0
I have tried using np.where, without much success. I would appreciate your help.
You main logic is there is no row where the sum of A, B, and C is greater than 1 is a very good hint So we want to identify
Let's try:
cols = ['A','B','C']
# identify the events
events = df[cols].apply(lambda x: x.rsub(1).groupby(df['id']).cumsum()).where(df[cols]==1)
# identify the event start time
event_starts = events.apply(lambda x: df['t'].groupby([df['id'], x]).transform('min'))
# keep the event with smallest start time by comparison
with_smallest_start = event_starts.eq(event_starts.min(axis=1), axis=0).astype(int)
output:
A B C
0 0 0 1
1 0 0 1
2 0 0 1
3 0 0 1
4 0 1 0
5 0 1 0
6 1 0 0
7 0 0 0
8 0 0 0
9 0 0 0
10 1 0 0
11 1 0 0
12 1 0 0
13 1 0 0
14 1 0 0
15 0 1 0
16 0 1 0
17 0 0 1
18 0 0 1
19 0 0 0
Note: if you have events that start at the same time, then some rows of with_smallest_start
can have more than one 1
. In which case, you can keep the first (left-most) event with:
final = with_smallest_start.cumsum(axis=1).le(1)