Search code examples
pandasgroup-by

Pandas groupby: Three binary columns representing three events; overwrite with most recent event


Below is a toy Pandas dataframe that has five columns: 'id' (group id), 't' (time), 'A' (Event A), 'B' (Event B), 'C' (Event C):

df = pd.DataFrame({'id' : [1,1,1,1,1,1,1,1,1,1,  2,2,2,2,2,2,2,2,2,2], 
                   't' :  [1,2,3,4,5,6,7,8,9,10, 1,2,3,4,5,6,7,8,9,10],
                   'A'  : [0,0,0,0,1,1,1,0,0,0,  1,1,1,1,1,0,0,0,0,0], 
                   'B'  : [0,0,1,1,1,1,0,0,0,0,  0,0,1,1,1,1,1,0,0,0],
                   'C'  : [1,1,1,1,0,0,0,0,0,0,  0,0,0,0,1,1,1,1,1,0]})
print(df)

    id   t  A  B  C
0    1   1  0  0  1
1    1   2  0  0  1
2    1   3  0  1  1
3    1   4  0  1  1
4    1   5  1  1  0
5    1   6  1  1  0
6    1   7  1  0  0
7    1   8  0  0  0
8    1   9  0  0  0
9    1  10  0  0  0
10   2   1  1  0  0
11   2   2  1  0  0
12   2   3  1  1  0
13   2   4  1  1  0
14   2   5  1  1  1
15   2   6  0  1  1
16   2   7  0  1  1
17   2   8  0  0  1
18   2   9  0  0  1
19   2  10  0  0  0

In Columns A-C, the value is 1 if the event was occurring in the given time, and 0 otherwise.

For each group, based on the start times of the three events, I want to replace 1 with 0 when there is a more recent event(s) occurring, so that there is no overlap between any of the events (i.e., there is no row where the sum of A, B, and C is greater than 1). To clarify, below is the desired output:

desired_output_df = pd.DataFrame({'id' : [1,1,1,1,1,1,1,1,1,1,  2,2,2,2,2,2,2,2,2,2], 
                                  't' :  [1,2,3,4,5,6,7,8,9,10, 1,2,3,4,5,6,7,8,9,10],
                                  'A'  : [0,0,0,0,1,1,1,0,0,0,  1,1,0,0,0,0,0,0,0,0], 
                                  'B'  : [0,0,1,1,0,0,0,0,0,0,  0,0,1,1,0,0,0,0,0,0],
                                  'C'  : [1,1,0,0,0,0,0,0,0,0,  0,0,0,0,1,1,1,1,1,0]})
print(desired_output_df)
    id   t  A  B  C
0    1   1  0  0  1
1    1   2  0  0  1
2    1   3  0  1  0
3    1   4  0  1  0
4    1   5  1  0  0
5    1   6  1  0  0
6    1   7  1  0  0
7    1   8  0  0  0
8    1   9  0  0  0
9    1  10  0  0  0
10   2   1  1  0  0
11   2   2  1  0  0
12   2   3  0  1  0
13   2   4  0  1  0
14   2   5  0  0  1
15   2   6  0  0  1
16   2   7  0  0  1
17   2   8  0  0  1
18   2   9  0  0  1
19   2  10  0  0  0

I have tried using np.where, without much success. I would appreciate your help.


Solution

  • You main logic is there is no row where the sum of A, B, and C is greater than 1 is a very good hint So we want to identify

    • the events (blocks of 1 within each id)
    • the event start times
    • for each row, keep the event with smallest start time

    Let's try:

    cols = ['A','B','C']
    
    # identify the events
    events = df[cols].apply(lambda x: x.rsub(1).groupby(df['id']).cumsum()).where(df[cols]==1)
    
    # identify the event start time
    event_starts = events.apply(lambda x: df['t'].groupby([df['id'], x]).transform('min'))
    
    # keep the event with smallest start time by comparison  
    with_smallest_start = event_starts.eq(event_starts.min(axis=1), axis=0).astype(int)
    

    output:

        A  B  C
    0   0  0  1
    1   0  0  1
    2   0  0  1
    3   0  0  1
    4   0  1  0
    5   0  1  0
    6   1  0  0
    7   0  0  0
    8   0  0  0
    9   0  0  0
    10  1  0  0
    11  1  0  0
    12  1  0  0
    13  1  0  0
    14  1  0  0
    15  0  1  0
    16  0  1  0
    17  0  0  1
    18  0  0  1
    19  0  0  0
    

    Note: if you have events that start at the same time, then some rows of with_smallest_start can have more than one 1. In which case, you can keep the first (left-most) event with:

    final = with_smallest_start.cumsum(axis=1).le(1)