Search code examples
pythonpandasvectorization

How to make a group with condition (Python Pandas) without for loop?


I have a dataframe like below :

id_ = ['A','A','A','A','A','B','B','B','B']
days = [1,10,20,25,100,10,20,30,50]
event = [0,0,0,1,0,0,0,1,0]
df= pd.DataFrame({'id':id_,'days':days,'event':event})

Input:

  id  days  event
0  A     1      0
1  A    10      0
2  A    20      0
3  A    25      1
4  A   100      0
5  B    10      0
6  B    20      0
7  B    30      1
8  B    50      0

I want to make a group around event = 1 :

I want to group the rows within 30 days of the row with event = 1, per id.

I expect the output below (without for loop):

  id  days  event  group
0  A     1      0    1.0
1  A    10      0    1.0
2  A    20      0    1.0
3  A    25      1    1.0
4  A   100      0    NaN
5  B    10      0    2.0
6  B    20      0    2.0
7  B    30      1    2.0
8  B    50      0    2.0

Solution

  • You can use a merge_asof:

    df['group'] = (pd.merge_asof(
                    df.reset_index().sort_values(by='days'),
                    df[df['event'].eq(1)].sort_values(by='days'),
                    by='id', on='days',
                    direction='nearest', tolerance=29)
                     .set_index('index')['event_y']
                  )
    df['group'] *= pd.factorize(df['id'])[0]+1
    

    Output:

      id  days  event  group
    0  A     1      0    1.0
    1  A    10      0    1.0
    2  A    20      0    1.0
    3  A    25      1    1.0
    4  A   100      0    NaN
    5  B    10      0    2.0
    6  B    20      0    2.0
    7  B    30      1    2.0
    8  B    50      0    2.0