python, pandas, pandas-groupby

Subset pandas dataframe up to when condition is met the first time


I have not had any luck subsetting a pandas dataframe, grouped by id, so that each group keeps its rows up to and including the first row where status is 1. In the actual dataset I have several columns in between 'id' and 'status'.

For example:

import pandas as pd

d = {'id': [1,1,1,1,1,1,1,2,2,2,2,2,2,2], 'status': [0,0,0,0,1,1,1,0,0,0,0,1,0,1]}
df = pd.DataFrame(data=d)

    id  status
0    1       0
1    1       0
2    1       0
3    1       0
4    1       1
5    1       1
6    1       1
7    2       0
8    2       0
9    2       0
10   2       0
11   2       1
12   2       0
13   2       1

The desired subset would be:

    id  status
0    1       0
1    1       0
2    1       0
3    1       0
4    1       1
5    2       0
6    2       0
7    2       0
8    2       0
9    2       1

Solution

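  • Let's try groupby + cumsum. With a 0/1 status column, applying cumsum twice within each group gives a value that stays at or below 1 only up to and including the first row where status is 1: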

    df = df.groupby('id', group_keys=False)\
           .apply(lambda x: x[x.status.cumsum().cumsum().le(1)])\
           .reset_index(drop=True)
    df
    
       id  status
    0   1       0
    1   1       0
    2   1       0
    3   1       0
    4   1       1
    5   2       0
    6   2       0
    7   2       0
    8   2       0
    9   2       1
    

    Here's an alternative that performs a groupby to create a mask to be used as an indexer:

    # group_keys=False keeps the original index so the mask aligns with df
    df = df[df.status.eq(1).groupby(df.id, group_keys=False)\
              .apply(lambda x: x.cumsum().cumsum().le(1))]\
              .reset_index(drop=True)
    df
    
       id  status
    0   1       0
    1   1       0
    2   1       0
    3   1       0
    4   1       1
    5   2       0
    6   2       0
    7   2       0
    8   2       0
    9   2       1
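
To see why the double cumsum works, it may help to look at the intermediate values. The sketch below is not part of the original answer: it swaps apply for the built-in groupby cumsum, and the names seen, running, mask and result are only illustrative. It assumes status holds nothing but 0 and 1.

```python
import pandas as pd

d = {'id': [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
     'status': [0,0,0,0,1,1,1,0,0,0,0,1,0,1]}
df = pd.DataFrame(data=d)

# First per-group cumsum: how many 1s have been seen so far within each id.
# (Assumes status is strictly 0/1.)
seen = df.groupby('id')['status'].cumsum()

# Second per-group cumsum: stays <= 1 only up to and including the first 1,
# because every later row adds at least 1 more to the running total.
running = seen.groupby(df['id']).cumsum()

# Boolean mask aligned with df's original index, then filter and renumber.
mask = running.le(1)
result = df[mask].reset_index(drop=True)
print(result)
```

Because the mask keeps the original index, it can be used directly as an indexer, and any other columns between 'id' and 'status' come along unchanged.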