Search code examples
pythonpandasif-statementgroup-by

How to select rows above a certain value grouping by a different column in python?


I have the following data frame:

Column A    Column B
MD223       GATE IN
MD223       GATE OUT
MD223       LOADED
MD223       DEPARTURE
SC511       GATE IN
SC511       LOADED
SC511       SHIPPED
KR977       DISPATCHED
KR977       LOADED
KR977       SHIPPED

What I am trying to do is for each group of the same values in column A to find the first row with the value 'LOADED' in column B and then return rows before the value 'LOADED'. So expected outcome would be:

Column A    Column B
MD223       GATE IN
MD223       GATE OUT
MD223       LOADED
SC511       GATE IN
SC511       LOADED
KR977       DISPATCHED
KR977       LOADED

I would appreciate your help.

Update

The initial data frame has duplicates in column B but the outcome must be the same.

Column A    Column B
MD223       GATE IN
MD223       GATE OUT
MD223       LOADED
MD223       DEPARTURE
MD223       LOADED
SC511       GATE IN
SC511       LOADED
SC511       SHIPPED
SC511       LOADED
KR977       DISPATCHED
KR977       LOADED
KR977       SHIPPED

Expected outcome:

Column A    Column B
MD223       GATE IN
MD223       GATE OUT
MD223       LOADED
SC511       GATE IN
SC511       LOADED
KR977       DISPATCHED
KR977       LOADED

Solution

  • Lets do cummax per group in Column A to flag the all the rows before the first LOADED the use loc to select all the flagged rows

    mask = df['Column B'] == 'LOADED' 
    df.loc[mask[::-1].groupby(df['Column A']).cummax()]
    

    Update: If you have duplicates in Column B

    mask = df['Column B'].eq('LOADED') & ~df.duplicated(subset=['Column A', 'Column B'])
    df.loc[mask[::-1].groupby(df['Column A']).cummax()]
    

    Result

      Column A    Column B
    0    MD223     GATE IN
    1    MD223    GATE OUT
    2    MD223      LOADED
    4    SC511     GATE IN
    5    SC511      LOADED
    7    KR977  DISPATCHED
    8    KR977      LOADED