Search code examples
pythonpandasdataframedrop

How to drop all rows below string match dataframe


I have a dataframe and I interested only the data above string text = "purchase" by session. input dataframe

session Date action flag_purchase
T001 01-01-2021 00.01 click 1
T001 01-01-2021 00.15 play 1
T001 01-01-2021 02.15 pause 1
T001 01-01-2021 03.15 play 1
T001 01-01-2021 04.15 purchase 1
T001 02-01-2021 10.15 play 1
T001 02-01-2021 12.00 pause 1
T001 02-01-2021 13.15 play 1
T002 01-01-2021 00.01 play 0
T002 03-01-2021 00.15 play 0
T002 03-01-2021 02.15 pause 0
T002 03-01-2021 03.15 play 0

I want to drop all rows below action = "purchase", If all actions in session doesn’t have text match the session will keep all rows, so the output that I want looks like:

final result

session Date action flag_purchase
T001 01-01-2021 00.01 click 1
T001 01-01-2021 00.15 play 1
T001 01-01-2021 02.15 pause 1
T001 01-01-2021 03.15 play 1
T001 01-01-2021 04.15 purchase 1
T002 01-01-2021 00.01 play 0
T002 03-01-2021 00.15 play 0
T002 03-01-2021 02.15 pause 0
T002 03-01-2021 03.15 play 0

Solution

  • If I understand you correctly, then you can do the following:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({"id":[1,1,1,1,2,2,2,2,3,3],
    "action":["pause","play","purchase","purchase","play","purchase","pause","play","play","pause"]})
    
    print(df)
    
    #   id  action
    # 0  1  pause
    # 1  1  play
    # 2  1  purchase
    # 3  1  purchase
    # 4  2  play
    # 5  2  purchase
    # 6  2  pause
    # 7  2  play
    # 8  3  play
    # 9  3  pause
    
    
    def get_idx(row):
        """
        Gets the first index of where "purchase" occurs, then 
        return the rows untill and incl that index
        """
    
        idx = np.argwhere(row.values=="purchase") #get index
        if idx.size>0: #check if it exists
            idx = idx[0][0]+1
            return row[:idx] #return the rows
        return row #else, return the original rows
    
    df_clean = df.groupby("id")["action"].apply(get_idx).reset_index(drop=False,level=0)
    
    #    id action
    # 0  1  pause
    # 1  1  play
    # 2  1  purchase
    # 4  2  play
    # 5  2  purchase
    # 8  3  play
    # 9  3  pause