I have a dataframe and I interested only the data above string text = "purchase" by session. input dataframe
session | Date | action | flag_purchase |
---|---|---|---|
T001 | 01-01-2021 00.01 | click | 1 |
T001 | 01-01-2021 00.15 | play | 1 |
T001 | 01-01-2021 02.15 | pause | 1 |
T001 | 01-01-2021 03.15 | play | 1 |
T001 | 01-01-2021 04.15 | purchase | 1 |
T001 | 02-01-2021 10.15 | play | 1 |
T001 | 02-01-2021 12.00 | pause | 1 |
T001 | 02-01-2021 13.15 | play | 1 |
T002 | 01-01-2021 00.01 | play | 0 |
T002 | 03-01-2021 00.15 | play | 0 |
T002 | 03-01-2021 02.15 | pause | 0 |
T002 | 03-01-2021 03.15 | play | 0 |
I want to drop all rows below action = "purchase", If all actions in session doesn’t have text match the session will keep all rows, so the output that I want looks like:
session | Date | action | flag_purchase |
---|---|---|---|
T001 | 01-01-2021 00.01 | click | 1 |
T001 | 01-01-2021 00.15 | play | 1 |
T001 | 01-01-2021 02.15 | pause | 1 |
T001 | 01-01-2021 03.15 | play | 1 |
T001 | 01-01-2021 04.15 | purchase | 1 |
T002 | 01-01-2021 00.01 | play | 0 |
T002 | 03-01-2021 00.15 | play | 0 |
T002 | 03-01-2021 02.15 | pause | 0 |
T002 | 03-01-2021 03.15 | play | 0 |
If I understand you correctly, then you can do the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({"id":[1,1,1,1,2,2,2,2,3,3],
"action":["pause","play","purchase","purchase","play","purchase","pause","play","play","pause"]})
print(df)
# id action
# 0 1 pause
# 1 1 play
# 2 1 purchase
# 3 1 purchase
# 4 2 play
# 5 2 purchase
# 6 2 pause
# 7 2 play
# 8 3 play
# 9 3 pause
def get_idx(row):
"""
Gets the first index of where "purchase" occurs, then
return the rows untill and incl that index
"""
idx = np.argwhere(row.values=="purchase") #get index
if idx.size>0: #check if it exists
idx = idx[0][0]+1
return row[:idx] #return the rows
return row #else, return the original rows
df_clean = df.groupby("id")["action"].apply(get_idx).reset_index(drop=False,level=0)
# id action
# 0 1 pause
# 1 1 play
# 2 1 purchase
# 4 2 play
# 5 2 purchase
# 8 3 play
# 9 3 pause