Search code examples
pythonpandasgroup-by

How do I quickly drop rows based on the max value in a groupby?


I have a large dataframe containing information on people and their job change history. Sometimes, someone had multiple changes to their record on one day, each of which is assigned a transaction sequence number. I just want to keep the rows with the highest transaction sequence number of that day. Currently, I'm using the for loop below to do this, but it takes forever.

list_indexes_to_drop = []
for (associate_id, date), df in df_job_his.groupby(["Employee ID", "Event Date"]):
    if len(df) > 1:
        list_indexes_to_drop += list(df.index[df["Transaction Sequence Number"] != df["Transaction Sequence Number"].max()])

I also have this code below, but I'm not sure how to use it to filter the dataframe.

df_job_his.groupby(["Employee ID", "Event Date"])["Transaction Sequence Number"].max()

Is there a more efficient way to go about this?

Here's an example of some random data in the same format:

df_job_his = pd.DataFrame({"Employee ID": [1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 8, 9, 9, 10], "Event Date": ["2020-04-05", "2020-06-08", "2020-06-08", "2022-09-01", "2022-02-15", "2022-02-15", "2021-07-29", "2021-07-29", "2021-08-14", "2021-09-14", "2022-01-04", "2022-01-04", "2022-01-04", "2022-04-04", "2020-08-13", "2020-08-13", "2020-03-17"], "Transaction Sequence Number": [1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1]}).groupby(["Employee ID", "Event Date"])

Solution

  • It you groupby was almost a correct answer!

    A trick to get the value with the highest "Transaction Sequence Number", would be to use .groupby.last() after sorting the dataframe by Transaction Sequence Number

    Here's a solution:

    import pandas as pd
    import numpy as np
    df_job_his = pd.DataFrame({
        'Employee ID': [0, 0, 0, 0, 1, 1, 1],
        'Event Date': [1, 2, 3, 3, 1, 2, 3],
        'Transaction Sequence Number': [1, 2, 4, 3, 5, 6, 7],
        'Important info about transaction': np.random.random(7)
    })
    
    df_job_his.sort_values('Transaction Sequence Number').groupby(
        ["Employee ID", "Event Date"]).last()
    
    

    It outputs something like this, where the employee o for date 3 gets the last row only.

    (Employee ID,Event Date) Transaction Sequence Number Important info about transaction
    (0, 1) 1 0.00571239
    (0, 2) 2 0.0484783
    (0, 3) 4 0.958739
    (1, 1) 5 0.0690461
    (1, 2) 6 0.721041
    (1, 3) 7 0.763681