I have a large dataframe containing information on people and their job change history. Sometimes, someone had multiple changes to their record on one day, each of which is assigned a transaction sequence number. I just want to keep the rows with the highest transaction sequence number of that day. Currently, I'm using the for loop below to do this, but it takes forever.
list_indexes_to_drop = []
for (associate_id, date), df in df_job_his.groupby(["Employee ID", "Event Date"]):
if len(df) > 1:
list_indexes_to_drop += list(df.index[df["Transaction Sequence Number"] != df["Transaction Sequence Number"].max()])
I also have this code below, but I'm not sure how to use it to filter the dataframe.
df_job_his.groupby(["Employee ID", "Event Date"])["Transaction Sequence Number"].max()
Is there a more efficient way to go about this?
Here's an example of some random data in the same format:
df_job_his = pd.DataFrame({"Employee ID": [1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 8, 9, 9, 10], "Event Date": ["2020-04-05", "2020-06-08", "2020-06-08", "2022-09-01", "2022-02-15", "2022-02-15", "2021-07-29", "2021-07-29", "2021-08-14", "2021-09-14", "2022-01-04", "2022-01-04", "2022-01-04", "2022-04-04", "2020-08-13", "2020-08-13", "2020-03-17"], "Transaction Sequence Number": [1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1]}).groupby(["Employee ID", "Event Date"])
It you groupby was almost a correct answer!
A trick to get the value with the highest "Transaction Sequence Number", would be to use .groupby.last() after sorting the dataframe by Transaction Sequence Number
Here's a solution:
import pandas as pd
import numpy as np
df_job_his = pd.DataFrame({
'Employee ID': [0, 0, 0, 0, 1, 1, 1],
'Event Date': [1, 2, 3, 3, 1, 2, 3],
'Transaction Sequence Number': [1, 2, 4, 3, 5, 6, 7],
'Important info about transaction': np.random.random(7)
})
df_job_his.sort_values('Transaction Sequence Number').groupby(
["Employee ID", "Event Date"]).last()
It outputs something like this, where the employee o for date 3 gets the last row only.
(Employee ID,Event Date) | Transaction Sequence Number | Important info about transaction |
---|---|---|
(0, 1) | 1 | 0.00571239 |
(0, 2) | 2 | 0.0484783 |
(0, 3) | 4 | 0.958739 |
(1, 1) | 5 | 0.0690461 |
(1, 2) | 6 | 0.721041 |
(1, 3) | 7 | 0.763681 |