I have a log table which contains changes. Sign + means addition, sign - means deletion.
import pandas as pd
history = pd.DataFrame({
"First":
["X","X", "Y", "Y", "X", "X", "Y", "Z"],
"Last":
["Y", "X", "Y", "Y", "X", "X", "Y", "A"],
"Change":
["+", "+", "-", "+", "-", "+", "+", "-"],
"Date":
["2022-05-01", "2024-05-01", "2024-06-01", "2024-06-01",
"2024-05-03", "2024-05-02", "2024-06-02", "2024-06-01"]
})
history = history.sort_values(by=["Date", "Change"])
# sort needed to process the entries chronologically
This produces
First Last Change Date
0 X Y + 2022-05-01
1 X X + 2024-05-01
5 X X + 2024-05-02
4 X X - 2024-05-03
3 Y Y + 2024-06-01
2 Y Y - 2024-06-01
7 Z A - 2024-06-01
6 Y Y + 2024-06-02
In next step, I want to display only what is currently available.
I build this logic using iteration which is very slow. Basically
latest = {}
item_columns = [
"First",
"Last",
]
for _, row in history.iterrows():
key = tuple(row[column] for column in item_columns)
if row["Change"] == "+":
latest[key] = row
elif row["Change"] == "-" and key in latest:
del latest[key]
available = pd.DataFrame(latest.keys(), columns=item_columns)
This produces available items
First Last
0 X Y
1 Y Y
The issue is that loop is slow with big tables, e.g. 20 seconds for below
latest = {}
item_columns = [
"First",
"Last",
]
duplicated = pd.concat([history.iloc[[1]]] * 50000, ignore_index=True)
history = pd.concat([history, duplicated], ignore_index=True)
for _, row in history.iterrows():
key = tuple(row[column] for column in item_columns)
if row["Change"] == "+":
latest[key] = row
elif row["Change"] == "-" and key in latest:
del latest[key]
available = pd.DataFrame(latest.keys(), columns=item_columns)
Any way to speed up?
If you want to keep the last +
row: filter the rows with +
, then drop_duplicates
:
out = (history.query('Change == "+"')
.drop_duplicates(subset=['First', 'Last'], keep='last')
)
Output:
First Last Change Date
0 X Y + 2022-05-01
5 X X + 2024-05-02
6 Y Y + 2024-06-02
If you want to keep the row per combination, only if it is a +
, reverse the operations:
out = (history.drop_duplicates(subset=['First', 'Last'], keep='last')
.query('Change == "+"')
)
Output:
First Last Change Date
0 X Y + 2022-05-01
6 Y Y + 2024-06-02