I have a Pandas DataFrame:
msg_id | identifier |
---|---|
001 | Stackoverflow |
001 | Stackoverflow |
002 | Stackoverflow |
002 | Cross-Validated |
I want to drop the duplicated values in identifier
for each unique value of msg_id
This is my current apporach which is super slow:
acc_df = pd.DataFrame(columns = df.columns)
for _, group in df.groupby("msg_id"):
df = group[group.duplicated("identifier")]
if len(df) > 0:
acc_df = pd.concat([df, acc_df], axis=0, ignore_index=False)
acc_df
I have a very large dataset with 500 million rows. Even after filtering for the msg_id
that has more than one identifier
comes at the very large number.
I am looking for any vectorized or faster apporach NOT INCLUDING Multi-Processing and Threading
Code
The problem is to find rows where the values of two columns are duplicated, not grouped. This is possible as follows.
df[df.duplicated(['msg_id', 'identifier'])]