Search code examples
pythonpandasduplicatesaggregation

Finding Duplicated Values in Pandas Groupby Object


I have a Pandas DataFrame:

msg_id identifier
001 Stackoverflow
001 Stackoverflow
002 Stackoverflow
002 Cross-Validated

I want to drop the duplicated values in identifier for each unique value of msg_id

This is my current apporach which is super slow:

acc_df = pd.DataFrame(columns = df.columns)
for _, group in df.groupby("msg_id"):
    df = group[group.duplicated("identifier")]
    if len(df) > 0:
        acc_df = pd.concat([df, acc_df], axis=0, ignore_index=False)
acc_df

I have a very large dataset with 500 million rows. Even after filtering for the msg_id that has more than one identifier comes at the very large number.

I am looking for any vectorized or faster apporach NOT INCLUDING Multi-Processing and Threading


Solution

  • Code

    The problem is to find rows where the values of two columns are duplicated, not grouped. This is possible as follows.

    df[df.duplicated(['msg_id', 'identifier'])]