Search code examples
pythonpython-3.xpandasdataframedata-analysis

Find all the rows in a pandas dataframe where list values repeat in a column


I have a dataframe which looks something like this where I have already sorted it based on the Page Count and the column Split Parties contains list of elements

DocSetID Duplicate Status Page Count split_counterparties
8032866 Primary 39 ['Bass Pro LLC']
8032866 Duplicate 39 ['Bass Pro LLC']
8040900 Primary 39 ['Bass Pro LLC']
8009326 Primary 39 ['Blackrock Inc']
8014340 Primary 39 ['Booking Holdings Inc']
8010961 Primary 39 ['Cadence Design Systems Inc']
8010932 Primary 39 ['Cadence Design Systems Inc']
8019492 Primary . 39 ['Cartavi LLC']

I want only to select the rows where the Split parties are repeated. So for example based on these criteria I would get back

  • rows containing Bass Pro LLC
  • rows containing Cadence Design Systems.

So far I have gotten to sorting the dataframe by Page count and I am really stuck on how to get this further filtering based on similar list elements in successive rows.

I would really appreciate some help or guidance in this regard.

Edit: Added more relevant data examples


Solution

  • IIUC, you can use .duplicated:

    mask = df.duplicated(subset="Split parties", keep=False)
    print(df.loc[mask, "Doc id"].to_list())
    

    Prints:

    [1, 2, 3, 6]