I have a dataframe which looks something like this where I have already sorted it based on the Page Count
and the column Split Parties
contains list of elements
DocSetID | Duplicate Status | Page Count | split_counterparties |
---|---|---|---|
8032866 | Primary | 39 | ['Bass Pro LLC'] |
8032866 | Duplicate | 39 | ['Bass Pro LLC'] |
8040900 | Primary | 39 | ['Bass Pro LLC'] |
8009326 | Primary | 39 | ['Blackrock Inc'] |
8014340 | Primary | 39 | ['Booking Holdings Inc'] |
8010961 | Primary | 39 | ['Cadence Design Systems Inc'] |
8010932 | Primary | 39 | ['Cadence Design Systems Inc'] |
8019492 | Primary | . 39 | ['Cartavi LLC'] |
I want only to select the rows where the Split parties
are repeated. So for example based on these criteria I would get back
So far I have gotten to sorting the dataframe by Page count and I am really stuck on how to get this further filtering based on similar list elements in successive rows.
I would really appreciate some help or guidance in this regard.
Edit: Added more relevant data examples
IIUC, you can use .duplicated
:
mask = df.duplicated(subset="Split parties", keep=False)
print(df.loc[mask, "Doc id"].to_list())
Prints:
[1, 2, 3, 6]