Is there a simple and efficient way to check a Python DataFrame for duplicates (without dropping them) based on one or more columns?
I want to check whether a DataFrame has duplicates on a combination of columns and, if it does, fail the process.
TIA.
The easiest way is to compare the DataFrame's row count with its row count after dropping duplicates on those columns.
if df.count() > df.dropDuplicates(listOfColumns).count():
    raise ValueError('Data has duplicates')
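The snippet above uses the Spark API (`count`, `dropDuplicates`). If the DataFrame is a pandas one instead, a sketch of the same check can use `DataFrame.duplicated`, which flags duplicate rows without dropping them; the `fail_on_duplicates` helper and the sample data below are illustrative, not from the original post:

```python
import pandas as pd

def fail_on_duplicates(df: pd.DataFrame, cols: list) -> None:
    """Raise ValueError if any rows share the same values in `cols`."""
    # keep=False marks every member of a duplicate group, not just the repeats
    dup_mask = df.duplicated(subset=cols, keep=False)
    if dup_mask.any():
        raise ValueError(f"Data has duplicates on {cols}:\n{df[dup_mask]}")

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "b"]})

fail_on_duplicates(df, ["id"])  # no duplicate ids, so this passes

try:
    fail_on_duplicates(df, ["name"])  # "b" appears twice, so this raises
except ValueError as e:
    print(e)
```

An advantage over the two-count approach is that the offending rows are surfaced in the error message, which makes failures easier to debug.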