Tags: python-2.7, dataframe, pyspark, apache-spark-sql

Check for duplicates in a PySpark DataFrame


Is there a simple and efficient way to check a PySpark DataFrame for duplicates (without dropping them) based on one or more columns?

I want to check whether a DataFrame has duplicates based on a combination of columns and, if it does, fail the process.

TIA.


Solution

  • The easiest way is to check whether the number of rows in the DataFrame equals the number of rows left after dropping duplicates on those columns.

    # listOfColumns is the list of column names that defines a duplicate,
    # e.g. ["id", "day"]; dropDuplicates takes the list directly, not
    # wrapped in another list.
    if df.count() > df.dropDuplicates(listOfColumns).count():
        raise ValueError('Data has duplicates')
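  • Note that this runs two Spark jobs, since each count() is an action. If you also want to see which key combinations are duplicated, not just whether any exist, a groupBy-based check works too. Below is a minimal, self-contained sketch; the SparkSession setup, toy data, and the key_columns list are placeholder assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy data; "id" and "day" are hypothetical key columns.
    df = spark.createDataFrame(
        [(1, "mon"), (1, "mon"), (2, "tue")],
        ["id", "day"],
    )

    key_columns = ["id", "day"]

    # Group by the key columns and keep only groups occurring more than once.
    dupes = (
        df.groupBy(key_columns)
          .count()
          .filter(F.col("count") > 1)
    )

    # head(1) returns a non-empty list if at least one duplicated key exists.
    if dupes.head(1):
        dupes.show()
        raise ValueError('Data has duplicates')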