Tags: dataframe, scala, pyspark, apache-spark-sql

PySpark df.na.drop() vs. df.dropna()


I would like to remove rows from my PySpark df where there are null values in any of the columns, but df.dropna() is taking a really long time to run. Is there any performance benefit to using df.na.drop() instead?

I like using df.dropna() because I can specify which columns to check for null values, but I am finding that it is still very slow (the data frame has millions of rows, which could be why)...


Solution

  • According to the official Spark documentation, DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other, so their performance should be identical.

    In addition, df.na.drop() can also take a subset, just like df.dropna(); see the sketch below.
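
    A minimal sketch (the DataFrame and column names are illustrative) showing that the two calls are interchangeable, including when restricting the null check to a subset of columns:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Illustrative data: one row has a null name, one has a null id
    df = spark.createDataFrame(
        [(1, "a"), (2, None), (None, "c")],
        ["id", "name"],
    )

    # These two calls are equivalent -- dropna() is an alias of na.drop()
    df.dropna(subset=["id"]).show()
    df.na.drop(subset=["id"]).show()
    ```

    Since the two are aliases, the slow runtime is more likely explained by scanning millions of rows than by the choice of API.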