I would like to remove rows from my PySpark DataFrame where there are null values in any of the columns, but it is taking a really long time to run when using df.dropna(). Is there any benefit, performance-wise, to using df.na.drop() instead?
I like using df.dropna() because I can specify which columns to check for null values, but I am finding that it is still very slow (the DataFrame has millions of rows, so this could be why).
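For reference, this is roughly the kind of call I mean (a minimal sketch; "id" and "value" are made-up column names):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; "id" and "value" are placeholder column names.
df = spark.createDataFrame(
    [(1, "a"), (2, None), (None, "c")],
    ["id", "value"],
)

# Drop rows that have a null in either of the listed columns.
cleaned = df.dropna(subset=["id", "value"])
```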
According to the official Spark documentation, DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other, so their performance should be equivalent.
In addition, df.na.drop() also accepts a subset parameter, so you don't lose that capability by switching.
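A quick sketch of the two equivalent calls, reusing the placeholder column names from the question:

```python
# These two calls are documented aliases and should produce the
# same query plan; both drop rows containing a null in any of the
# listed columns.
cleaned_a = df.dropna(subset=["id", "value"])
cleaned_b = df.na.drop(subset=["id", "value"])
```

Since they are aliases, the slowness you are seeing is more likely due to the size of the data being scanned than to which of the two entry points you call.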