I would like to remove rows from my PySpark DataFrame where there are null values in any of the columns, but it is taking a really long time to run when using df.dropna(). Is there any benefit, performance-wise, to using df.na.drop() instead?
I like using df.dropna() because I can specify which columns to check for null values, but I am finding that it is still very slow (the DataFrame has millions of rows, so this could be why).
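For reference, this is roughly the kind of call I mean (a minimal sketch; "id" and "value" are made-up column names):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; "id" and "value" are placeholder column names.
df = spark.createDataFrame(
    [(1, "a"), (2, None), (None, "c")],
    ["id", "value"],
)

# Drop rows that have a null in either of the listed columns.
cleaned = df.dropna(subset=["id", "value"])
```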
According to the official Spark documentation, DataFrame.dropna() and DataFrameNaFunctions.drop() are aliases of each other, so their performance should be equivalent.
In addition, df.na.drop() also accepts a subset parameter, so you don't lose that capability by switching.
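A quick sketch of the two equivalent calls, reusing the placeholder column names from the question:

```python
# These two calls are documented aliases and should produce the
# same query plan; both drop rows containing a null in any of the
# listed columns.
cleaned_a = df.dropna(subset=["id", "value"])
cleaned_b = df.na.drop(subset=["id", "value"])
```

Since they are aliases, the slowness you are seeing is more likely due to the size of the data being scanned than to which of the two entry points you call.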