Tags: apache-spark, pyspark

How to remove None values


I have a Spark dataframe with a None value in the first row.

df_spark.show()

(screenshot: output of df_spark.show(), with NaN in the first row of the num column)

I initially created the above dataframe in pandas, then converted it to a Spark dataframe:

df = pd.DataFrame(
    {
        'rid': ['A', 'B', 'C'],
        'num': [None, 8, 9],
        'availability_percent': [56, 69, 70],
        'availability_spaces': [7, 6, 5]
    }
)

Then:

df_spark = spark.createDataFrame(df)
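The root of the problem is already visible on the pandas side: mixing `None` with integers coerces the `num` column to `float64`, and the missing value is stored as NaN, which Spark then imports as a NaN double rather than a SQL null. A quick check (pandas only, no Spark needed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'rid': ['A', 'B', 'C'],
        'num': [None, 8, 9],
    }
)

# The None was silently coerced to NaN and the column promoted to float64.
print(df['num'].dtype)           # float64
print(np.isnan(df['num'].iloc[0]))  # True
```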

When I run df_spark.filter(df_spark.num.isNotNull()).show()

I get the same dataframe as above, meaning the row with the NaN value was not removed. What did I do wrong?



Solution

  • When pandas converts the numeric column, the None becomes NaN, and Spark's isNotNull() does not treat NaN as null. You can add a check for isnan to cover the NaN case:

    from pyspark.sql.functions import isnan
    
    # keep rows where num is neither NaN nor null
    df_spark.filter(~isnan(df_spark.num) & df_spark.num.isNotNull()).show()
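Two alternatives worth knowing: you can drop the NaN rows on the pandas side before calling createDataFrame, or use Spark's na.drop, whose documentation states it omits rows with null or NaN values (unlike isNotNull). A minimal sketch; the Spark call is shown as a comment since it needs a live SparkSession:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'rid': ['A', 'B', 'C'],
        'num': [None, 8, 9],
    }
)

# Option 1: clean in pandas before converting to Spark.
clean = df.dropna(subset=['num'])
print(len(clean))  # 2 rows remain

# Option 2 (Spark side): na.drop treats NaN as missing for float columns.
# df_spark.na.drop(subset=['num']).show()
```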