Tags: python, apache-spark, pyspark, data-cleaning, data-processing

How do I calculate the percentage of None or NaN values in PySpark?


I have a large dataset in PySpark and want to calculate the percentage of None/NaN values per column and store the result in another dataframe called percentage_missing. For example, if the following were the input dataframe:

df = sc.parallelize([
    (0.4, 0.3),
    (None, None),
    (9.7, None), 
    (None, None)
]).toDF(["A", "B"])

I would like the output to be a dataframe where column 'A' contains the value 0.5 and column 'B' contains the value 0.75.

I am looking for something like this:

for column_ in my_columns:
  amount_missing = df[df[column_] == None].count().div(len(df)) * 100

If there is a library with a function that does this I would also be happy to use it.


Solution

  • The following code does exactly what you asked:

    from pyspark.sql.functions import col, count, isnan, lit, when
    
    df:
    
    +----+----+
    |   A|   B|
    +----+----+
    | 0.4| 0.3|
    |null|null|
    | 9.7|null|
    |null|null|
    +----+----+
    
    # Generic solution for all columns: for each column, count the rows that
    # are NaN or null and divide by the total row count (count(lit(1))).
    amount_missing_df = df.select([
        (count(when(isnan(c) | col(c).isNull(), c)) / count(lit(1))).alias(c)
        for c in df.columns
    ])
    amount_missing_df.show()
    
    amount_missing_df:
    
    +---+----+
    |  A|   B|
    +---+----+
    |0.5|0.75|
    +---+----+
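
  • If you want the result as actual percentages (as in your loop sketch) or as a plain Python dict instead of a one-row dataframe, a small variation of the same expression works. This is a minimal sketch assuming the same df as above; the names percentage_missing and missing_dict are just illustrative:

    from pyspark.sql.functions import col, count, isnan, lit, when
    
    # Same per-column expression, scaled by 100 to get percentages
    percentage_missing = df.select([
        (count(when(isnan(c) | col(c).isNull(), c)) / count(lit(1)) * 100).alias(c)
        for c in df.columns
    ])
    percentage_missing.show()
    
    # Optionally collect the single result row back to the driver as a dict,
    # e.g. {'A': 50.0, 'B': 75.0}
    missing_dict = percentage_missing.first().asDict()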