Tags: dataframe, scala, apache-spark, apache-spark-sql, nullpointerexception

Filter out and log null values from Spark dataframe


I have this dataframe:

+------+-------------------+-----------+
|brand |original_timestamp |weight     |
+------+-------------------+-----------+
|BR1   |1632899456         |4.0        |
|BR2   |1632899456         |null       |
|BR3   |1632899456         |2.0        |
|BR4   |1632899155         |2.0        |
|BR5   |1632899155         |null       |
+------+-------------------+-----------+

I would like to drop (filter out) the null values and print a message for each dropped row, such as:

"Weight for brand BR2 is null, dropping it from the data"

"Weight for brand BR5 is null, dropping it from the data"

I am using Spark version 3.2.2 with SQLContext and Scala.


Solution

  • Use the filter function with isNull and isNotNull to get the desired output.

    Filter null values using isNull

    df.filter($"weight".isNull).show(false)
    
    +-----+------------------+------+
    |brand|original_timestamp|weight|
    +-----+------------------+------+
    |BR2  |1632899456        |NULL  |
    |BR5  |1632899155        |NULL  |
    +-----+------------------+------+
    
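    The same null filter can also be written with a SQL expression string, if you prefer SQL syntax:

    ```scala
    // Equivalent to df.filter($"weight".isNull)
    df.where("weight IS NULL").show(false)
    ```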

    Print a message for each null row

    df.filter($"weight".isNull)
      .selectExpr("""
        concat(
          '\"Weight for brand ',
          brand,
          ' is null, dropping it from the data\"'
        ) message
      """)
      .as[String]
      .collect
      .foreach(println)
    
    "Weight for brand BR2 is null, dropping it from the data"
    "Weight for brand BR5 is null, dropping it from the data"
    
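    As a small equivalent alternative, `na.drop` restricted to the weight column produces the same result as the `isNotNull` filter shown below:

    ```scala
    // Drops rows where weight is null; Seq("weight") restricts the check to that column
    val cleaned = df.na.drop(Seq("weight"))
    cleaned.show(false)
    ```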

    Filter non-null values using isNotNull

    df.filter($"weight".isNotNull).show(false)
    
    +-----+------------------+------+
    |brand|original_timestamp|weight|
    +-----+------------------+------+
    |BR1  |1632899456        |4.0   |
    |BR3  |1632899456        |2.0   |
    |BR4  |1632899155        |2.0   |
    +-----+------------------+------+
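
    Putting the two steps together, a minimal end-to-end sketch (assuming an existing SparkSession named `spark` and the df above; collect() is fine here because the null rows are few, but avoid it on large data):

    ```scala
    import spark.implicits._
    import org.apache.spark.sql.functions.col

    // Log one message per null-weight row, on the driver
    df.filter(col("weight").isNull)
      .select("brand")
      .as[String]
      .collect()
      .foreach(b => println(s"Weight for brand $b is null, dropping it from the data"))

    // Keep only the rows with a non-null weight
    val cleaned = df.filter(col("weight").isNotNull)
    ```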