python, pandas, apache-spark, spark-streaming, aws-glue

How to filter out null values in Spark (Python)


I'm trying to filter the non-null values in a column and check whether the count is greater than zero.

    badRows = df.filter($"_corrupt_record".isNotNull)
    if badRows.count > 0:
        logger.error("throwing bad rows exception...")
        schema_mismatch_exception(None, "cdc", item)

I'm getting a syntax error. I also tried checking with:

    badRows = df.filter(col("_corrupt_record").isNotNull)
    badRows = df.filter(None, col("_corrupt_record"))
    badRows = df.filter("_corrupt_record isNotnull")

What is the correct way to filter rows where the _corrupt_record column contains data?


Solution

  • Try, for example:

    import pyspark.sql.functions as F
    ...
    df.where(F.col("colname").isNotNull()) 
    ...
    

    As you note, most of the variants you tried are not valid PySpark syntax: $"..." is Scala column syntax, filter(None, ...) passes an invalid argument, and in Python, count is a method that must be called as count().
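
    Putting that together for the original bad-rows check, here is a minimal, self-contained sketch. It assumes a local SparkSession and a hypothetical DataFrame with a _corrupt_record column; the toy data and the ValueError are stand-ins for the question's own logger and schema_mismatch_exception helper.

        import pyspark.sql.functions as F
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Hypothetical data: the second row has content in _corrupt_record.
        df = spark.createDataFrame(
            [("a", None), ("b", "{malformed json}")],
            ["id", "_corrupt_record"],
        )

        # Keep only the rows where _corrupt_record is populated.
        badRows = df.where(F.col("_corrupt_record").isNotNull())

        # count() is a method in PySpark, so it needs parentheses.
        if badRows.count() > 0:
            # Stand-in for logger.error(...) / schema_mismatch_exception(...)
            raise ValueError("bad rows detected in _corrupt_record")

    One caveat: if df comes straight from spark.read.json or spark.read.csv, recent Spark versions disallow queries that reference only the internal _corrupt_record column; caching the DataFrame first (df.cache()) is the usual workaround.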