I'm trying to filter out the null values in a column and check whether the resulting count is greater than 0.
badRows = df.filter($"_corrupt_record".isNotNull)
if badRows.count > 0:
    logger.error("throwing bad rows exception...")
    schema_mismatch_exception(None, "cdc", item)
I'm getting a syntax error. I also tried:
badRows = df.filter(col("_corrupt_record").isNotNull)
badRows = df.filter(None, col("_corrupt_record"))
badRows = df.filter("_corrupt_record isNotnull")
What is the correct way to filter the rows where the _corrupt_record column contains data?
Try, e.g.
import pyspark.sql.functions as F
...
df.where(F.col("colname").isNotNull())
...
Many of the variants you tried are not valid PySpark syntax, as you note: $"_corrupt_record" is Scala's column syntax, not Python's; count is a method in PySpark, so it needs parentheses (badRows.count()); filter(None, ...) passes an invalid condition; and the SQL-expression string form would be "_corrupt_record IS NOT NULL", since isNotnull is not a SQL operator.
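Putting that together, here is a minimal corrected sketch of your snippet, assuming df, logger, schema_mismatch_exception, and item are defined as in your code:

import pyspark.sql.functions as F

# Keep only the rows where _corrupt_record actually holds data.
badRows = df.filter(F.col("_corrupt_record").isNotNull())

# count() is a method in PySpark (unlike the Scala field), so it needs ().
if badRows.count() > 0:
    logger.error("throwing bad rows exception...")
    schema_mismatch_exception(None, "cdc", item)

If you prefer expression strings, df.filter("_corrupt_record IS NOT NULL") is equivalent.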