apache-spark pyspark apache-spark-sql

How to detect null column in pyspark


I have a dataframe that contains some null values. Some columns consist entirely of nulls.

>> df.show()
+---+---+---+----+
|  A|  B|  C|   D|
+---+---+---+----+
|1.0|4.0|7.0|null|
|2.0|5.0|7.0|null|
|3.0|6.0|5.0|null|
+---+---+---+----+

In my case, I want to return a list of the column names that are filled entirely with null values. My idea was to detect the constant columns (since the whole column contains the same null value).

This is how I did it:

from pyspark.sql.functions import min, max

nullColumns = [c for c, const in df.select([(min(c) == max(c)).alias(c) for c in df.columns]).first().asDict().items() if const]

but this does not treat null columns as constant; it only works for non-null values. How should I do it then?


Solution

  • Extend the condition to

    from pyspark.sql.functions import min, max
    
    ((min(c).isNull() & max(c).isNull()) | (min(c) == max(c))).alias(c) 
    

    or use eqNullSafe (PySpark 2.3+):

    (min(c).eqNullSafe(max(c))).alias(c)