Tags: apache-spark, pyspark, apache-spark-sql, null

Incomprehensible result of a comparison between a string and null value in PySpark


Can anyone explain the behavior of a comparison between a string and a null value?

DataFrame:

import pyspark.sql.functions as F

df = spark.createDataFrame([
    ["text_1", "text_1"],
    ["text_1", "text_2"],
    ["text_1", None]
]).toDF("col1", "col2")


df1 = df.withColumn("equal", F.when(F.col("col1") == F.col("col2"), "equal").otherwise("not equal"))
   
+-------+-------+----------+
|col1   |col2   |equal     |
+-------+-------+----------+
|text_1 |text_1 |equal     |
|text_1 |text_2 |not equal |
|text_1 |null   |not equal |  <*
+-------+-------+----------+


df2 = df.withColumn("equal", F.when(F.col("col1") != F.col("col2"), "not equal").otherwise("equal"))

+-------+-------+----------+
|col1   |col2   |equal     |
+-------+-------+----------+
|text_1 |text_1 |equal     |
|text_1 |text_2 |not equal |
|text_1 |null   |equal     |   <*
+-------+-------+----------+

The equality comparison behaves as expected, but the inequality comparison gives the wrong result for the null row.

Can anyone explain this, and how can I solve it without checking .isNotNull or filling the null values with an empty string (if possible)?


Solution

  • The reason you got equal for the comparison with null is that text_1 != null evaluates to null, which the when statement treats as false, so the otherwise branch produced the unexpected equal.

    You can use eqNullSafe, which returns False instead of null when one of the columns is null. If you want to compare inequality, use the negation ~ of eqNullSafe.

    import pyspark.sql.functions as F
    
    df3 = df.withColumn("equal",
        F.when(~F.col("col1").eqNullSafe(F.col("col2")), "not equal")
         .otherwise("equal")
    )
    
    df3.show()
    +------+------+---------+
    |  col1|  col2|    equal|
    +------+------+---------+
    |text_1|text_1|    equal|
    |text_1|text_2|not equal|
    |text_1|  null|not equal|
    +------+------+---------+
    
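    To see the three-valued logic directly, you can select the raw comparison result instead of wrapping it in when: the != comparison yields false, true, and null for the three rows, and when treats that null the same as false. A minimal sketch (assuming a local SparkSession; the data matches the question's DataFrame):

    ```python
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    df = spark.createDataFrame(
        [["text_1", "text_1"], ["text_1", "text_2"], ["text_1", None]],
        ["col1", "col2"],
    )

    # col1 != col2 yields null (not False) when col2 is null;
    # when() then falls through to otherwise(), just like a False condition.
    df.select(
        "col1", "col2",
        (F.col("col1") != F.col("col2")).alias("neq"),
    ).show()
    ```

    Note that eqNullSafe also treats two nulls as equal to each other, which the plain == comparison does not.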

    If you want to fill the null values with an empty string, you can use coalesce:

    import pyspark.sql.functions as F
    
    df4 = df.withColumn("equal",
        F.when(F.col("col1") != F.coalesce(F.col("col2"), F.lit("")), "not equal")
         .otherwise("equal")
    )
    
    df4.show()
    +------+------+---------+
    |  col1|  col2|    equal|
    +------+------+---------+
    |text_1|text_1|    equal|
    |text_1|text_2|not equal|
    |text_1|  null|not equal|
    +------+------+---------+
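    One caveat with the coalesce approach: if col1 can itself contain empty strings, a null in col2 becomes indistinguishable from an empty string in col1. A hypothetical edge case illustrating this (assuming a local SparkSession; the single-row DataFrame is made up for the demonstration):

    ```python
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # An empty string in col1 compares "equal" to a null in col2
    # once the null is coalesced to "". eqNullSafe would report
    # these as not equal.
    edge = spark.createDataFrame([["", None]], "col1 string, col2 string")

    edge.withColumn(
        "equal",
        F.when(F.col("col1") != F.coalesce(F.col("col2"), F.lit("")), "not equal")
         .otherwise("equal"),
    ).show()
    ```

    If that conflation matters for your data, prefer the eqNullSafe version.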