Can anyone explain to me the behavior of a comparison between a string and a null value.
DataFrame:
import pyspark.sql.functions as F
df = spark.createDataFrame([
["text_1,","text_1"],
["text_1,","text_2"],
["text_1,",None]
]).toDF("col1", "col2")
df1 = df.withColumn("equal", F.when(F.col("col1") == F.col("col2"), "equal").otherwise("not equal")
+-------+-------+----------+
|col1 |col2 |equal |
+-------+-------+----------+
|text_1 |text_1 |equal |
|text_1 |text_2 |not equal |
|text_1 |null |not equal | <*
+-------+-------+----------+
df2 = df.withColumn("equal", F.when(F.col("col1") != F.col("col2"), "equal").otherwise("not equal")
+-------+-------+----------+
|col1 |col2 |equal |
+-------+-------+----------+
|text_1 |text_1 |equal |
|text_1 |text_2 |not equal |
|text_1 |null |equal | <*
+-------+-------+----------+
The is equal
comparison seems to go well but the is not equal
goes wrong.
Can anyone explain this to me and how can I solve this without checking on .isNotNull
or fill the null
values with an empty string (if possible).
The reason why you got equal
for comparison with null is because text1 != null
gives null, which is interpreted as false by the when
statement, so you got the unexpected equal
from the otherwise
statement.
You can use eqNullSafe
, which returns False instead of null when one of the column is null. If you want to compare inequality, use the negation ~
of eqNullSafe
.
import pyspark.sql.functions as F
df3 = df.withColumn("equal",
F.when(~F.col("col1").eqNullSafe(F.col("col2")), "not equal")
.otherwise("equal")
)
df3.show()
+------+------+---------+
| col1| col2| equal|
+------+------+---------+
|text_1|text_1| equal|
|text_1|text_2|not equal|
|text_1| null|not equal|
+------+------+---------+
If you want to fill the null values with an empty string, you can use coalesce
:
import pyspark.sql.functions as F
df4 = df.withColumn("equal",
F.when(F.col("col1") != F.coalesce(F.col("col2"), F.lit("")), "not equal")
.otherwise("equal")
)
df4.show()
+------+------+---------+
| col1| col2| equal|
+------+------+---------+
|text_1|text_1| equal|
|text_1|text_2|not equal|
|text_1| null|not equal|
+------+------+---------+