
Does PySpark support the short-circuit evaluation of conditional statements?


I want to create a new boolean column in my dataframe that derives its value from the evaluation of two conditional statements on other columns in the same dataframe:

columns = ["id", "color_one", "color_two"]
data = spark.createDataFrame([(1, "blue", "red"), (2, "red", None)]).toDF(*columns)

data = data.withColumn('is_red', data.color_one.contains("red") | data.color_two.contains("red"))

This works fine unless either color_one or color_two is NULL in a row. In cases like these, is_red is also set to NULL for that row instead of true or false:

+-------+----------+------------+-------+
|id     |color_one |color_two   |is_red |
+-------+----------+------------+-------+
|      1|      blue|         red|   true|
|      2|       red|        NULL|   NULL|
+-------+----------+------------+-------+

This means that PySpark evaluates every clause of the conditional rather than exiting early (via short-circuit evaluation) once the first condition is already true (as in row 2 of my example above).

Does PySpark support the short-circuit evaluation of conditional statements?

In the meantime, here is a workaround I have come up with to null-check each column:

from pyspark.sql import functions as F

color_one_is_null = data.color_one.isNull()
color_two_is_null = data.color_two.isNull()
# The both-NULL check must come first; otherwise that case falls into an
# earlier branch and still evaluates to NULL instead of False.
data = data.withColumn('is_red', F.when(color_one_is_null & color_two_is_null, F.lit(False))
                                  .when(color_two_is_null, data.color_one.contains("red"))
                                  .when(color_one_is_null, data.color_two.contains("red"))
                                  .otherwise(data.color_one.contains("red") | data.color_two.contains("red"))
                      )

Solution

  • I don't think Spark supports short-circuit evaluation of conditionals, as stated here https://docs.databricks.com/spark/latest/spark-sql/udf-python.html#:~:text=Spark%20SQL%20(including,short-circuiting%E2%80%9D%20semantics.:

    Spark SQL (including SQL and the DataFrame and Dataset API) does not guarantee the order of evaluation of subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. For example, logical AND and OR expressions do not have left-to-right “short-circuiting” semantics.

    An alternative is to build an array from color_one and color_two, then test whether the array contains 'red' using the SQL EXISTS higher-order function:

    data = data.withColumn('is_red', F.expr("EXISTS(array(color_one, color_two), x -> x = 'red')"))
    data.show()
    +---+---------+---------+------+
    | id|color_one|color_two|is_red|
    +---+---------+---------+------+
    |  1|     blue|      red|  true|
    |  2|      red|     null|  true|
    |  3|     null|    green| false|
    |  4|   yellow|     null| false|
    |  5|     null|      red|  true|
    |  6|     null|     null| false|
    +---+---------+---------+------+
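    Since Spark 3.1 the same check can also be written with the DataFrame-side F.exists function instead of a SQL expression string (a sketch, assuming Spark >= 3.1 and the question's two-row dataframe):

    ```python
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[1]").appName("is_red").getOrCreate()

    columns = ["id", "color_one", "color_two"]
    data = spark.createDataFrame([(1, "blue", "red"), (2, "red", None)], columns)

    # F.exists applies the lambda to each element of the array column
    data = data.withColumn(
        "is_red",
        F.exists(F.array("color_one", "color_two"), lambda x: x == "red"),
    )
    ```

    This keeps the predicate as Python-level Column expressions, so it survives refactoring and column renames better than an embedded SQL string.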