Tags: python, apache-spark, pyspark, user-defined-functions, azure-databricks

When to use a UDF versus a function in PySpark?


I'm using Spark with Databricks and have the following code:

from pyspark.sql.functions import col, when

def replaceBlanksWithNulls(column):
    return when(col(column) != "", col(column)).otherwise(None)

Both of the following statements work:

x = rawSmallDf.withColumn("z", replaceBlanksWithNulls("z"))

and using a UDF:

replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))

It is unclear to me from the documentation when I should use one over the other, and why.


Solution

  • A UDF can essentially be any sort of function (there are exceptions, of course) - it is not necessary to use Spark constructs such as when, col, etc. Using a UDF, the replaceBlanksWithNulls function can be written as normal Python code:

    def replaceBlanksWithNulls(s):
        # Return the value unchanged, or None when it is an empty string
        return s if s != "" else None
    

    which can be used on a dataframe column after registering it:

    replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
    y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))
    

    Note: The default return type of a UDF is StringType. If another return type is required, it must be specified when registering the UDF, e.g.

    from pyspark.sql.types import LongType
    squared_udf = udf(squared, LongType())
    
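    For instance, a minimal, self-contained sketch (the SparkSession setup, the squared helper and the toy data below are assumed purely for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()

    def squared(n):
        return n * n

    # Register the UDF with an explicit LongType return type
    squared_udf = udf(squared, LongType())

    # Apply it to a small example DataFrame
    df = spark.createDataFrame([(1,), (2,), (3,)], ["n"])
    df.withColumn("n_squared", squared_udf("n")).show()
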

    In this case, the column operation is not complex and there are built-in Spark functions that can achieve the same thing (i.e. replaceBlanksWithNulls as in the question):

    x = rawSmallDf.withColumn("z", when(col("z") != "", col("z")).otherwise(None))
    

    This is always preferred whenever possible, since it allows Spark to optimize the query; see e.g. Spark functions vs UDF performance?
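
    To see the difference, you can compare the physical plans of the two versions (using the DataFrame from the question); the built-in expression is handled entirely by Catalyst, while the Python UDF typically shows up as a separate BatchEvalPython step that ships rows out to a Python worker and limits optimization:

    # Built-in column functions: Catalyst can optimize this expression directly
    rawSmallDf.withColumn("z", when(col("z") != "", col("z")).otherwise(None)).explain()

    # Python UDF: the plan usually contains a BatchEvalPython node, i.e. rows are
    # serialized between the JVM and a Python process to evaluate the function
    rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z")).explain()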