Tags: python, apache-spark, pyspark, user-defined-functions, azure-databricks

When to use a UDF versus a function in PySpark?


I'm using Spark with Databricks and have the following code:

from pyspark.sql.functions import col, when

def replaceBlanksWithNulls(column):
    return when(col(column) != "", col(column)).otherwise(None)

Both of the following statements work:

x = rawSmallDf.withColumn("z", replaceBlanksWithNulls("z"))

and using a UDF:

replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))

It is unclear to me from the documentation when I should use one over the other, and why.


Solution

  • A UDF can essentially be any sort of function (there are exceptions, of course) - it is not necessary to use Spark constructs such as when, col, etc. Using a UDF, the replaceBlanksWithNulls function can be written as normal Python code:

    def replaceBlanksWithNulls(s):
        # Return the value unchanged, or None when it is an empty string
        return s if s != "" else None
    

    which can be used on a dataframe column after registering it:

    replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
    y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))
    

    Note: The default return type of a UDF is StringType. If another return type is required, it must be specified when registering the UDF, e.g.

    from pyspark.sql.types import LongType
    squared_udf = udf(squared, LongType())
    
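    For instance, a minimal, self-contained sketch (the SparkSession setup, the squared helper and the toy data below are assumed purely for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()

    def squared(n):
        return n * n

    # Register the UDF with an explicit LongType return type
    squared_udf = udf(squared, LongType())

    # Apply it to a small example DataFrame
    df = spark.createDataFrame([(1,), (2,), (3,)], ["n"])
    df.withColumn("n_squared", squared_udf("n")).show()
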

    In this case, the column operation is not complex and there are built-in Spark functions that can achieve the same thing (i.e. replaceBlanksWithNulls as in the question):

    x = rawSmallDf.withColumn("z", when(col("z") != "", col("z")).otherwise(None))
    

    This is always preferred whenever possible, since it allows Spark to optimize the query; see e.g. Spark functions vs UDF performance?
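
    To see the difference, you can compare the physical plans of the two versions (using the DataFrame from the question); the built-in expression is handled entirely by Catalyst, while the Python UDF typically shows up as a separate BatchEvalPython step that ships rows out to a Python worker and limits optimization:

    # Built-in column functions: Catalyst can optimize this expression directly
    rawSmallDf.withColumn("z", when(col("z") != "", col("z")).otherwise(None)).explain()

    # Python UDF: the plan usually contains a BatchEvalPython node, i.e. rows are
    # serialized between the JVM and a Python process to evaluate the function
    rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z")).explain()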