Tags: apache-spark, pyspark, user-defined-functions

How is it possible that I can use a regular Python function as a UDF?


Consider the following code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType


def test(val):
    return val + 1


if __name__ == '__main__':
    ss = SparkSession.builder.appName("Test").getOrCreate()

    data = [("James", 3000),
            ("Michael", 4005),
            ("Robert", 4005)]

    schema = StructType([
        StructField("firstname", StringType(), True),
        StructField("salary", IntegerType(), True)
    ])

    df = ss.createDataFrame(data=data, schema=schema)

    df.withColumn("test", test(col("salary"))).show()

How is it possible that it correctly outputs:

+---------+------+----+
|firstname|salary|test|
+---------+------+----+
|    James|  3000|3001|
|  Michael|  4005|4006|
|   Robert|  4005|4006|
+---------+------+----+

While the test function is NOT defined as test = udf(test, FloatType()) and is not registered using spark.udf.register()?

Why, then, does every tutorial ask you to define and register a UDF? How is it possible that the above just works?


Solution

  • Close to a duplicate of when-to-use-a-udf-versus-a-function-in-pyspark, but the short answer is that your function just returns col("salary") + 1 (via Column's __add__), which is evaluated as a Spark expression.

    See here for an explanation of the magic behind __add__.

    Wrapping it in udf forces col("salary") to be evaluated, sent to your udf (which adds one to it), and returned. This will be slower for large amounts of data or calculations than using the Column interface directly; see the sketch below.
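
    To make the difference concrete, here is a minimal sketch (the test_udf name and the "UdfVsColumn" app name are illustrative, not from the original post). Calling test on a Column only builds a new Column expression through __add__, while wrapping the same function in udf ships each row's value to a Python worker and back:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType


def test(val):
    return val + 1


if __name__ == '__main__':
    ss = SparkSession.builder.appName("UdfVsColumn").getOrCreate()

    df = ss.createDataFrame([("James", 3000), ("Michael", 4005)],
                            ["firstname", "salary"])

    # Calling test on a Column runs no Python per row: Column.__add__ just
    # builds and returns a new Column expression.
    expr = test(col("salary"))
    print(expr)  # prints something like: Column<'(salary + 1)'>

    # Explicitly wrapping the same function in udf makes Spark ship every
    # salary value to a Python worker, call test, and ship the result back,
    # which is slower on large data than the pure Column expression above.
    test_udf = udf(test, LongType())

    df.withColumn("test", expr) \
      .withColumn("test_via_udf", test_udf(col("salary"))) \
      .show()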