dataframe, apache-spark, pyspark, user-defined-functions, azure-databricks

Exceeding the limit of Apache Spark UDFs [UDF_MAX_COUNT_EXCEEDED]


I'm working on a data project in Azure Databricks and need to define a UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def ends_with_one(value, bit_position):
    # Return 1 if the character at bit_position is '1', otherwise 0.
    if bit_position + len(value) < 0:
        return 0
    else:
        return int(value[bit_position] == '1')

spark.udf.register("ends_with_one", ends_with_one)

But somehow, instead of being registered once, the UDF gets registered every time I call it:

df = df.withColumn('Ends_With_One', ends_with_one(col('Column_To_Check'), lit(-1)))

And after a few function calls I get the following error message:

[UDF_MAX_COUNT_EXCEEDED] Exceeded query-wide UDF limit of 5 UDFs (limited during public preview). Found 6. The UDFs were: `ends_with_one`,`ends_with_one`,`ends_with_one`,`ends_with_one`,`ends_with_one`,`ends_with_one`.

I thought it might have something to do with Spark's lazy evaluation, so I called

display(df)

right after the function calls, since I had written a lot of code without actually executing any of it. But that didn't solve anything. I also tried

df.withColumn('Ends_With_One', ends_with_one(col('Column_To_Check'), lit(-1))).rdd.count()

to force execution, but I still got the same error message.


Solution

  • Apparently Spark can invoke a UDF more than once per call, especially when you reference a column that was itself generated or modified by a UDF; the number of UDF invocations then keeps growing. Thanks to this thread (Spark UDF called more than once per record when DF has too many columns), I found a workaround. Since Spark 2.3 there is a method for marking a UDF as non-deterministic.

    val myUdf = udf(...).asNondeterministic()
    

    This prevents the optimizer from duplicating the call, so the UDF is only evaluated once. A PySpark version of the fix is sketched below.
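
    In PySpark the same method exists on the object returned by udf(), so it can be applied directly to the UDF from the question. A minimal sketch, assuming df and Column_To_Check are the DataFrame and string column from the question; the only real change is the asNondeterministic() call:

    from pyspark.sql.functions import udf, col, lit
    from pyspark.sql.types import IntegerType

    def ends_with_one(value, bit_position):
        # Return 1 if the character at bit_position is '1', otherwise 0.
        if bit_position + len(value) < 0:
            return 0
        return int(value[bit_position] == '1')

    # asNondeterministic() (Spark 2.3+) tells the optimizer not to duplicate the call.
    ends_with_one_udf = udf(ends_with_one, IntegerType()).asNondeterministic()

    df = df.withColumn('Ends_With_One', ends_with_one_udf(col('Column_To_Check'), lit(-1)))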