I'm working on a data project in Azure Databricks and need to define a UDF:
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def ends_with_one(value, bit_position):
    # Guard against negative indices that fall outside the string
    if bit_position + len(value) < 0:
        return 0
    else:
        return int(value[bit_position] == '1')

spark.udf.register("ends_with_one", ends_with_one)
But somehow, instead of being registered once, the UDF gets registered every time I call it:
df = df.withColumn('Ends_With_One', ends_with_one(col('Column_To_Check'), lit(-1)))
And after a few function calls I get the following error message:
[UDF_MAX_COUNT_EXCEEDED] Exceeded query-wide UDF limit of 5 UDFs (limited during public preview). Found 6. The UDFs were: `ends_with_one`,`ends_with_one`,`ends_with_one`,`ends_with_one`,`ends_with_one`,`ends_with_one`.
I thought maybe it could have something to do with Spark's lazy evaluation, so I called
display(df)
right after the function calls, because I had accumulated a lot of code without actually executing any of it. But this didn't solve anything. I also tried
df = df.withColumn('Ends_With_One', ends_with_one(col('Column_To_Check'), lit(-1))).rdd.count()
to force an execution, but I still got the same error message.
So apparently Spark sometimes calls a UDF more than once per invocation. In particular, when you refer to a column that was itself generated or edited by a UDF, the number of UDF calls keeps growing. Thanks to this thread (Spark UDF called more than once per record when DF has too many columns), I found a workaround: since Spark 2.3 there is a method for marking a UDF as non-deterministic.
val myUdf = udf(...).asNondeterministic()
Marking the UDF as non-deterministic prevents the optimizer from duplicating and re-evaluating it, so it is only called once.
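That snippet is Scala, taken from the linked thread; PySpark exposes the same asNondeterministic() method on UDF objects since Spark 2.3. Here is a minimal sketch of the workaround applied to the UDF from the question (the helper name ends_with_one_impl is mine; everything else comes from above):

from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import IntegerType

# Plain Python function, without the @udf decorator
# (ends_with_one_impl is a hypothetical name used for this sketch)
def ends_with_one_impl(value, bit_position):
    if bit_position + len(value) < 0:
        return 0
    return int(value[bit_position] == '1')

# Wrap the function and mark it as non-deterministic (Spark 2.3+)
ends_with_one = udf(ends_with_one_impl, IntegerType()).asNondeterministic()

df = df.withColumn('Ends_With_One', ends_with_one(col('Column_To_Check'), lit(-1)))

Note that spark.udf.register is only needed if you want to call the UDF from SQL; for the DataFrame API, the wrapped function alone is enough.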