I am using PySpark (inside a Jupyter Notebook that connects to a Spark cluster) together with some UDFs. One UDF takes a list as an additional parameter, and I construct the UDF like this:
my_udf = F.udf(partial(my_normal_fn, list_param=list), StringType())
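For context, the full setup looks roughly like this (a minimal sketch; my_list, the DataFrame, and the column name are placeholders for my actual values):

from functools import partial
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("x",)], ["some_column"])

my_list = ["a", "b", "c"]  # defined in a notebook cell

def my_normal_fn(x, list_param=[]):
    # placeholder body; the real function does some work with x and list_param
    return str(x in list_param)

my_udf = F.udf(partial(my_normal_fn, list_param=my_list), StringType())
df = df.withColumn("result", my_udf(df["some_column"]))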
Everything works fine as far as executing the function goes. But I noticed that the UDF never gets updated.
To clarify: when I update the list, for example by altering an element in it, the UDF is not updated. The old version with the old list is still used, even if I execute the whole notebook again.
I have to restart the Jupyter kernel in order to use the new version of the list, which is really annoying...
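Continuing the sketch above, the failure mode looks like this:

my_list[0] = "z"  # alter an element of the list
# re-run the cell that builds the UDF:
my_udf = F.udf(partial(my_normal_fn, list_param=my_list), StringType())
df.withColumn("result", my_udf(df["some_column"])).show()
# the output still reflects the old contents of my_list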
Any thoughts?
I found the solution.
My my_normal_fn had the following signature:
def my_normal_fn(x, list_param=[]):
    # do some stuff with x and list_param
    ...
Changing it to
def my_normal_fn(x, list_param):
    # do some stuff with x and list_param
    ...
did the trick. See here for more information.
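For anyone who runs into the same thing, here is a minimal sketch of the working pattern (names are placeholders again): no default value on the function, with the list bound explicitly via partial, and the UDF re-created after the list changes:

from functools import partial
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def my_normal_fn(x, list_param):
    # no default value; the list is always bound explicitly via partial
    return str(x in list_param)

my_list = ["a", "b"]
my_udf = F.udf(partial(my_normal_fn, list_param=my_list), StringType())

# after updating the list, re-creating the UDF (e.g. by re-running the cell)
# now actually binds the new value
my_list.append("c")
my_udf = F.udf(partial(my_normal_fn, list_param=my_list), StringType())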
Thanks to user Drjones78 from the SparkML Slack channel.