Tags: apache-spark, pyspark, udf

UDF reload in PySpark


I am using PySpark (inside a Jupyter notebook that connects to a Spark cluster) together with some UDFs. One UDF takes a list as an additional parameter, and I construct it like this:

my_udf = F.udf(partial(my_normal_fn, list_param=list), StringType())
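
For reference, the relevant imports look roughly like this (the rest of the notebook setup is omitted):

from functools import partial

from pyspark.sql import functions as F
from pyspark.sql.types import StringType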

Everything works fine as far as executing the function goes. But I noticed that the UDF is never updated: when I change the list, for example by altering an element in it, the UDF keeps using the old version of the list. This happens even if I re-execute the whole notebook. I have to restart the Jupyter kernel to pick up the new version of the list, which is really annoying.

Any thoughts?


Solution

  • I found the solution.

My my_normal_fn had the following signature:

    def my_normal_fn(x, list_param=[]):
        # do some stuff with x and list_param
        ...
    

    Changing it to

    def my_normal_fn(x, list_param):
        # do some stuff with x and list_param
        ...
    

    did the trick. See here for more information. (A minimal end-to-end sketch of the fixed pattern follows below.)

    Thanks to user Drjones78 of the SparkML Slack channel.
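
For readers who want a runnable reference, here is a minimal sketch of the fixed pattern. The SparkSession, the sample data, the column name value, and the body of my_normal_fn are all made up for illustration; the point is that list_param has no default value and the UDF is re-created after the list changes.

from functools import partial

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",)], ["value"])

def my_normal_fn(x, list_param):
    # Placeholder body: report whether x appears in the list.
    return "hit" if x in list_param else "miss"

list_param = ["a"]
my_udf = F.udf(partial(my_normal_fn, list_param=list_param), StringType())
df.withColumn("flag", my_udf("value")).show()

# After changing the list, rebuild the UDF so the new list is picked up.
list_param = ["b"]
my_udf = F.udf(partial(my_normal_fn, list_param=list_param), StringType())
df.withColumn("flag", my_udf("value")).show()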