python, apache-spark, pyspark, apache-spark-sql, databricks

What is the behaviour/internal working of a normal Python function in Spark/Databricks?


Hope you are all doing well.

Once again I request your help in understanding a very small concept that continues to confuse me.

Say I have a Databricks notebook with a few cells. In cell 1, I have a small Python function:

from datetime import datetime
from dateutil import tz

def getCurTsZn(z):
    # Look up the tzinfo for the given zone name and return the
    # current timestamp in that zone.
    zone = tz.gettz(z)   # avoid shadowing the imported tz module
    ts = datetime.now(zone)
    return ts

Now this function is called in subsequent cells, say within normal Python/PySpark code. The following are some of the questions I have.

  1. Is this function a user-defined function (UDF)? Or are only functions written in the format mentioned here considered UDFs in Spark/Databricks?
  2. How does this function work internally? When it is invoked from Python code in subsequent cells, I have read something about the data going to the code and causing a performance issue?
  3. I read that UDFs are black boxes, which causes the optimizer to restrict its optimizations to before and after the UDF. Does even such a simple function (not registered) also act like a black box and hinder optimizations?
  4. I understand that the function needs to be registered for it to be used within spark.sql("SELECT getCurTsZn('some zone')"), but if that is not needed, does registering actually make a difference? I have already looked at the linked post, but I am not asking about Spark-level optimized functions, just a simple function like the one mentioned above.
  5. Where do vectorized Python UDFs make an appearance? I understand that a vectorized Python UDF works on multiple rows instead of a single row. So does creating a function, passing a DataFrame to it, and acting on it make it a vectorized function?

I hope to get a better understanding of the basics with your kind help. Thank you, cheers...

Solution

  • Until you declare/register this function as a UDF, it will be executed as a normal Python function only on the driver node (see the first sketch at the end of this answer). Databricks attaches a separate Python REPL to each notebook, and each cell is evaluated in the context of that REPL.

    Only when you declare/register the function as a UDF will it start to execute on the executor nodes as well (see the second sketch below). As you mentioned, there are "normal" and vectorized UDFs. Both types have the same problem: the data to be processed has to be transferred from JVM memory to the Python interpreter's memory. Vectorized UDFs perform better primarily because they transfer data in batches via Apache Arrow, which avoids much of the per-row serialization overhead, but some overhead still exists (see blog posts 1, 2). And both UDF types are still considered black boxes, because Spark doesn't know what happens inside them and can't apply its optimizations to that code.
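    Here is a minimal sketch of the first point, assuming a Databricks notebook where a SparkSession named spark is already available; the zone and column names are made up for illustration. Called directly, getCurTsZn runs once in the driver's Python REPL, and Spark only ever sees the resulting value:

    from pyspark.sql.functions import lit

    # getCurTsZn is evaluated here, on the driver, exactly once.
    cur_ts = getCurTsZn("Asia/Kolkata")

    # The already-computed value is embedded as a constant column;
    # the executors never run any of the Python inside getCurTsZn.
    df = spark.range(3).withColumn("cur_ts", lit(cur_ts))
    df.show(truncate=False)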
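    And a sketch of the second point, under the same assumptions (the UDF names and sample zones are invented for illustration). The same logic is wrapped once as a row-at-a-time UDF and once as a vectorized (pandas) UDF, both of which run on the executors:

    from datetime import datetime
    from dateutil import tz
    import pandas as pd
    from pyspark.sql.functions import udf, pandas_udf
    from pyspark.sql.types import TimestampType

    # Row-at-a-time UDF: invoked once per row; every value crosses the
    # JVM <-> Python boundary individually.
    @udf(TimestampType())
    def getCurTsZnUdf(z):
        return datetime.now(tz.gettz(z))

    # Vectorized (pandas) UDF: invoked once per Arrow batch with a whole
    # pandas Series, which reduces (but does not remove) the transfer
    # and serialization overhead.
    @pandas_udf(TimestampType())
    def getCurTsZnVec(zones: pd.Series) -> pd.Series:
        return pd.to_datetime(zones.map(lambda z: datetime.now(tz.gettz(z))), utc=True)

    df = spark.createDataFrame([("UTC",), ("Asia/Kolkata",)], ["zone"])
    df.select(getCurTsZnUdf("zone").alias("udf_ts"),
              getCurTsZnVec("zone").alias("vec_ts")).show(truncate=False)

    # Registration is only needed if you want to call the function from SQL:
    spark.udf.register("getCurTsZn", getCurTsZnUdf)
    spark.sql("SELECT getCurTsZn('Asia/Kolkata') AS cur_ts").show(truncate=False)

    Registering by itself does not change how the function behaves when called from Python code; it just makes the UDF visible to the SQL parser under a name.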