Hope you are all doing well.
Once again I'm requesting your help in understanding a very small concept that continues to confuse me.
Say I have a Databricks notebook with a few cells. In cell 1, I have a small Python function:
from datetime import datetime
from dateutil import tz

def getCurTsZn(z):
    # use a distinct local name: assigning to "tz" inside the function
    # would shadow the imported module and raise UnboundLocalError
    zone = tz.gettz(z)
    ts = datetime.now(zone)
    return ts
Now this function is called in subsequent cells, within normal Python/PySpark code. The following are some of the questions I have. Do I need to register it as a UDF to be able to call it from SQL, like this?

spark.sql("SELECT getCurTsZn('some zone')")
And if registering is not needed, does it actually make a difference? I already looked at the linked post, but I am not asking about Spark-level optimized functions, just a simple function like the one mentioned above.

Until you declare/register this function as a UDF, it will be executed as a normal Python function, only on the driver node. Databricks attaches a separate Python REPL to each notebook, and each cell is evaluated in the context of that REPL.
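For illustration, here is a minimal sketch of how registration would look; it assumes a Databricks/PySpark 3.x session where spark is already defined, and the zone name is just an example:

from pyspark.sql.types import TimestampType

# register the plain Python function under a SQL-visible name
spark.udf.register("getCurTsZn", getCurTsZn, TimestampType())

# note the single quotes inside the SQL string
spark.sql("SELECT getCurTsZn('America/New_York') AS ts").show(truncate=False)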
Only when you declare/register the function as a UDF will it start to execute on the executor nodes as well. As you mentioned, there are "normal" and vectorized UDFs. Both types share the same problem: the data to be processed has to be transferred between JVM memory and the Python interpreter's memory. Vectorized UDFs have better performance primarily because they avoid unnecessary serialization by exchanging data in batches via Apache Arrow, but the overhead still exists (see blog posts 1 and 2). And both UDF types are still treated as black boxes, because Spark doesn't know what happens inside them and can't apply its usual optimizations to that code.
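To make the difference concrete, here is a minimal sketch contrasting the two flavors; it assumes Spark 3.x with PyArrow available (both standard on Databricks), and the function names plus_one and plus_one_vec are hypothetical, used only for illustration:

import pandas as pd
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import LongType

# "normal" UDF: invoked once per row, each value serialized
# between the JVM and the Python worker individually
@udf(returnType=LongType())
def plus_one(v):
    return v + 1

# vectorized (pandas) UDF: invoked once per batch, data moves
# as whole Arrow columns, which cuts the per-row overhead
@pandas_udf(LongType())
def plus_one_vec(v: pd.Series) -> pd.Series:
    return v + 1

df = spark.range(5)
df.select(plus_one("id").alias("plain"), plus_one_vec("id").alias("vectorized")).show()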