pyspark apache-spark-sql databricks azure-databricks databricks-sql

Databricks - [UNRESOLVED_ROUTINE] Cannot resolve function `md5` on search path

I am getting an error in Azure Databricks: a built-in function cannot be found.

[UNRESOLVED_ROUTINE] Cannot resolve function md5 on search path [system.builtin, system.session, spark_catalog.default]

It happens only intermittently in a workflow; after a rerun, everything works. Could it be caused by the fact that I have many tasks that use the same notebook with different parameters? Do you know how to resolve it? Would an init script at the job cluster level help, or adding libraries at the task level?

Solution

This error indicates that the md5 function could not be resolved on the search path listed in the message. In PySpark, md5 is available in the pyspark.sql.functions module, so import it before using it in DataFrame code. I tried the example below:

from pyspark.sql.functions import md5
df_md5 = df.select("color", md5("color").alias("md5_hash"))
df_md5.show()

+-----+--------------------+
|color|            md5_hash|
+-----+--------------------+
|    E|3a3ea00cfc35332ce...|
|    E|3a3ea00cfc35332ce...|
|    E|3a3ea00cfc35332ce...|
|    I|dd7536794b63bf90e...|
|    J|ff44570aca8241914...|
|    J|ff44570aca8241914...|

The code above imports md5 and applies it to the color column of the DataFrame.

In Spark SQL, md5 is normally available as a built-in function, which is why this error is unexpected. If it still cannot be resolved in your environment, one workaround is to register your own function under that name with spark.udf.register.

I tried the example below:

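import hashlib
from pyspark.sql.types import StringType

# Register a Python UDF named "md5" that returns the hex digest as a string.
# Note: this lambda raises on NULL values, because None has no encode() method.
spark.udf.register("md5", lambda x: hashlib.md5(x.encode('utf-8')).hexdigest(), StringType())
# Expose the DataFrame to SQL and call the registered function.
df.createOrReplaceTempView("table1")
result = spark.sql("SELECT color, md5(color) AS md5_hash FROM table1")
result.show()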

Results:

+------+--------------------+
| color|            md5_hash|
+------+--------------------+
|   red|bda9643ac6601722a...|
|  blue|48d6215903dff5623...|
| green|9f27410725ab8cc88...|
|yellow|d487dd0b55dfcacdd...|
+------+--------------------+

The code imports the hashlib module to calculate the MD5 hash and the StringType class from pyspark.sql.types to declare the return type.

It registers a temporary SQL function called "md5" that takes a string input and returns the MD5 hash as a hex string. It then creates a temporary view of the DataFrame df and uses the registered function in a SQL query.
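
If the column can contain NULLs, the lambda above will fail, so a null-safe variant may be worth considering. The following is only a minimal sketch: md5_hex, register_md5_udf, and the function name "md5_py" are hypothetical names I made up for illustration (a distinct name is used here so the function is not confused with the built-in md5), and it assumes you pass in your active SparkSession.

```python
import hashlib
from typing import Optional


def md5_hex(value: Optional[str]) -> Optional[str]:
    """Return the hex MD5 digest of a string, or None for a NULL input."""
    if value is None:
        return None
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def register_md5_udf(spark, name: str = "md5_py") -> None:
    """Register md5_hex as a session-scoped SQL function (hypothetical helper)."""
    # Imported lazily so md5_hex can be tested without a Spark installation.
    from pyspark.sql.types import StringType

    spark.udf.register(name, md5_hex, StringType())
```

After calling register_md5_udf(spark), a query such as SELECT color, md5_py(color) AS md5_hash FROM table1 should return the same hashes as above for non-NULL values and NULL otherwise.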