I am encountering an error when attempting to use dbutils.notebook.run() that I don't encounter when using the %run command in what to my eyes is an equivalent fashion. I'm hoping I am just missing something, but I can't for the life of me see what it might be.
I have a Databricks "utility" notebook (configure-storage) that configures a connection to an Azure Data Lake Storage Gen2 (ADLS) account. It takes several parameters, some of which are Key Vault secret names that are used to retrieve the actual secret values for configuring the storage connection:
# Notebook parameters
dbutils.widgets.text("storage_account","")
dbutils.widgets.text("tenant_id","")
dbutils.widgets.text("client_id","")
dbutils.widgets.text("client_secret","")
# Set storage account and get secrets from Key Vault
storage_account = dbutils.widgets.get("storage_account")
tenant_id = dbutils.secrets.get(scope="key-vault",key=dbutils.widgets.get("tenant_id"))
client_id = dbutils.secrets.get(scope="key-vault",key=dbutils.widgets.get("client_id"))
client_secret = dbutils.secrets.get(scope="key-vault",key=dbutils.widgets.get("client_secret"))
# Azure Data Lake Storage auth
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", f"{client_id}")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
For illustration/troubleshooting, in the calling notebook I am just performing a simple read of a Delta table:
file_location = "abfss://<storage-container>@<storage-account>.dfs.core.windows.net/<path-to-delta-table>"
df = spark.read.format("delta").load(file_location)
display(df)
If in the calling notebook I use the %run command as follows, the above interaction with the ADLS account works just fine:
%run "../util/configure-storage" $storage_account="storage-account-name" $tenant_id="tenant-id-secret-name" $client_id="client-id-secret-name" $client_secret="client-secret-secret-name"
However, if I use dbutils.notebook.run() as follows...
dbutils.notebook.run(
    "../util/configure-storage", 60,
    {"storage_account": "storage-account-name",
     "tenant_id": "tenant-id-secret-name",
     "client_id": "client-id-secret-name",
     "client_secret": "client-secret-secret-name"})
...then the above interaction with the ADLS account results in the following error:
Py4JJavaError: An error occurred while calling o1442.load.
: Failure to initialize configuration for storage account <storage-account>.dfs.core.windows.net: Invalid configuration value detected for fs.azure.account.keyInvalid configuration value detected for fs.azure.account.key
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:52)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:666)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:2055)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:267)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:225)
at com.databricks.common.filesystem.LokiFileSystem$.$anonfun$getLokiFS$1(LokiFileSystem.scala:63)
at com.databricks.common.filesystem.Cache.getOrCompute(Cache.scala:38)
at com.databricks.common.filesystem.LokiFileSystem$.getLokiFS(LokiFileSystem.scala:60)
at com.databricks.common.filesystem.LokiFileSystem.initialize(LokiFileSystem.scala:86)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:537)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at com.databricks.sql.transaction.tahoe.DeltaValidation$.validateDeltaRead(DeltaValidation.scala:102)
at org.apache.spark.sql.DataFrameReader.preprocessDeltaLoading(DataFrameReader.scala:280)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:329)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)
Caused by: Invalid configuration value detected for fs.azure.account.key
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.diagnostics.ConfigurationBasicValidator.validate(ConfigurationBasicValidator.java:49)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.diagnostics.Base64StringConfigurationBasicValidator.validate(Base64StringConfigurationBasicValidator.java:40)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.validateStorageAccountKey(SimpleKeyProvider.java:71)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:49)
I can certainly use %run, but I am really perplexed as to why the behavior is different when using dbutils.notebook.run(), and would like to understand what I might be missing.
The main difference between %run and dbutils.notebook.run() is that the former is like #include in C/C++: it includes all definitions from the referenced notebook into the current execution context, so they are available to your calling notebook. The latter executes the given notebook as a separate job, and changes made there aren't propagated back to the current execution context. That is exactly what bites you here: the spark.conf.set calls in configure-storage take effect only inside the child job's context, so when the calling notebook reads the abfss:// path, its own session has no OAuth settings, and the ABFS driver falls back to its default shared-key authentication and fails looking for fs.azure.account.key.
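A minimal sketch of this isolation (the notebook name set-conf and the config key are invented for illustration):

# Child notebook "set-conf" contains a single line:
#     spark.conf.set("my.test.flag", "set-by-child")

# Caller, cell 1 - run the child as a separate job:
dbutils.notebook.run("./set-conf", 60)
print(spark.conf.get("my.test.flag", "not set"))  # prints "not set"

# Caller, cell 2 - a %run magic must sit alone in its own cell:
#     %run ./set-conf
# Caller, cell 3:
#     print(spark.conf.get("my.test.flag", "not set"))  # now prints "set-by-child"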
P.S. This is actually described in the documentation.
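If you want to keep the storage setup in a shared utility notebook without running it as a separate job, one pattern (a sketch, assuming you are free to restructure configure-storage; the function name configure_storage is invented) is to have the utility define a function, then %run it and call the function from the caller so the configuration lands in the caller's context:

# In ../util/configure-storage: define a function instead of executing at include time
def configure_storage(storage_account, tenant_id_key, client_id_key,
                      client_secret_key, scope="key-vault"):
    # Resolve the actual secret values from the Key Vault-backed scope
    tenant_id = dbutils.secrets.get(scope=scope, key=tenant_id_key)
    client_id = dbutils.secrets.get(scope=scope, key=client_id_key)
    client_secret = dbutils.secrets.get(scope=scope, key=client_secret_key)
    suffix = f"{storage_account}.dfs.core.windows.net"
    # Same OAuth settings as in the question, applied in whichever context calls this
    spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", client_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", client_secret)
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
                   f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Caller, cell 1 (pulls the definition into the current context):
#     %run "../util/configure-storage"
# Caller, cell 2:
#     configure_storage("storage-account-name", "tenant-id-secret-name",
#                       "client-id-secret-name", "client-secret-secret-name")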