I mounted my Azure Storage account with dbutils in Python, using the Azure service principal method described on this page: https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)
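Once the mount exists, I can read the data through the DBFS path like a normal directory. A minimal sketch of how I use it (the mount name and file path are placeholders):

```python
# List the contents of the mounted container (placeholder mount name).
display(dbutils.fs.ls("/mnt/<mount-name>"))

# Read a file through the mount point; "path/to/data.csv" is a placeholder.
df = spark.read.csv("/mnt/<mount-name>/path/to/data.csv", header=True)
df.show()
```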
but I also saw that there is an option to connect through Spark configuration to the Azure Blob Filesystem (ABFS) driver, as described on this page: https://learn.microsoft.com/en-us/azure/databricks/external-data/azure-storage
service_credential = dbutils.secrets.get(scope="<scope>", key="<service-credential-key>")
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
I couldn't find any information about the difference between the two. In which use cases is it better to use one or the other? And is one method faster than the other for reading the data stored in the Azure Storage account?
Thanks a lot in advance!
When you mount your storage account, you make it accessible to everyone who has access to your Databricks workspace. But when you use spark.conf.set to connect to your storage account, access is limited to those who have access to that cluster.
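You can see this workspace-wide visibility for yourself: any user in the workspace can enumerate all mount points with dbutils.fs.mounts(). A short sketch:

```python
# Every mount point is visible to all users of the workspace.
for mount in dbutils.fs.mounts():
    print(mount.mountPoint, "->", mount.source)
```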
As highlighted in the same Microsoft documentation, Access Azure Data Lake Storage Gen2 and Blob Storage, mounting is among the deprecated ways of accessing storage accounts and is no longer recommended. Therefore, choose between mounting and setting the Spark configuration based on your requirements, taking security into consideration.
If you do want to use mounting, you can try setting up the mount point with credential passthrough, as sketched below.
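For reference, a credential-passthrough mount looks roughly like this (following the pattern in the Microsoft docs; the container, account, and mount names are placeholders, and the cluster must have Azure AD credential passthrough enabled):

```python
# Mount ADLS Gen2 using Azure AD credential passthrough. Requires a
# cluster with credential passthrough enabled; names are placeholders.
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class":
        spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName"),
}

dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)
```

With this setup, each user's reads through the mount are authorized with their own Azure AD identity rather than a shared service principal.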
> Is one method faster than the other to get information from the stored data in the Azure Storage Account?

No. Both methods read the data through the same ABFS driver, so performance is effectively the same. A mount is merely a pointer stored at the workspace level, which is also why it is accessible to all users; spark.conf.set only changes where the credentials live, not how the data is read.