Tags: apache-spark, azure-databricks, azure-data-lake-gen2

Setting data lake connection in cluster Spark Config for Azure Databricks


I'm trying to simplify notebook creation for developers/data scientists in my Azure Databricks workspace, which connects to an Azure Data Lake Gen2 account. Right now, every notebook has this at the top:

    %scala
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.auth.type.<datalake.dfs.core.windows.net", "OAuth")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net",  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net", <SP client id>)
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net", dbutils.secrets.get(<SP client secret>))
    spark.sparkContext.hadoopConfiguration.set("fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant>/oauth2/token")

Our implementation is trying to avoid mounting in DBFS, so I've been trying to see whether I can use the Spark config on a cluster to define these values instead (each cluster can access a different data lake).

However, I haven't been able to get that to work yet. When I try various flavors of this:

    org.apache.hadoop.fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net <sp client id>
    org.apache.hadoop.fs.azure.account.auth.type.<datalake>.dfs.core.windows.net OAuth
    org.apache.hadoop.fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
    org.apache.hadoop.fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net {{secrets/secret/secret}}
    org.apache.hadoop.fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net "https://login.microsoftonline.com/<tenant>"

I get "Failure to initialize configuration" The version above looks like it's defaulting to try to use the storage account access key instead of the SP credentials (this is with just testing with a simple ls command to make sure it works).

    ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
    : Failure to initialize configuration
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:412)
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1016)

I'm hoping there's a way around this, although if the only answer is "you can't do it this way," that's an acceptable answer, of course.


Solution

  • To connect to Azure Data Lake Storage Gen2, put the authentication settings in the cluster's Spark config with the spark.hadoop. prefix. Spark strips this prefix and passes the remainder through to the Hadoop configuration, which is why the org.apache.hadoop.-prefixed keys above fail:

    spark.hadoop.fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net <sp client id>
    spark.hadoop.fs.azure.account.auth.type.<datalake>.dfs.core.windows.net OAuth
    spark.hadoop.fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    spark.hadoop.fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net {{secrets/<scope>/<secret-name>}}
    spark.hadoop.fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net https://login.microsoftonline.com/<tenant>/oauth2/token
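
    Once the cluster is configured this way, notebooks need no per-notebook setup. A quick way to verify, assuming a hypothetical container named `data` in the account:

        %scala
        // Lists the root of the "data" container (a placeholder name) to confirm
        // the service-principal credentials from the cluster Spark config are picked up.
        display(dbutils.fs.ls("abfss://data@<datalake>.dfs.core.windows.net/"))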
    

    For more details, please refer to the Azure Databricks documentation on accessing Azure Data Lake Storage Gen2.