Tags: azure, azure-databricks, databricks-unity-catalog

Unity Catalog - Access External location with a Service Principal


I am trying to access Azure Data Lake Storage Gen2 with a Service Principal via Unity Catalog.

  • a Managed Identity is assigned the Contributor role on the storage account
  • the Managed Identity is added as a Storage Credential
  • the storage container is added as an external location with this credential
  • the Service Principal is granted All Privileges on the external location (see the sketch after this list)
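
For reference, the last two setup steps roughly correspond to the following Unity Catalog SQL, issued here through spark.sql; the credential name, location name, path, and principal ID are placeholders:

# Sketch of the Unity Catalog setup with placeholder names (run on a UC-enabled cluster).
# Map the storage credential (backed by the managed identity) to the container path.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS my_external_location
    URL 'abfss://my-container@mystorageaccount.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL my_storage_credential)
""")

# Grant the service principal (referenced by its application ID) privileges on the location.
spark.sql("""
    GRANT ALL PRIVILEGES ON EXTERNAL LOCATION my_external_location
    TO `12345678-aaaa-bbbb-cccc-1234567890ab`
""")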

In PySpark I set the Spark configuration according to the Azure Data Lake Storage Gen2 documentation:

from pyspark.sql.types import StringType

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# create and write dataframe
df = spark.createDataFrame(["10","11","13"], StringType()).toDF("values")
df.write \
  .format("delta") \
  .mode("overwrite") \
  .save(f"abfss://{container}@{storage_account}.dfs.core.windows.net/example/example-0")

Unfortunately, this returns an unexpected error:

Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD, https://{storage-account}.dfs.core.windows.net/{container-name}/example/example-0?upn=false&action=getStatus&timeout=90


Solution

  • When you use Unity Catalog you don't need these properties. They were required before Unity Catalog and are no longer used, or are used only on clusters without UC for direct data access:

    spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
    

    Authentication to the given storage location happens by mapping the storage credential to the external location path.
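
    With the storage credential and external location in place, the write from the question can be reduced to this (a minimal sketch, assuming a UC-enabled cluster and the privileges described below):

    from pyspark.sql.types import StringType

    # No fs.azure.* OAuth settings needed: Unity Catalog resolves the abfss:// path
    # against the external location and authenticates with its storage credential.
    df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("values")
    df.write \
      .format("delta") \
      .mode("overwrite") \
      .save(f"abfss://{container}@{storage_account}.dfs.core.windows.net/example/example-0")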

    But permissions are checked for the user or service principal that runs the given piece of code, so that user/principal must have the corresponding permissions on the external location. If you run this code as a job that runs as the service principal, it will have access; but if you run it as yourself, it won't work until you are granted those permissions.
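
    For example, the privileges on the external location could be granted and checked like this (a sketch; the location name and principal are placeholders):

    # Grant the principal that actually runs the code the privileges it needs;
    # READ FILES / WRITE FILES are enough for path-based reads and writes.
    spark.sql("""
        GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION my_external_location
        TO `user@example.com`
    """)

    # Verify what has been granted on the location.
    display(spark.sql("SHOW GRANTS ON EXTERNAL LOCATION my_external_location"))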