Tags: pyspark, databricks, azure-data-lake, azure-databricks

Connect to ADLS with Spark API in Databricks


I am trying to establish a connection to ADLS using the Spark API. I am really new to this. I read in the documentation that you can establish the connection with the following code:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

I can see in the Azure Portal / Azure Storage Explorer that I have Read/Write/Execute permission on the ADLS folder that I need, but I don't know where to find application-id, scope-name, and key-name-for-service-credential.


Solution

  • There are two ways of accessing Azure Data Lake Storage Gen1:

    1. Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0.
    2. Use a service principal directly.

    Prerequisites:

You need to create a service principal and grant it permissions.

    Create an Azure AD application and service principal that can access resources.

    Note the following properties:

    application-id: An ID that uniquely identifies the client application.

    directory-id: An ID that uniquely identifies the Azure AD instance.

    service-credential: A string that the application uses to prove its identity. Store it as a secret in a Databricks secret scope; the scope-name and key-name-for-service-credential in your code are the names of that scope and secret, which you create yourself (for example with the Databricks CLI: databricks secrets create-scope and databricks secrets put).

    Register the service principal, granting the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account.
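    The prerequisite steps above can be sketched with the Azure CLI. This is a hypothetical outline, not the only way to do it: the app name, subscription, resource group, and account name are placeholders you must fill in, and it assumes you have already run az login.

    ```shell
    # 1. Create the Azure AD application + service principal.
    #    In the output, appId is the application-id, password is the
    #    service-credential, and tenant is the directory-id.
    az ad sp create-for-rbac --name <app-name>

    # 2. Grant the service principal a role (e.g. Contributor) on the
    #    ADLS Gen1 account.
    az role assignment create \
      --assignee <application-id> \
      --role Contributor \
      --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DataLakeStore/accounts/<adls-account-name>"
    ```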

    Method 1: Mount an Azure Data Lake Storage Gen1 resource or folder

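    A sketch of the mount, assuming you run it in a Databricks notebook (where spark and dbutils are predefined) and fill in the <angle-bracket> placeholders:

    ```python
    # OAuth 2.0 configuration for the service principal.
    configs = {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": "<application-id>",
        "fs.adl.oauth2.credential": dbutils.secrets.get(
            scope="<scope-name>", key="<key-name-for-service-credential>"),
        "fs.adl.oauth2.refresh.url":
            "https://login.microsoftonline.com/<directory-id>/oauth2/token",
    }

    # Mount the ADLS Gen1 folder under DBFS.
    dbutils.fs.mount(
        source="adl://<storage-account-name>.azuredatalakestore.net/<directory-name>",
        mount_point="/mnt/<mount-name>",
        extra_configs=configs,
    )

    # After mounting, read through the mount point like any DBFS path.
    df = spark.read.parquet("/mnt/<mount-name>/<path-to-data>")
    ```

    The mount is cluster-wide and persists until you call dbutils.fs.unmount("/mnt/<mount-name>").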

    Method 2: Access directly with Spark APIs using a service principal and OAuth 2.0

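    A sketch of direct access, again assuming a Databricks notebook. Here the service credential is passed as a plain string, which is less secure than using a secret scope (see Method 3):

    ```python
    # Set the OAuth credentials on the Spark session; no mount needed.
    spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
    spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
    spark.conf.set("fs.adl.oauth2.credential", "<service-credential>")
    spark.conf.set("fs.adl.oauth2.refresh.url",
                   "https://login.microsoftonline.com/<directory-id>/oauth2/token")

    # Read directly with an adl:// path.
    df = spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/<path>")
    ```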

    Method 3: Access directly with Spark APIs using a service principal and OAuth 2.0, where dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves the service principal's credential that you stored as a secret in a secret scope, instead of hard-coding it in the notebook.

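    This is the variant your own snippet corresponds to; a sketch, assuming a Databricks notebook and an existing secret scope:

    ```python
    # Same direct access as Method 2, but the credential comes from a
    # Databricks secret scope rather than appearing in the notebook.
    spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
    spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
    spark.conf.set(
        "fs.adl.oauth2.credential",
        dbutils.secrets.get(scope="<scope-name>",
                            key="<key-name-for-service-credential>"),
    )
    spark.conf.set(
        "fs.adl.oauth2.refresh.url",
        "https://login.microsoftonline.com/<directory-id>/oauth2/token",
    )

    df = spark.read.json("adl://<storage-account-name>.azuredatalakestore.net/<path>")
    ```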

    Reference: Databricks - Azure Data Lake Storage Gen1.