Tags: pyspark, databricks, azure-data-lake, azure-databricks

Connect to ADLS with Spark API in Databricks


I am trying to establish a connection to ADLS using the Spark API. I am really new to this. I read in the documentation that you can establish the connection with the following code:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

I can see in the Azure Portal / Azure Storage Explorer that I have Read/Write/Execute permission on the ADLS folder that I need, but I don't know where to find application-id, scope-name, and key-name-for-service-credential.


Solution

  • There are two ways of accessing Azure Data Lake Storage Gen1:

    1. Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0.
    2. Use a service principal directly.

    Prerequisites:

You need to create a service principal and grant it permissions.

    Create an Azure AD application and service principal that can access resources.

    Note the following properties:

    application-id: An ID that uniquely identifies the client application.

    directory-id: An ID that uniquely identifies the Azure AD instance.

    service-credential: A string that the application uses to prove its identity. Store it as a secret in a Databricks secret scope; the scope-name and key-name-for-service-credential in your code are the names of that scope and secret, which you create yourself (for example with the Databricks CLI: databricks secrets create-scope and databricks secrets put).

    Register the service principal, granting the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account.
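    The prerequisite steps above can be sketched with the Azure CLI. This is a hypothetical outline, not the only way to do it: the app name, subscription, resource group, and account name are placeholders you must fill in, and it assumes you have already run az login.

    ```shell
    # 1. Create the Azure AD application + service principal.
    #    In the output, appId is the application-id, password is the
    #    service-credential, and tenant is the directory-id.
    az ad sp create-for-rbac --name <app-name>

    # 2. Grant the service principal a role (e.g. Contributor) on the
    #    ADLS Gen1 account.
    az role assignment create \
      --assignee <application-id> \
      --role Contributor \
      --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DataLakeStore/accounts/<adls-account-name>"
    ```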

    Method 1: Mount an Azure Data Lake Storage Gen1 resource or folder

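    A sketch of the mount, assuming you run it in a Databricks notebook (where spark and dbutils are predefined) and fill in the <angle-bracket> placeholders:

    ```python
    # OAuth 2.0 configuration for the service principal.
    configs = {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": "<application-id>",
        "fs.adl.oauth2.credential": dbutils.secrets.get(
            scope="<scope-name>", key="<key-name-for-service-credential>"),
        "fs.adl.oauth2.refresh.url":
            "https://login.microsoftonline.com/<directory-id>/oauth2/token",
    }

    # Mount the ADLS Gen1 folder under DBFS.
    dbutils.fs.mount(
        source="adl://<storage-account-name>.azuredatalakestore.net/<directory-name>",
        mount_point="/mnt/<mount-name>",
        extra_configs=configs,
    )

    # After mounting, read through the mount point like any DBFS path.
    df = spark.read.parquet("/mnt/<mount-name>/<path-to-data>")
    ```

    The mount is cluster-wide and persists until you call dbutils.fs.unmount("/mnt/<mount-name>").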

    Method 2: Access directly with Spark APIs using a service principal and OAuth 2.0

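    A sketch of direct access, again assuming a Databricks notebook. Here the service credential is passed as a plain string, which is less secure than using a secret scope (see Method 3):

    ```python
    # Set the OAuth credentials on the Spark session; no mount needed.
    spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
    spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
    spark.conf.set("fs.adl.oauth2.credential", "<service-credential>")
    spark.conf.set("fs.adl.oauth2.refresh.url",
                   "https://login.microsoftonline.com/<directory-id>/oauth2/token")

    # Read directly with an adl:// path.
    df = spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/<path>")
    ```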

    Method 3: Access directly with Spark APIs using a service principal and OAuth 2.0, where dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves the service principal's credential that you stored as a secret in a secret scope, instead of hard-coding it in the notebook.

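    This is the variant your own snippet corresponds to; a sketch, assuming a Databricks notebook and an existing secret scope:

    ```python
    # Same direct access as Method 2, but the credential comes from a
    # Databricks secret scope rather than appearing in the notebook.
    spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
    spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
    spark.conf.set(
        "fs.adl.oauth2.credential",
        dbutils.secrets.get(scope="<scope-name>",
                            key="<key-name-for-service-credential>"),
    )
    spark.conf.set(
        "fs.adl.oauth2.refresh.url",
        "https://login.microsoftonline.com/<directory-id>/oauth2/token",
    )

    df = spark.read.json("adl://<storage-account-name>.azuredatalakestore.net/<path>")
    ```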

    Reference: Databricks - Azure Data Lake Storage Gen1.