I am trying to establish a connection to Azure Data Lake Storage (ADLS) using the Spark API. I am really new to this. I read the documentation, which says you can establish the connection with the following code:
spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
I can see in the Azure Portal / Azure Storage Explorer that I have Read/Write/Execute permission on the ADLS folder that I need, but I don't know where to find application-id, scope-name, and key-name-for-service-credential.
There are three ways of accessing Azure Data Lake Storage Gen1:
Prerequisites:
You need to create a service principal and grant it permissions:
Create an Azure AD application and service principal that can access resources.
Note the following properties:
application-id: An ID that uniquely identifies the client application.
directory-id: An ID that uniquely identifies the Azure AD instance.
service-credential: A string that the application uses to prove its identity.
Register the service principal and grant it the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account.
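To answer where scope-name and key-name-for-service-credential come from: they are not found anywhere in Azure — you choose them yourself when you store the service credential in a Databricks secret scope. A sketch, assuming the legacy Databricks CLI is installed and configured; my-adls-scope and adls-credential are example names, not required values:

```shell
# Create a secret scope to hold the service principal's credential
# (the scope name is your choice; this becomes <scope-name>).
databricks secrets create-scope --scope my-adls-scope

# Store the service credential under a key of your choice
# (the key name becomes <key-name-for-service-credential>).
databricks secrets put --scope my-adls-scope --key adls-credential \
  --string-value "<service-credential-from-azure-portal>"

# Verify the secret is registered (secret values are never displayed).
databricks secrets list --scope my-adls-scope
```

application-id and directory-id, by contrast, do come from the Azure Portal: Azure Active Directory > App registrations > your app > Overview, shown as "Application (client) ID" and "Directory (tenant) ID".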
Method 1: Mount an Azure Data Lake Storage Gen1 resource or folder
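A sketch of the mount approach, following the pattern in the Databricks documentation. This only runs inside a Databricks notebook, where spark and dbutils are predefined; the angle-bracket placeholders are the values described in the prerequisites, and the mount point name is arbitrary:

```python
# OAuth configuration for the service principal; the credential is read
# from the secret scope instead of being hard-coded in the notebook.
configs = {
    "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
    "fs.adl.oauth2.client.id": "<application-id>",
    "fs.adl.oauth2.credential": dbutils.secrets.get(
        scope="<scope-name>", key="<key-name-for-service-credential>"),
    "fs.adl.oauth2.refresh.url":
        "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}

# Mount the ADLS Gen1 folder under DBFS; afterwards it can be read
# like any DBFS path, without re-supplying credentials.
dbutils.fs.mount(
    source="adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)

# Example read through the mount point.
df = spark.read.csv("/mnt/<mount-name>/<file-name>.csv")
```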
Method 2: Access directly with Spark APIs using a service principal and OAuth 2.0
Method 3: Same as Method 2, but with dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieving the service credential that you stored as a secret in a secret scope, instead of hard-coding it.
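Putting Methods 2 and 3 together, a sketch of direct access (again Databricks-notebook-only, with spark and dbutils predefined; all angle-bracket placeholders are the values from the prerequisites):

```python
# Configure OAuth 2.0 client-credential access on the current Spark session.
spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set(
    "fs.adl.oauth2.credential",
    dbutils.secrets.get(scope="<scope-name>",
                        key="<key-name-for-service-credential>"))
spark.conf.set(
    "fs.adl.oauth2.refresh.url",
    "https://login.microsoftonline.com/<directory-id>/oauth2/token")

# Read directly via the adl:// URI; no mount is needed with this method.
df = spark.read.csv(
    "adl://<storage-resource>.azuredatalakestore.net/<directory-name>/<file-name>.csv")
```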
Reference: Databricks - Azure Data Lake Storage Gen1.