azure-data-lake, azure-hdinsight, azure-data-lake-gen2

How can I attach both Azure Data Lake Gen1 and Gen2 to a single Spark HDInsight cluster?


I want to create an Azure HDI cluster that can access both ADLSg1 and ADLSg2 data lakes. Is this supported?


Solution

  • This is possible for Spark 2.4 (HDI 4.0) with restrictions:

    1. The cluster's primary storage must be Azure Blob
    2. The cluster will authenticate with ADLSg1 and ADLSg2 through service principals + client secrets. The service principals and secrets must be managed manually.
    3. This can't be done via the Azure Portal. You must change the cluster's core-site.xml configurations manually either via the Ambari UI or via ssh.

    Steps:

    1. Find your Azure AD tenant ID
    2. Register an application with Azure AD and create a service principal for each one of the ADLS accounts. Take note of the application ID(s).
    3. Create a new application secret for each one of the AAD applications created in step 2. Take note of the client secret(s) generated.
    4. Grant Owner role to the service principal associated with the ADLSg1 account.
    5. Grant Storage Blob Data Owner role to the service principal associated with the ADLSg2 account.
    6. Deploy the HDI cluster with Azure Blob as primary storage only.
    7. Open the Ambari UI for the HDI cluster.
    8. Navigate to HDFS → Configs tab → Advanced tab.
    9. Expand the "Custom core-site" section
    10. Add the following settings:

    For ADLS Gen 1:

    fs.adl.oauth2.access.token.provider.type = ClientCredential
    fs.adl.oauth2.client.id = <ADLSg1 Application ID>
    fs.adl.oauth2.credential = <ADLSg1 Client Secret>
    fs.adl.oauth2.refresh.url = https://login.microsoftonline.com/<Tenant ID>/oauth2/token
    

    For ADLS Gen 2:

    fs.azure.account.auth.type.<ADLSg2 storage account name>.dfs.core.windows.net = OAuth
    fs.azure.account.oauth.provider.type.<ADLSg2 storage account name>.dfs.core.windows.net = org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    fs.azure.account.oauth2.client.id.<ADLSg2 storage account name>.dfs.core.windows.net = <ADLSg2 Application ID>
    fs.azure.account.oauth2.client.secret.<ADLSg2 storage account name>.dfs.core.windows.net = <ADLSg2 Client Secret>
    fs.azure.account.oauth2.client.endpoint.<ADLSg2 storage account name>.dfs.core.windows.net = https://login.microsoftonline.com/<Tenant ID>/oauth2/token
    
    11. Save the changes.
    12. Restart all affected services in the cluster (HDFS, YARN, etc.).
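Because the Gen1 and Gen2 settings use different key prefixes and different service principals, it is easy to paste a secret under the wrong key. As a rough sketch (not part of HDInsight or Ambari), the custom core-site entries listed in the steps above could be generated from the values you collected. The function name `adls_core_site_settings` and its parameter names are hypothetical; only the property names and the token URL format come from the steps above.

```python
def adls_core_site_settings(tenant_id, g1_client_id, g1_secret,
                            g2_account, g2_client_id, g2_secret):
    """Return the custom core-site properties for one ADLSg1 and one ADLSg2 account.

    Hypothetical helper: it only assembles the key/value pairs listed in the
    steps above; it does not talk to Ambari or Azure.
    """
    # Both generations use the same Azure AD token endpoint for the tenant.
    token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    # ADLSg2 keys are scoped per storage account host name.
    g2_host = f"{g2_account}.dfs.core.windows.net"

    return {
        # ADLS Gen1 (adl:// scheme)
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": g1_client_id,
        "fs.adl.oauth2.credential": g1_secret,
        "fs.adl.oauth2.refresh.url": token_url,
        # ADLS Gen2 (abfs:// scheme)
        f"fs.azure.account.auth.type.{g2_host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{g2_host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{g2_host}": g2_client_id,
        f"fs.azure.account.oauth2.client.secret.{g2_host}": g2_secret,
        f"fs.azure.account.oauth2.client.endpoint.{g2_host}": token_url,
    }


if __name__ == "__main__":
    # Placeholder values only; substitute your own tenant, IDs, and secrets.
    settings = adls_core_site_settings(
        tenant_id="<Tenant ID>",
        g1_client_id="<ADLSg1 Application ID>",
        g1_secret="<ADLSg1 Client Secret>",
        g2_account="<ADLSg2 storage account name>",
        g2_client_id="<ADLSg2 Application ID>",
        g2_secret="<ADLSg2 Client Secret>",
    )
    for key, value in settings.items():
        print(f"{key} = {value}")
```

The printed `key = value` lines can then be copied into the "Custom core-site" section one property at a time.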

    To access files from the cluster:

    Use the fully qualified name. With this approach, you provide the full path to the file that you want to access.

    • ADLSg1: adl://<data_lake_account>.azuredatalakestore.net/<cluster_root_path>/<file_path>
    • ADLSg2: abfs://<containername>@<accountname>.dfs.core.windows.net/<file_path>
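To illustrate the two URI formats, here is a minimal sketch of helpers that build the fully qualified paths. The function names `adl_uri` and `abfs_uri` and the example account, container, and file names are hypothetical; the PySpark calls in the trailing comments assume a `spark` session on the cluster and are not executed here.

```python
def adl_uri(account, path):
    """Build a fully qualified ADLSg1 URI for the given account and path."""
    # Strip a leading slash so the result has exactly one separator after the host.
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


def abfs_uri(container, account, path):
    """Build a fully qualified ADLSg2 URI for the given container, account, and path."""
    return f"abfs://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(adl_uri("mygen1lake", "/raw/events.csv"))
    print(abfs_uri("mycontainer", "mygen2account", "curated/events.parquet"))

    # On the cluster, with a SparkSession available as `spark`, the same
    # strings could be passed to the usual readers, for example:
    #   df1 = spark.read.csv(adl_uri("mygen1lake", "/raw/events.csv"))
    #   df2 = spark.read.parquet(abfs_uri("mycontainer", "mygen2account", "curated/events.parquet"))
```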