Tags: azure-devops, databricks, azure-databricks, azure-pipelines-yaml, databricks-workflows

How to pass a storage account key to a Databricks parameter from a DevOps pipeline


I am trying to pass a storage account key to a Databricks parameter from an Azure DevOps pipeline.

trigger:
- development

pool: SharedAKS

jobs:
- job: AzureCLI
  steps:
  - checkout: self
  - task: AzureCLI@2
    inputs:
      azureSubscription: $(azureSubscription)
      addSpnToEnvironment: true
      scriptType: 'pscore'
      scriptLocation: 'inlineScript'
      inlineScript: |
        # Databricks CLI
        # install databricks-cli
        $rg= ""
        $storageAccountName=""
        $resourceGroup=$(az group list --query "[?contains(name, '$(rg)')].name | [0]" --output tsv)
        $accountKey=$(az storage account keys list --resource-group $rg --account-name $storageAccountName --query "[0].value" --output tsv)
        $env:STORAGE_ACCOUNT_KEY = $accountKey
        echo "Storage Account Key: $accountKey"
        python -m pip install --upgrade pip setuptools wheel databricks-cli
        
        $wsId=(az resource show --resource-type Microsoft.Databricks/workspaces -g $(rg) -n $(databricksName) --query id -o tsv)
        $workspaceUrl=(az resource show --resource-type Microsoft.Databricks/workspaces -g $(rg) -n $(databricksName) --query properties.workspaceUrl --output tsv)

        $workspaceUrlPost='https://'
        $workspaceUrlPost+=$workspaceUrl
        $workspaceUrlPost+='/api/2.0/token/create'
        echo "Https Url with Post: $workspaceUrlPost"

        $workspaceUrlHttps='https://'
        $workspaceUrlHttps+=$workspaceUrl
        $workspaceUrlHttps+='/'
        echo "Https Url : $workspaceUrlHttps"

        # token response for the Azure Databricks app
        $token=(az account get-access-token --resource $(AZURE_DATABRICKS_APP_ID) --query "accessToken" --output tsv)
        echo "Token retrieved: $token"

        # Get a token for the Azure management API
        $azToken=(az account get-access-token --resource https://management.core.windows.net/ --query "accessToken" --output tsv)

        # Create a PAT token valid for 6000 seconds (100 minutes). Note the per-user quota limit of 600 tokens.
        $pat_token_response=(curl --insecure -X POST ${workspaceUrlPost} `
          -H "Authorization: Bearer $token" `
          -H "X-Databricks-Azure-SP-Management-Token:$azToken" `
          -H "X-Databricks-Azure-Workspace-Resource-Id:$wsId" `
          -d '{"lifetime_seconds": 6000,"comment": "this is an example token"}')
        
        echo "Token retriev: $token" 
        echo "DATABRICKS_TKN: $pat_token_response"
          
        # Print PAT token
        $dapiToken=($pat_token_response | ConvertFrom-Json).token_value
        #dapiToken=$(echo $pat_token_response | jq -r .token_value)
        echo "DATABRICKS_TOKEN: $dapiToken"
        $DATABRICKSTKN = $dapiToken
        echo "##vso[task.setvariable variable=DATABRICKSTKN]$DATABRICKSTKN"

  - script: |
      echo "$(DATABRICKSTKN)"
      echo "Starting Databricks notebook upload..."
      # Install Databricks CLI
      pip install databricks-cli
      echo "DATABRICKS_TOKEN: $(DATABRICKSTKN)"

      # Authenticate with Databricks using the PAT
      echo "Authenticating with Databricks..."
      echo "DATABRICKS_TOKEN: $dapiToken"
      databricks configure --token <<EOF
      https://adb-82.14.azuredatabricks.net
      $(DATABRICKSTKN)
      EOF
      
    displayName: 'Upload Databricks Notebooks Job'
  
  - task: Bash@3
    displayName: 'Schedule Databricks Notebook'
    inputs:
      targetType: 'inline'
      script: | 
        databricksUrl='https://adb-8.14.azuredatabricks.net/api/2.0'
        
        notebookPath1='/Users/user/notebook'

        jobName2='testjob'

        requestUriRun="$databricksUrl/jobs/runs/submit"

        body2='{
          "name": "'$jobName2'",
          "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 0
          },
          "notebook_task": {
            "notebook_path": "'$notebookPath1'",
            "base_parameters": {
              "env": {"STORAGE_ACCOUNT_KEY": "'$STORAGE_ACCOUNT_KEY'"}
            }
          }
        }'


        curl -X POST -H "Authorization: Bearer $(DATABRICKSTKN)" -H "Content-Type: application/json" -d "$body2" "$requestUriRun"

I can see the following in the pipeline logs.

Storage Account Key: ***

Below is the Databricks setup. Cell 1:

%python
dbutils.widgets.text("env", "", "Environment Variable")
env = dbutils.widgets.get("env")
print("Value of 'env' parameter:", env)

Output

Value of 'env' parameter: 

Cell 2:

%python
# Databricks notebook source
storage_account_name = ""
storage_account_access_key = env
container = "raw"
mountDir = ""
dbutils.fs.mount(
    source = "wasbs://" + container + "@xxxx.blob.core.windows.net",
    mount_point = "/mnt/" + mountDir,
    extra_configs = {"fs.azure.account.key." + storage_account_name + ".blob.core.windows.net": storage_account_access_key})

Error:

shaded.databricks.org.apache.hadoop.fs.azure.AzureException: java.lang.IllegalArgumentException: Storage Key is not a valid base64 encoded string.

The key is empty in Databricks when I try to print it, and when I pass the variable to Cell 2 I get the above error. Am I passing the storage account key properly in the JSON body?

Thank you.


Solution

  • After getting the value of the storage account key using the command below:

    $accountKey=$(az storage account keys list --resource-group $rg --account-name $storageAccountName --query "[0].value" --output tsv)
    

    If you want to pass the value to subsequent pipeline tasks, you need to use the logging command 'SetVariable' to set a pipeline variable with the value. The subsequent tasks can then use the value by referencing that pipeline variable.

    1. If you want to pass the value of the storage account key to subsequent tasks within the same job, you can set a regular pipeline variable with the value. The command also automatically maps a matching environment variable for it. This variable will only be available to subsequent tasks within the same job (see the first sketch after this list).

      For example.

      $accountKey=$(az storage account keys list --resource-group $rg --account-name $storageAccountName --query "[0].value" --output tsv)
      Write-Host "##vso[task.setvariable variable=STORAGE_ACCOUNT_KEY;]$accountKey"
      
    2. If you want to pass the value of the storage account key to other jobs or stages within the same pipeline, you can set an output variable with the value (see the second sketch after this list). For more details, see "Use output variables from tasks".

      For example.

      $accountKey=$(az storage account keys list --resource-group $rg --account-name $storageAccountName --query "[0].value" --output tsv)
      Write-Host "##vso[task.setvariable variable=STORAGE_ACCOUNT_KEY;isoutput=true]$accountKey"
      
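    First sketch (same job): a minimal outline, assuming the key is retrieved in an AzureCLI@2 step as in the question and only needs to reach a later step of the same job. The Bash@3 step and its display name are placeholders added here for illustration.

      steps:
      - task: AzureCLI@2
        inputs:
          azureSubscription: $(azureSubscription)
          scriptType: 'pscore'
          scriptLocation: 'inlineScript'
          inlineScript: |
            $accountKey=$(az storage account keys list --resource-group $rg --account-name $storageAccountName --query "[0].value" --output tsv)
            # Set a job-scoped pipeline variable; it is also mapped to the environment variable STORAGE_ACCOUNT_KEY
            Write-Host "##vso[task.setvariable variable=STORAGE_ACCOUNT_KEY;]$accountKey"
      - task: Bash@3
        displayName: 'Use key in same job'
        inputs:
          targetType: 'inline'
          script: |
            # Available both as the macro $(STORAGE_ACCOUNT_KEY) and as the mapped environment variable
            echo "Key length: ${#STORAGE_ACCOUNT_KEY}"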

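    Second sketch (across jobs): a minimal outline of consuming the output variable from another job. The job names GetKey and UseKey, the step reference name setKey, and the variable name storageKey are illustrative only.

      jobs:
      - job: GetKey
        steps:
        - task: AzureCLI@2
          name: setKey   # a reference name is required to address the output variable
          inputs:
            azureSubscription: $(azureSubscription)
            scriptType: 'pscore'
            scriptLocation: 'inlineScript'
            inlineScript: |
              $accountKey=$(az storage account keys list --resource-group $rg --account-name $storageAccountName --query "[0].value" --output tsv)
              Write-Host "##vso[task.setvariable variable=STORAGE_ACCOUNT_KEY;isoutput=true]$accountKey"
      - job: UseKey
        dependsOn: GetKey
        variables:
          storageKey: $[ dependencies.GetKey.outputs['setKey.STORAGE_ACCOUNT_KEY'] ]
        steps:
        - script: echo "Key length: ${#STORAGEKEY}"   # storageKey is mapped to the env var STORAGEKEY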
    In addition, the following command you are using in the AzureCLI@2 task only sets a temporary environment variable that is available in the current session of that task. After the task completes, it is discarded and is not available to subsequent tasks.

    $env:STORAGE_ACCOUNT_KEY = $accountKey
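
    Applied to the pipeline in the question: once STORAGE_ACCOUNT_KEY has been set with the logging command in the AzureCLI@2 task, the Bash@3 task can reference it with macro syntax. Below is a sketch of the question's body2 adjusted so that the env widget receives the key as a plain string. In the question's version, $STORAGE_ACCOUNT_KEY was never available in the Bash@3 task and the value was wrapped in a nested object rather than a plain string, which matches the empty widget value that was observed.

      body2='{
        "name": "'$jobName2'",
        "new_cluster": {
          "spark_version": "7.3.x-scala2.12",
          "node_type_id": "Standard_DS3_v2",
          "num_workers": 0
        },
        "notebook_task": {
          "notebook_path": "'$notebookPath1'",
          "base_parameters": {
            "env": "$(STORAGE_ACCOUNT_KEY)"
          }
        }
      }'

    With this, dbutils.widgets.get("env") in Cell 1 should return the key itself, which Cell 2 can then use as storage_account_access_key.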