Tags: azure-pipelines, databricks, azure-databricks, databricks-workflows

Azure Pipeline Step - Trigger Databricks Job


I am trying to trigger a Databricks job in a pipeline step, using the job ID passed as a variable from the previous step.

This is how I create the job id and pass it as a variable:

- script: |
    job_id=$(databricks jobs create --json '{"name": "test", "existing_cluster_id" : "'"$(db_clusterid)"'", "notebook_task ": {"notebook_path": "'"$(nbpath)"'"}}')
    echo "##vso[task.setvariable variable=db_job_id;]'"$job_id"'"
  env:
    DB_HOST: $(db_host)
    DB_TOKEN: $(db_token)
  displayName: 'Create Job'

When I echo the variable in the next step it looks as expected:

- script: |
    echo $DB_JOB_ID
  env:
    DB_JOB_ID: $(db_job_id)
    DB_HOST: $(db_host)
    DB_TOKEN: $(db_token)
  displayName: 'Echo Job ID'

Output from echo:

'{ "job_id": 123 }'

However, when I try to run the job as follows:

- script: |
    databricks jobs run-now --job-id $DB_JOB_ID
  env:
    DB_JOB_ID: $(db_job_id)
    DB_HOST: $(db_host)
    DB_TOKEN: $(db_token)
  displayName: 'Run Job'

The following error message arises:

Error: Got unexpected extra arguments ("job_id": 123}')

Instead of providing $DB_JOB_ID, I also tried "$DB_JOB_ID" and "'"$DB_JOB_ID"'", which did not work either.

What would be the correct statement?


Solution

  • Your problem is that you're passing in the whole JSON that is returned, while run-now requires only the job ID, which is a number. You can replace --job-id $DB_JOB_ID with --job-id $(echo $DB_JOB_ID | sed -e 's|^.*:[ ]*\([0-9][0-9]*\)[ ]*.*$|\1|') - it will extract just the required job ID.
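
    For example, the 'Run Job' step from the question could be rewritten like this (a minimal sketch; the db_job_id, db_host and db_token variables are the ones already defined in the question's pipeline):

    - script: |
        # Extract the numeric job ID from the stored JSON, e.g. '{ "job_id": 123 }' -> 123
        job_id=$(echo $DB_JOB_ID | sed -e 's|^.*:[ ]*\([0-9][0-9]*\)[ ]*.*$|\1|')
        databricks jobs run-now --job-id $job_id
      env:
        DB_JOB_ID: $(db_job_id)
        DB_HOST: $(db_host)
        DB_TOKEN: $(db_token)
      displayName: 'Run Job'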

    P.S. Instead of running databricks jobs create as one step and then databricks jobs run-now as another, it's better to use databricks jobs submit (or the Run Submit REST API) - it will just run the job without creating it.
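
    As a rough sketch of that approach (assuming the legacy Databricks CLI, where the one-shot command is exposed as databricks runs submit; newer CLI versions expose it as databricks jobs submit), the create and run steps could collapse into a single step, reusing the db_clusterid and nbpath variables from the question:

    - script: |
        # Submit a one-time run directly; no job definition has to be created first
        databricks runs submit --json '{"run_name": "test", "existing_cluster_id": "'"$(db_clusterid)"'", "notebook_task": {"notebook_path": "'"$(nbpath)"'"}}'
      env:
        DB_HOST: $(db_host)
        DB_TOKEN: $(db_token)
      displayName: 'Submit One-Time Run'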

    You can also look into the dbx package developed inside Databricks - it may simplify the way you schedule jobs, wait for results, etc.