apache-spark, pyspark, databricks, azure-databricks

How to call Cluster API and start cluster from within Databricks Notebook?


Currently we are using a bunch of notebooks to process our data in Azure Databricks, mainly with Python/PySpark.

What we want to achieve is to make sure that our clusters are started (warmed up) before initiating the data processing. For that reason we are exploring ways to access the Clusters API from within Databricks notebooks.

So far we tried running the following:

import subprocess
cluster_id = "XXXX-XXXXXX-XXXXXXX"
subprocess.run(
    [f'databricks clusters start --cluster-id "{cluster_id}"'], shell=True
)

which, however, returns the output below, and nothing happens afterwards; the cluster is not started.

CompletedProcess(args=['databricks clusters start --cluster-id "0824-153237-ovals313"'], returncode=127)

Is there any convenient way to call the Clusters API from within a Databricks notebook (or maybe to call a curl command), and how is this achieved?


Solution

  • Most probably the error comes from a missing or incorrectly configured Databricks CLI: return code 127 means the shell could not find the databricks command on the cluster.
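
    A quick way to confirm this from the notebook is to check whether the CLI is available on the PATH at all (a minimal check using only the Python standard library):

    import shutil

    # Prints the full path of the databricks binary, or None if the CLI is not installed
    print(shutil.which("databricks"))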

    Instead of using the command-line application, it's better to use the Start command of the Clusters REST API. This could be done with something like this:

    import requests

    # Get the workspace host name from the notebook context
    ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
    host_name = ctx.tags().get("browserHostName").get()
    host_token = "your_PAT_token"  # personal access token with permission to manage the cluster
    cluster_id = "some_id"  # put your cluster ID here

    # Ask the Clusters API to start the cluster
    requests.post(
        f'https://{host_name}/api/2.0/clusters/start',
        json={'cluster_id': cluster_id},
        headers={'Authorization': f'Bearer {host_token}'},
    )
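
    If you want the notebook to fail fast when the token or cluster ID is wrong, you can capture the response of that call and check it (a small sketch, reusing the host_name, host_token and cluster_id variables from the snippet above):

    resp = requests.post(
        f'https://{host_name}/api/2.0/clusters/start',
        json={'cluster_id': cluster_id},
        headers={'Authorization': f'Bearer {host_token}'},
    )
    # Raises requests.HTTPError if the API rejected the request (e.g. invalid token or unknown cluster ID)
    resp.raise_for_status()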
    

    and then you can monitor the status using the Get endpoint until it gets into the RUNNING state:

    response = requests.get(
        f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
        headers={'Authorization': f'Bearer {host_token}'},
    ).json()
    status = response['state']  # current cluster state, e.g. PENDING or RUNNING
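
    A small polling loop built on that Get call could look like this (a sketch, assuming the same host_name, host_token and cluster_id variables as above, and that RUNNING / TERMINATED / ERROR are the cluster states you care about):

    import time

    while True:
        response = requests.get(
            f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
            headers={'Authorization': f'Bearer {host_token}'},
        ).json()
        status = response['state']
        print(f"Cluster state: {status}")
        if status == 'RUNNING':
            break  # cluster is warmed up, continue with the data processing
        if status in ('TERMINATED', 'ERROR', 'UNKNOWN'):
            raise RuntimeError(f"Cluster failed to start, state = {status}")
        time.sleep(30)  # wait before polling again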