I am currently optimizing our ETL process and would like to see the cluster configuration that was used when processing data, so that I can track over time which worker node sizes I should use.
Is there a command that returns the number of workers and their node sizes in Python, so I can write them to a dataframe?
You can get this information by calling the Clusters Get REST API (/api/2.0/clusters/get). It returns JSON that includes the number of workers, the node types, and so on. Something like this:
import requests

# The notebook context exposes the workspace hostname and the current cluster ID
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
host_token = "your_PAT_token"  # personal access token
cluster_id = ctx.tags().get("clusterId").get()

# Call the Clusters Get API to fetch the full cluster configuration
response = requests.get(
    f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
    headers={'Authorization': f'Bearer {host_token}'}
).json()

num_workers = response['num_workers']               # absent on autoscaling clusters (look at 'autoscale' instead)
worker_node_type = response['node_type_id']         # worker node size (instance type)
driver_node_type = response['driver_node_type_id']  # driver node size
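Since you want to track this over time and write it as a dataframe, you can flatten the fields you care about into a one-row Spark DataFrame and append it to a table. Here is a minimal sketch, assuming a fixed-size (non-autoscaling) cluster so that num_workers is present; the table name cluster_config_history is just a placeholder:

from datetime import datetime

df = spark.createDataFrame(
    [(
        datetime.utcnow(),                   # capture timestamp so you can see changes over time
        cluster_id,
        response.get('cluster_name'),
        response['num_workers'],
        response.get('node_type_id'),        # worker node size
        response.get('driver_node_type_id'), # driver node size
    )],
    ['captured_at', 'cluster_id', 'cluster_name', 'num_workers', 'worker_node_type', 'driver_node_type']
)

df.write.mode('append').saveAsTable('cluster_config_history')  # placeholder table name

Appending a timestamped row each run gives you the history you need to compare worker sizes across ETL executions.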
P.S. If you run this from a non-notebook job, the notebook context token may not be available, but you can generate a PAT yourself and use it in place of host_token.
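In that case, rather than hardcoding the generated token in your job code, you can store it in a Databricks secret scope and read it at runtime with dbutils.secrets.get. A small sketch, where the scope and key names are placeholders you would create yourself:

# read the PAT from a secret scope instead of embedding it in the code
host_token = dbutils.secrets.get(scope="etl-secrets", key="databricks-pat")

This keeps the token out of source control and lets you rotate it without changing the job.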