I have a project for which I want to be able to run some entry points on Databricks. I used dbx for that, with the following deployment.yaml file:
```yaml
build:
  python: "poetry"

environments:
  default:
    workflows:
      - name: "test"
        existing_cluster_id: "my-cluster-id"
        spark_python_task:
          python_file: "file://tests/test.py"
```
I'm able to run the test script with the `execute` command:

```bash
poetry run dbx execute --cluster-id=my-cluster-id test
```
My problem with this option is that it launches the script interactively and I can't really retrieve the executed code on Databricks, except by looking at the cluster's logs.
So I tried using the `deploy` and `launch` commands, such that a proper job is created and run on Databricks:

```bash
poetry run dbx deploy test && poetry run dbx launch test
```
However, the job run fails with the following error, which I don't understand:

```
Run result unavailable: job failed with error message
Library installation failed for library due to user error. Error messages:
'Manage' permissions are required to modify libraries on a cluster
```
In any case, what do you think is the best way to run a job that can be traced on Databricks from my local machine?
Based on the doc here:

> The dbx execute command runs your code on an all-purpose cluster. It's very handy for interactive development and data exploration.
>
> **Don't use dbx execute for production workloads**
>
> It's not recommended to use dbx execute for production workloads. Run your workflows on dedicated job clusters instead. The reasoning is described in detail in the concepts section.
>
> In contrast to dbx execute, dbx launch launches your workflow on a dedicated job cluster. This is the recommended way for CI pipelines, automated launches, etc.
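
To follow that recommendation, the workflow can define a `new_cluster` block instead of `existing_cluster_id`, so that `dbx launch` runs on a dedicated job cluster rather than installing libraries on the shared all-purpose cluster (which is what triggers the 'Manage' permissions error). Below is a minimal sketch; the `spark_version`, `node_type_id` and worker count are placeholder values you would adjust to your workspace:

```yaml
environments:
  default:
    workflows:
      - name: "test"
        spark_python_task:
          python_file: "file://tests/test.py"
        # A job cluster is created for each run instead of reusing
        # an all-purpose cluster. Values below are placeholders.
        new_cluster:
          spark_version: "11.3.x-scala2.12"
          node_type_id: "i3.xlarge"   # adjust to your cloud/workspace
          num_workers: 1
```

With a definition like this, `poetry run dbx deploy test && poetry run dbx launch test` creates a proper job run that appears under the Workflows (Jobs) UI in Databricks, with its own run history and logs, which makes it traceable from your local machine.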