
Differences between databricks dbx execute and launch command


I have a project for which I want to be able to run some entry points on Databricks. I used dbx for that, with the following deployment.yaml file:

build:
  python: "poetry"

environments:
  default:
    workflows:
      - name: "test"
        existing_cluster_id: "my-cluster-id"
        spark_python_task:
          python_file: "file://tests/test.py"

I'm able to run the test script with the execute command:

poetry run dbx execute --cluster-id=my-cluster-id test

My problem with this option is that it launches the script interactively, and I can't really trace the executed run on Databricks, except by looking at the cluster's logs.

So I tried using the deploy and launch commands, such that a proper job is created and run on Databricks.

poetry run dbx deploy test && poetry run dbx launch test

However, the job run fails with the following error, which I don't understand:

Run result unavailable: job failed with error message
Library installation failed for library due to user error. Error messages:
'Manage' permissions are required to modify libraries on a cluster

In any case, what do you think is the best way to run a job from my local machine such that it can be traced on Databricks?


Solution

  • Based on the dbx documentation:

    The dbx execute command runs your code on an all-purpose cluster. It's very handy for interactive development and data exploration.

    Don't use dbx execute for production workloads

    It's not recommended to use dbx execute for production workloads. Run your workflows on dedicated job clusters instead. The reasoning is described in detail in the concepts section.

    In contrast to dbx execute, dbx launch launches your workflow on a dedicated job cluster. This is the recommended way for CI pipelines, automated launches, etc.
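
    Since the error mentions that 'Manage' permissions are required to modify libraries on the cluster, one likely fix is to stop targeting the shared all-purpose cluster and let dbx create a dedicated job cluster for each run: the job cluster is created for the run itself, so installing your project's library on it doesn't require extra permissions. A sketch of the adjusted deployment.yaml, assuming a new_cluster block in place of existing_cluster_id (the spark_version, node_type_id, and num_workers values below are placeholders to adapt to your workspace):

    ```yaml
    # deployment.yaml sketch: run the workflow on a dedicated job cluster
    # instead of an existing all-purpose cluster.
    build:
      python: "poetry"

    environments:
      default:
        workflows:
          - name: "test"
            # Placeholder cluster spec -- adjust to your cloud and workspace.
            new_cluster:
              spark_version: "11.3.x-scala2.12"
              node_type_id: "i3.xlarge"
              num_workers: 1
            spark_python_task:
              python_file: "file://tests/test.py"
    ```

    With this configuration, `poetry run dbx deploy test && poetry run dbx launch test` creates a proper job whose runs are visible and traceable in the Databricks Jobs UI.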