Tags: azure-databricks, databricks-unity-catalog, databricks-workflows, databricks-asset-bundle

How to set the default catalog with Databricks Asset Bundles in different environments for DLT and notebook tasks?


I would like to use Databricks Asset Bundles to set the default catalog/schemas in different environments, in order to refer to them in scripts, when creating paths, etc. The DAB would be deployed with Azure Pipelines. The databricks.yml would, for example, look like this for the catalog:

bundle:
  name: "DAB"

variables:
  default_catalog: 
    description: Catalog to set and use 
    default: catalog_dev

include:
  - "resources/*.yaml"  

targets:
  dev:
    variables:
      default_catalog: catalog_dev
    workspace:
      host: xxxx

  prod:
    variables:
      default_catalog: catalog_prod
    workspace:
      host: xxxx
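
The same pattern would presumably extend to schema names, e.g. with an additional (hypothetical) default_schema variable that could replace the hardcoded target: bronze in the DLT pipeline below via ${var.default_schema}:

variables:
  default_schema:
    description: Schema to set and use
    default: bronze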

I have workflow tasks that are Delta Live Tables pipelines; there I can use the catalog definition, e.g.:

  pipelines:
    bronze_scd2:
      name: bronze_scd2 
      clusters:
        - label: default
          autoscale:
            min_workers: 1
            max_workers: 5
            mode: ENHANCED
      libraries:
        - notebook:
            path: ${workspace.file_path}/1b_bronze/bronze_scd2.py
      target: bronze
      development: false
      catalog: ${var.default_catalog}

But how would you best set it for a notebook task? I could create a widget for every single notebook that I am using (as described in the documentation here), but that doesn't seem to be the most efficient way:

tasks:
  - task_key: ingest_api
    notebook_task:
      notebook_path: ${workspace.file_path}/ingest_api
      source: WORKSPACE
      base_parameters:
        catalog: ${var.default_catalog}
    [...]

Additionally, that would require setting it in the notebooks, e.g. like this:

dbutils.widgets.text("catalog", "")
catalog = dbutils.widgets.get("catalog")

It seems to make the most sense to use the DAB files as the configuration, especially if you have a combination of DLT and notebook tasks. But are there any recommendations on how to best share catalog settings/schema names across different Databricks environments (separate configuration files, maybe setting environment variables through pipelines)? Would appreciate any hints and recommendations.


Solution

  • You can use a task parameter, a Spark configuration property, or a Spark environment variable to make the bundle variables available in a notebook.

    Since you are not satisfied with notebook parameters, you can use either a Spark configuration property or a Spark environment variable.
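
    For completeness, the first option can also be set once at the job level rather than per task. A minimal sketch, assuming job-level parameters (the Jobs API parameters field), which notebook tasks can then read as a widget named catalog:

    resources:
      jobs:
        my-job:
          parameters:
            - name: catalog
              default: ${var.default_catalog}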

    There are two places you can set this: in the job_clusters settings under the jobs mapping, or in the new_cluster settings under the tasks mapping.

    Here is a sample.

    In the job_clusters settings:

    resources:
      jobs:
        <some-unique-programmatic-identifier-for-this-job>:
          # ...
          job_clusters:
            - job_cluster_key: <some-unique-programmatic-identifier-for-this-key>
              new_cluster:
                node_type_id: i3.xlarge
                num_workers: 0
                spark_version: 14.3.x-scala2.12
                spark_conf:
                  "taskCatalog": ${var.default_catalog}

    You can then access it from Spark as below:

    spark.conf.get("taskCatalog")
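
    A minimal sketch, assuming you want unqualified table names in the notebook to resolve against this catalog:

    # read the value injected via spark_conf and make it the session default
    catalog = spark.conf.get("taskCatalog")
    spark.sql(f"USE CATALOG {catalog}")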
    

    Refer to this for more about job cluster settings overrides.

    OR

    In the new_cluster settings:

    resources:
      jobs:
        my-job:
          name: my-job
          tasks:
            - task_key: my-key
              new_cluster:
                spark_version: 13.3.x-scala2.12
                node_type_id: i3.xlarge
                num_workers: 0
                spark_conf:
                  "taskCatalog": ${var.default_catalog}

    Refer to this for more about task settings overrides.

    In the same way, you can configure Spark environment variables:

    spark_env_vars:
      "taskCatalog": ${var.default_catalog}

    and access it like below:

    import os 
    default_catalog = os.getenv("taskCatalog", "catalog_dev")
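
    For example, the resolved value could then be used to build fully qualified names (the schema and table here are hypothetical):

    import os

    default_catalog = os.getenv("taskCatalog", "catalog_dev")
    # hypothetical schema/table, to illustrate building paths from the variable
    df = spark.read.table(f"{default_catalog}.bronze.customers")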
    

    Also, refer to a few samples here.