Search code examples
databricksazure-databrickscicddatabricks-unity-catalog

How to handle CICD for spark sql code when moving to Databricks's Unity Catalog?


Parameterizing catalog name in Python seems straightforward. But for the notebooks and dashboards where we are using spark sql queries they are usually in schema.table format. Since we use different workspaces for dev/qa/prod, same code deployed in different environment works well. And same code goes in git and is deployed in dev/qa/prod environments without any updates required. Now when using Unity Catalog we will be uniquely identifying dev/qa/prod datasets using catalog concept. And sql code will have to refer the datasets as catalog.schema.table. This is problematic for CICD and GIT. What would be the best practice for this scenario?


Solution

  • The simplest way to do this if you already have existing code using schema.table naming notation is to set the catalog at the beginning of your job/notebook/code.

    This can be done one of two ways:

    1. With Spark SQL: spark.sql(f"USE CATALOG {catalog}")
    2. Using a Spark Conf on your cluster of spark.databricks.sql.initial.catalog.name=prod

    If you use option #1 I would recommend creating a widget or parameter to pass the value of your catalog name (dev/qa/prod), which can then be provided easily by tools like Terraform, Databricks CLI, dbx, etc.

    Option #2 may be favorable if you want to avoid changing any source code at all, because the Spark configuration can be set on the cluster settings. See Databricks documentation on this spark setting here: https://docs.databricks.com/data-governance/unity-catalog/hive-metastore.html#default-catalog