Tags: databricks, azure-databricks

How can changes between SDLC environments be propagated in Databricks?


I am just trying to understand how all of this works.

Let's say that I create 3 workspaces (one for each environment).

Workspace   Environment
1           development
2           testing
3           production

I create my declarative objects (databases, schemas) using Terraform. Up to this point, I think I'm following best practices.

In GitHub, I have a branch for each environment.

Now, my questions are:

  1. If changes are made to the DB, can I simply alter my Terraform script, and will these changes be detected and applied?
  2. Additionally, is a simple Pull request between branches enough to propagate these changes?

Now, for the imperative objects (tables, views, stored procedures, etc.).

These kinds of objects change all the time, so as far as I'm aware, it's not best practice to define them using Terraform. So, now I'm left with many questions.

  1. How should these objects be defined?
  2. How will changes between environments be propagated?

Any help is welcome!


Solution

  • I'll try to answer the two groups of questions being asked: one relates to CI/CD, and the other to the management of tables, views, and stored procedures, which I'll just refer to as Tables for simplicity.

    CI/CD with Multiple Workspaces

    If changes are made to the DB, can I simply alter my Terraform script, and will these changes be detected and applied?

    Terraform does perform drift detection when the real infrastructure does not match the Terraform configuration; however, Terraform's job is always to try to make the resource match the state of your configuration. For example, say you use Terraform to define a catalog named retail like so:

    resource "databricks_catalog" "retail" {
      metastore_id = databricks_metastore.this.id
      name         = "retail"
      comment      = ""  # a comment added outside Terraform will be reverted on the next apply
    }
    

    Then, tomorrow, a team member manually uses the Databricks UI to add a comment to this catalog, e.g. COMMENT "Store information and retail data". The next time you release and apply your Terraform, it will UNDO your team member's comment because it is not in the Terraform configuration.

    That said, if you manually create a catalog or schema that doesn't collide with one being managed by Terraform (e.g. it has a different name), then technically there'd be no issue, although this would be an inconsistent and possibly confusing practice.

    Additionally, is a simple Pull request between branches enough to propagate these changes?

    Depending on your Git branching strategy (e.g. GitFlow, trunk-based, etc.) you may have different preferred triggers for each environment/workspace. Since you stated that you are using a different branch for each environment, this is most similar to GitFlow, and pull requests from lower-environment branches (e.g. dev) are a common way to promote your changes.

    For example, if your branches are dev, test, and prod, then you can use GitHub Actions to automatically apply your Terraform for the respective Databricks workspace depending on the target branch that was merged into. See the Hashicorp documentation on automating Terraform with GitHub Actions for more.

    You can also find a very good overview and example of Databricks CI/CD with multiple workspaces on the dbx DevOps concepts page, although dbx does overlap with Terraform in some minor ways.


    Managing Table Assets

    How should these objects be defined?

    Although Databricks tables can be managed with Terraform using the databricks_sql_table resource, views and stored procedures cannot at this time; furthermore, you are correct that it is generally not recommended to use Terraform to manage such volatile resources, which are not truly infrastructure.

    Instead, one alternative is to code these into your workflows (notebook, Python, JAR, etc.) using statements such as spark.sql("CREATE TABLE IF NOT EXISTS ...").
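
    As a minimal sketch of this approach (the catalog, schema, table, and column names below are placeholders, not anything from your setup), a notebook or job can create its target table idempotently before writing to it:

    from pyspark.sql import SparkSession

    # In a Databricks notebook `spark` already exists; getOrCreate() just makes the sketch self-contained.
    spark = SparkSession.builder.getOrCreate()

    # Idempotent DDL: the table is only created if it does not exist yet,
    # so the same code can run safely in the dev, test, and prod workspaces.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS retail.sales.orders (
            order_id    BIGINT,
            customer_id BIGINT,
            order_ts    TIMESTAMP,
            amount      DECIMAL(10, 2)
        )
        USING DELTA
    """)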

    With this strategy, table alterations such as adding columns in the future can lead to more complex code. However, remember that you can use .option("mergeSchema", "true") on your DataFrameWriter when writing a DataFrame to a table; this may be all you need, as long as you are confident in the schema of your DataFrame at runtime.
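
    As a hedged example of that option (again with placeholder names; df stands for whatever DataFrame your job produces), a write with mergeSchema adds any new columns to the existing Delta table instead of failing:

    # df is assumed to be a DataFrame whose schema you trust at runtime.
    # mergeSchema lets columns that are new in df be added to the target table on write.
    (df.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .saveAsTable("retail.sales.orders"))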

    How will changes between environments be propagated?

    If using the code approach described above (spark.sql("CREATE ...") and/or mergeSchema on writes), your changes are now a fundamental part of your code itself and therefore will naturally propagate when you push new versions of your notebooks/wheels/JARs.