I am just trying to understand how all of this works.
Let's say that I create 3 workspaces (one for each environment).
| Workspace | Environment |
|---|---|
| 1 | development |
| 2 | testing |
| 3 | production |
I create my declarative objects (databases, schemas) using Terraform. Up to this point I think I'm following best practices.
In GitHub, I have a branch for each environment.
Now, my questions are about the imperative objects (tables, views, stored procedures, etc.). These kinds of objects change all the time, so as far as I'm aware, it's not best practice to define them using Terraform. So now I'm left with many questions.
Any help is welcome!
I'll try to answer the two groups of questions being asked: one relates to CI/CD, and the other to the management of tables, views, and stored procedures, which I'll just refer to as tables for simplicity.
> If changes are made to the DB, can I simply alter my Terraform script so that these changes will be detected and applied?
Terraform does perform drift detection when the real infrastructure does not match the Terraform configuration; however, Terraform's job is always to make the resource match the state of your configuration. For example, say you use Terraform to define a catalog `retail` like so:
```hcl
resource "databricks_catalog" "retail" {
  metastore_id = databricks_metastore.this.id
  name         = "retail"
  comment      = ""
}
```
Then tomorrow a team member manually uses the Databricks UI to add a comment to this catalog, e.g. `COMMENT 'Store information and retail data'`. The next time you release and apply your Terraform, it will UNDO your team member's comment because it is not in the Terraform configuration. (To keep such a change, you would mirror it in the `comment` argument of your configuration.)
That said, if you manually create a schema that doesn't collide with one being managed by Terraform (e.g. a different name), then technically there'd be no issue, although this would be an inconsistent and possibly confusing practice.
> Additionally, is a simple pull request between branches enough to propagate these changes?
Depending on your Git branching strategy (e.g. GitFlow, trunk-based, etc.), you may prefer different triggers for each environment/workspace. Since you stated you are using a different branch for each environment, this is most similar to GitFlow, and pull requests from lower-environment branches (e.g. `dev`) are a common way to promote your changes.
For example, if your branches are `dev`, `test`, and `prod`, then you can use GitHub Actions to automatically apply your Terraform for the corresponding Databricks workspace depending on the target branch that was merged into. See the HashiCorp documentation on automating Terraform with GitHub Actions for more.
You can also find a very good overview and example of Databricks CI/CD with multiple workspaces on the dbx DevOps concepts page, although dbx does overlap with Terraform in some minor ways.
> How should these objects be defined?
Although Databricks tables can be managed with Terraform using the `databricks_sql_table` resource, views and stored procedures cannot at this time; furthermore, you are correct that it is usually not recommended to use Terraform to manage such volatile resources, which are not truly infrastructure.
Instead, one alternative is to create these objects in your workflows (notebook, Python, JAR, etc.) using statements such as `spark.sql("CREATE TABLE IF NOT EXISTS ...")`.
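For instance, here is a minimal sketch of that pattern (the catalog, schema, table, and column names are illustrative assumptions of mine, not from the question):

```python
# Hypothetical example: build an idempotent CREATE TABLE statement that a
# job or notebook can safely run on every deployment. All names here are
# illustrative.
def create_orders_table_sql(catalog: str, schema: str) -> str:
    """Return a CREATE TABLE IF NOT EXISTS statement for an orders table."""
    return f"""
        CREATE TABLE IF NOT EXISTS {catalog}.{schema}.orders (
            order_id BIGINT,
            store_id BIGINT,
            order_ts TIMESTAMP,
            amount   DECIMAL(10, 2)
        )
        USING DELTA
    """

# In a Databricks notebook or job you would then run:
# spark.sql(create_orders_table_sql("retail", "sales"))
```

Because the statement uses `IF NOT EXISTS`, re-running the job is safe: the table is created on the first run and left untouched afterwards.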
With this strategy, table alterations such as adding columns can make the code more complex in the future, but remember that you can use `.option("mergeSchema", "true")` on your DataFrameWriter when writing a DataFrame to a table, and this may be all you need as long as you are confident in the schema of your DataFrame at runtime.
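To illustrate what `mergeSchema` does at write time, here is a conceptual sketch in plain Python (no Spark required): existing table columns keep their position, and any new columns from the incoming DataFrame are appended. The helper is my own illustration, not Spark code:

```python
# Conceptual illustration (not Spark code) of Delta's mergeSchema behavior:
# the table keeps its existing columns, and new columns present only in the
# incoming DataFrame are appended to the table's schema.
def merged_columns(table_cols: list, df_cols: list) -> list:
    """Union of table and DataFrame columns, with new columns appended last."""
    return table_cols + [c for c in df_cols if c not in table_cols]
```

In actual Spark code this corresponds to something like `df.write.mode("append").option("mergeSchema", "true").saveAsTable("retail.sales.orders")` (table name illustrative).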
> How will changes between environments be propagated?
If using the code approach described above (`spark.sql("CREATE ...")` and/or `mergeSchema` on writes), your changes are now a fundamental part of your code itself and will therefore naturally propagate when you push new versions of your notebooks/wheels/JARs.
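One common way to make that propagation concrete is to parameterize the code by environment, so the same artifact runs unchanged in every workspace. A sketch (the environment and catalog names are assumptions of mine):

```python
# Hypothetical sketch: resolve the target catalog from the deployment
# environment so the same notebook/wheel runs unchanged in each workspace.
CATALOG_BY_ENV = {
    "development": "retail_dev",
    "testing": "retail_test",
    "production": "retail",
}

def target_catalog(env: str) -> str:
    """Map a deployment environment to its catalog name (names illustrative)."""
    if env not in CATALOG_BY_ENV:
        raise ValueError(f"unknown environment: {env}")
    return CATALOG_BY_ENV[env]
```

The environment name itself can be supplied as a job parameter or set by your CI/CD pipeline, so merging a branch promotes both the code and the table definitions it creates.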