Search code examples
snowflake-cloud-data-platformdbt

Is there a way to force a full refresh for a dbt incremental model if its model changed?


I'll implement dbt for pipelines in Snowflake with incremental models to save query costs but I want to manage the changes of schemas that will be quite frequent. I will have one daily ETL job for each env running a dbt run. Also, in qa and prod environments I'll not be able to run any cmd as I don't have access to these environments for security issues, only to dev.

Is it possible to trigger a full refresh of a model if its schema changed?

I saw that we can use the on_schema_change option with incremental models but this will just add (or drop) columns without populating them which is not exactly what I'm looking for as I'll not be able to run a force refresh manually in qa and prod.

Thanks a lot


Solution

  • Is it possible to [automatically] trigger a full refresh of a model if its schema changed?

    Sadly, the answer is 'no'.

    The docs for on_schema_change are unequivocal:

    Note: None of the on_schema_change behaviors backfill values in old records for newly added columns. If you need to populate those values, we recommend running manual updates, or triggering a --full-refresh.

    I read back through some of the old GitHub issues and the behaviour you desire was discussed, but ultimately not implemented.

    So what are your options?

    Frequent schema changes sounds bad! Is this because of upstream changes, or because you're iterating the DBT models?

    If the former, this becomes partly a communication problem. If your upstream colleagues can't help you even after you explain that this causes difficulties, then you're best off not doing an incremental model; you can explain that the increased costs are the price to support frequent schema changes.

    If the latter, hopefully the frequent changes will eventually cease? You could go non-incremental until then (avoid prematurely optimising).

    Otherwise: you're going to have to find a way to do occasional full refreshes. Could you deploy a full-refresh version whenever you deploy a schema change, followed by a non-full-refresh version the next day?

    In this picture let's imagine

    • you have semantically versioned models, for example you're on v1.2.1 -you want to release v1.3.0, which includes a new column on your incremental table
    • you will deploy a v1.3.0-prerelease with full_refresh=true added to the model config for the updated mode
    • then the next day (after the daily run) deploy v1.3.0 with this removed

    If you really have to do frequent schema changes and optimise your costs, this acknowledges that a full refresh ought to somehow be captured in version control, even if in a light touch way.