Search code examples
dbt

Can we add dbt test in post hook in config?


my ask is something like this:

{{ config({ "materialized":'table', "post_hook":["dbt test -s model_name", "insert into {{ source('tgt_schema','tgt_table') }} select * from {{source('source_schema','source_tbl') }}" ]}) }}

I want to test model before inserting data into target table


Solution

  • No, this won't work the way you want it to:

    1. Post hooks are executed by your database, not on your command line. So the dbt test -s model_name will fail, since your database doesn't know what that means
    2. You have to manually manage transactions in hooks; most likely your insert statement will not be committed and will instead be rolled back

    Thinking about ddl like insert is typically a dbt antipattern. For a typical dbt model materialized as a table, dbt wraps your select statement in a create table as (...) statement, which effectively "inserts" the data into a new table. The typical dbt workflow would build this new table, and then test the new table -- before you've built the table there isn't really anything to test.

    I think what you probably want is something closer to a blue-green deployment, where you build your production tables in a new location (schema or database), test that deployment, and if the tests are successful, you swap in the newly-built tables for the old ones. This is a great pattern, although much more complex than a vanilla dbt deployment.

    If you really want to test a query before materializing it, you can break that up into two models, and use the ephemeral materialization on the first one. You can then test the first model before building the downstream one. (Note that this may be terrible from a performance standpoint, but ymmv).

    So, for example, in model_one.sql

    {{ config(materialized='ephemeral') }}
    select my, complex, logic
    from {{ ref('my_upstream_table') }}
    

    Then you'd add tests in model_one.yml:

    version: 2
    models:
      - name: model_one
        columns:
          - name: my
            tests:
              - not_null
              ...
    

    Then in model_two.sql

    {{ config(materialized='table') }}
    select * from {{ ref('model_one') }}
    

    Then at the command line,

    $ dbt build
    

    Will first run all of the tests against model_one and then materialize it as a table by building model_two.

    This would also work fine if model_two were an incremental model, if you wanted to select a subset of your data using model_one. But if you did that, you would lose the ability to use the is_incremental() macro, and --full-refresh would not work. In that case, I would probably hardcode the filter in model_one and then repeat all of that logic again in model_two

    model_one.sql:

    {{ config(materialized='ephemeral') }}
    select my, complex, logic
    from {{ ref('my_upstream_table') }}
    where my_timestamp > getdate() - interval '1 day'
    

    Then in model_two.sql

    -- we're no longer selecting from model_one,
    -- so put a ref in a comment to make dbt think
    -- it is downstream, so it only builds after
    -- the tests run on model_one
    -- {{ ref('model_one') }}
    
    {{ config(materialized='incremental') }}
    select my, complex, logic
    from {{ ref('my_upstream_table') }}
    {% if is_incremental() %}
    where my_timestamp >= (select max(my_timestamp) from {{ this }})
    {% endif %}