Search code examples
snowflake-cloud-data-platformdbt

DBT/Snowflake - identify updated and inserted rows per model


my stakeholders are requesting that after each model that updates or inserts new data (through merge) into some table, a log table needs to be updated with something like this:

log_table:

TABLE_LOADED    INSERTED_ROWS     UPDATED_ROWS
TAB_1           1500              1000

Any idea on how to approach this? How to identifies maybe with a post hook or a package that counts the inserted rows and updated rows separately?

Thanks


Solution

  • Apart from the Usage view in Snowflake, my suggestion is to leverage the row level metadata in your every model(s) (ref to https://infinitelambda.com/dbt-observability-row-level-metadata/), something that if you got the timestamp fields in your model e.g. inserted_at, updated_at, you could probably achieve count the rows:

    • row inserted: inserted_at = updated_at = T1
    • row updated: inserted_at = T1, updated_at = T2
    • (optional) row deleted: inserted_at = T1, updated_at = T2 and is_deleted = true

    By having these things, you could use pre/post-hook to count rows and insert it to log table:

    • INSERTED_ROWS:
    select  count(*)
    from    {{ this }}
    where   inserted_at = updated_at
        and inserted_at > {{ run_start_at }} 
        --or better to have a variable in the model that hold the the start time of the model (be mindful of the timezone)
    
    • UPDATED_ROWS:
    select  count(*)
    from    {{ this }}
    where   inserted_at < updated_at
        and inserted_at > {{ run_start_at }} 
        --or better to have a variable in the model that hold the the start time of the model (be mindful of the timezone)
    

    Hope this helps!