Search code examples
pythonpostgresqldbt

Using materialized=‘incremental’ to append new campaigns to the table


I need help in creating my first incremental model.

In the campaignchannel table, I have a surrogate_key column which takes the clientcode_id and concats the index number to create it, which acts like an unique key. The idea is that when a new campaign is created we will find the max surrogate_id for that client and +1 so we have a way of having a unique id for all campaigns by client. Will incremental work for this case?

Also, is it good to have a backup for incremental models?

This is python sql code

-- dbt campaign channel level dimension : dim_campaignchannel

{{ config(materialized='incremental', unique_key=surrogate_key) }}

SELECT  --ROW_NUMBER() OVER (ORDER BY c.clientcode, c.adservername) AS index_number,
CAST(CONCAT(c.clientcode_id, ROW_NUMBER() OVER (ORDER BY c.clientcode, c.adservername)) as numeric) AS surrogate_key, c.* from
(
SELECT b.clientcode_id, a.clientcode, a.adservername, a.mediachannel, a.adtech, a.programname, a.funnelstage, a.period, a.season, a.campaigntype
      ,a.lob, a.businessline, a.objective, a.market, a.targettype, a.subcampaign, a.campaignyear, a.startdate, a.enddate
FROM  public.map_campaign_segments a
JOIN
public.map_client_segments b ON a.clientcode = b.clientcode
where a.adservername != '-'
order by a.clientcode, a.adservername) c
order by c.clientcode, c.adservername

This is what I came up with, not sure if this is the right approach:

-- dbt campaign channel level dimension : dim_campaignchannel_master

{{ config(materialized='incremental',
          unique_key='surrogate_key') }}

SELECT
    CASE
        WHEN EXISTS (
            SELECT 1 FROM dim_campaignchannel WHERE adservername = c.adservername
        ) THEN
            CAST(CONCAT(c.clientcode_id, ROW_NUMBER() OVER (ORDER BY c.clientcode, c.adservername)) as numeric) -- Generate a new surrogate key if adservername exists
        ELSE
            (SELECT MAX(surrogate_key) + 1 FROM dim_campaignchannel WHERE clientcode = c.clientcode) -- Increment the surrogate key for existing clientcode
    END AS surrogate_key,
    c.*
FROM (
    SELECT
        b.clientcode_id, a.clientcode, a.adservername, a.mediachannel, a.adtech, a.programname, a.funnelstage, a.period, a.season, a.campaigntype,
        a.lob, a.businessline, a.objective, a.market, a.targettype, a.subcampaign, a.campaignyear, a.startdate, a.enddate
    FROM
        public.map_campaign_segments a
    JOIN
        public.map_client_segments b ON a.clientcode = b.clientcode
    WHERE
        a.adservername != '-'
    ORDER BY
        a.clientcode, a.adservername
) c

{% if is_incremental() %}

  -- this filter will only be applied on an incremental run
  -- (uses >= to include records arriving later on the same day as the last run of this model)
  WHERE
      NOT EXISTS (
          SELECT 1 FROM dim_campaignchannel WHERE adservername = c.adservername
      )

{% endif %}

ORDER BY
    c.clientcode, c.adservername

Solution

  • There is probably a misconception of what incremental materialization is for.

    Incremental models:
    ✅ serve to optimize a dbt workload (if dbt run executes too long)
    ❌ do not serve to generate unique keys incrementally

    Chances are high that you don't need an incremental model at this point.
    dbt best practices guide says:

    Start as simple as possible

    • 🔍 Start with a view. When the view gets too long to query for end users,
    • ⚒️ Make it a table. When the table gets too long to build in your dbt Jobs,
    • 📚 Build it incrementally. That is, layer the data on in chunks as it comes in.

    So it's a good time to start thinking about moving to incremental when dbt runs become too long and heavy. At the same time, it has nothing to do with generating surrogate/primary/incremental keys.


    Several notes:

    • ROW_NUMBER() OVER (ORDER BY c.clientcode, c.adservername)) can be error-prone. Because at different times the surrogate_key can have different values for the same client-campaign. If possible, consider adding a timestamp ROW_NUMBER() OVER (ORDER BY c.clientcode, c.adservername, c.campaign_created_timestamp))
    • Check out this post about why it's better to avoid auto-incremental id in dbt: link
    • You might want to consider changing your surrogate key using hash. Here is why it's beneficial. And here is a handy macro from dbt_utils that is more reliable than md5