Search code examples
google-bigquerydbt

How to label a dbt project so that I can query the INFORMATION_SCHEMA for all jobs created by this project?


  • dbt-core: 1.8.0
  • dbt-bigquery: 1.8.1

I have a dbt-core project that includes 2 tables (1 incremental, 1 that aggregates the incremental table).

Since dbt only logs the direct *.sql files but not the incremental macro magic with the MERGE statements the logged bytes_billed in the /target/run_results.json are missing the majority of the costs when you run your project.

To solve this issue I need a way to get all jobs that are created when I run my dbt project.

dbt_project.yml:

models:
  dbt_tracking:
    near_realtime:
      +tags: near_realtime
      +labels:
        source: dbt_near_realtime
      +materialized: table

config for the models/near_realtime/stg_near_realtime_events.sql file:

{{
    config(
        labels={"task_id": "near_realtime"},
        materialized="incremental",
        incremental_strategy="merge",
        on_schema_change="append_new_columns",
        partition_by={
            "field": "derived_tstamp",
            "data_type": "timestamp",
            "granularity": "hour",
        },
        partition_expiration_days=1,
        cluster_by=["event_name"],
    )
}}

config for the models/near_realtime/int_near_realtime_aggs.sql file:

{{
    config(
        labels = {"task_id": "near_realtime"},
    )
}}

execution command:

dbt run --select tag:near_realtime

but when I search for my label in the

`my_project_id.region-eu.INFORMATION_SCHEMA.JOBS_BY_USER`

the only label.key I get via:

select distinct
  labels[SAFE_OFFSET(0)].key
from
  `my_project_id.region-eu.INFORMATION_SCHEMA.JOBS_BY_USER`
where
  timestamp_trunc(creation_time, day) = timestamp(current_date())

is this:

dbt_invocation_id

How can I query the jobs for my query so that I can see which total_bytes_billed my project run in dbt created for cost calculations?


Solution

  • Add this to your dbt_profile.yml:

    query-comment:
      job-label: True
      comment: "{{ var('query_label', 'dbt_run') }}"
    

    Then you can set a query label via:

    dbt run --select tag:my_tag --vars '{"query_label": "here_is_my_label"}'
    

    You can then analyse the total_bytes_billed per job query label in BQ via:

    with
      dbt_cloud_logs as (
          select
              datetime(creation_time, "Europe/Berlin")                                                                              as creation_time
            , (select unnested_labels.value from unnest(labels) as unnested_labels where unnested_labels.key = "dbt_invocation_id") as dbt_invocation_id
            , (select unnested_labels.value from unnest(labels) as unnested_labels where unnested_labels.key = "query_comment")     as dbt_query_label
            , total_bytes_billed
          from
              `region-eu.INFORMATION_SCHEMA.JOBS`
          where
              1=1
          and user_email                          = "your_service_account_or_email_adress"
          and timestamp_trunc(creation_time, day, "Europe/Berlin") >= timestamp("when_did_you_implement_job_query_labels?", "Europe/Berlin")
    )
    
    
    select
        *
    from
        dbt_cloud_logs
    where
        1=1
    and dbt_query_label is not null
    and dbt_invocation_id is not null
    order by
        dbt_query_label
      , creation_time desc