Search code examples
jinja2dbt

DBT Cloud parse_timestamp on sharded tables


I have a dbt cloud project which queries some BigQuery data, applies some transformations and returns some datamarts. The source data is a collection of sharded tables i.e. GA4 events tables

I want to test for source freshness. aka return a warning if the new events table is older than 24h.

This is how I meant to go about it: Check the name of the new events table i.e. events_20231123, parse the suffix (20231123) as a timestamp, and check if that timestamp is older than 24h.

version: 2

sources:

  - name: google_analytics
    description: "Google Analytics 4 export"
    database: my_project_name
    schema: analytics_00000000 #the name of the BQ dataset
    tables:
      - name: events
        identifier: events_*
        loaded_at_field: parse_timestamp(_table_suffix, '%Y%m%d')
        freshness:
          warn_after:
            count: 24
            period: hour
          error_after:
            count: 36
            period: hour

Which returns the following error: 07:00:43 1 of 1 ERROR freshness of google_analytics.events .............................. [ERROR in 3.25s] 07:00:43
07:00:43 07:00:43 Database Error in source events (models/_source/_sources.yml) Mismatch between format character '2' and string character '%' at string index: 0 with format: '20231120' 07:00:43 Done.

How may I go about this? Should I just come up with a timestamp column in my sharded tables and test the refresh on this column? Why isn't my regex working?

thank you for your time.


Solution

  • Two things:

    1. in BQ's parse_timestamp() function, you first need to specify the format, and then the value. E.g. parse_timestamp('%Y%m%d', _table_suffix). See docs
    2. why are you using the table partition? If you want to get the exact hours, this will just give you the _table_suffix at 00:00:00. Why don't you use the event_timestamp from each record instead? Not perfect, but at least you will exactly know from which timestamp is each record.