Search code examples
sqlgoogle-bigquerydbt

How to choose the latest partition of a bigquery table in DBT without scanning the whole table?


I'm trying to select the latest partition from a BigQuery table without scanning the whole table in a DBT model in order to save query costs.

DBT doesnt allow using semicolons in a data model so using the DECLARE+SET scripting statements doesn't work as suggested here.

DBT has a sql_header macro which allows setting some variables in the header but that header doesn't accept references to a data model or at least the following code is not compiling:

{{ config(
  sql_header="  DECLARE latest_partition_date DATE;
  DECLARE latest_load_timestamp TIMESTAMP;
  SET latest_partition_date = (SELECT MAX(_PARTITIONDATE) FROM {{ ref("model") }} );
  SET latest_load_timestamp = (SELECT MAX(loaded_at) FROM {{ ref("model") }} WHERE _PARTITIONDATE = latest_partition_date);"
) }}

-- set the main query
SELECT * FROM {{ ref("model") }}
WHERE 
-- Select the latest partition to reduce 'Bytes processed' for loading the query.
_PARTITIONDATE = latest_partition_date
-- Select the latest load within the latest partition to get only one duplicate of data.
AND loaded_at = latest_load_timestamp

I need to solve this in standard SQL.

Other methods that were suggested included setting WHERE _PARTITIONDATE = CURRENT_DATE() or using DATE_SUB(CURRENT_DATE(), 3) but those don't satisfy because data load breakages are unpredictable and only dynamically selecting the latest would work here. Is that possible?


Solution

  • Since the original question was working with dates, the right datatype conversion was the missing piece.

    In the end I figured that the conversion to the right datatype needs to be done within jinja and not with SQL for the queries to accept right variables. Also, {{ max_date }} needed quotes.

    The final solution that I got working was this:

    
    {%- call statement('max_partition_date_query', True) -%}
      SELECT MAX(_PARTITIONDATE) as max_partition_date FROM {{ ref('model') }}
    {%- endcall -%}
    
    {%- set max_timestamp = load_result('max_partition_date_query')['data'][0][0] -%}
    {%- set max_date = max_timestamp.strftime('%Y-%m-%d') -%}
    
    select * FROM {{ ref('model') }}
    WHERE _PARTITIONDATE = '{{ max_date }}'