I'm trying to select the latest partition from a BigQuery table without scanning the whole table in a DBT model in order to save query costs.
DBT doesnt allow using semicolons in a data model so using the DECLARE
+SET
scripting statements doesn't work as suggested here.
DBT has a sql_header macro which allows setting some variables in the header but that header doesn't accept references to a data model or at least the following code is not compiling:
{{ config(
sql_header=" DECLARE latest_partition_date DATE;
DECLARE latest_load_timestamp TIMESTAMP;
SET latest_partition_date = (SELECT MAX(_PARTITIONDATE) FROM {{ ref("model") }} );
SET latest_load_timestamp = (SELECT MAX(loaded_at) FROM {{ ref("model") }} WHERE _PARTITIONDATE = latest_partition_date);"
) }}
-- set the main query
SELECT * FROM {{ ref("model") }}
WHERE
-- Select the latest partition to reduce 'Bytes processed' for loading the query.
_PARTITIONDATE = latest_partition_date
-- Select the latest load within the latest partition to get only one duplicate of data.
AND loaded_at = latest_load_timestamp
I need to solve this in standard SQL.
Other methods that were suggested included setting WHERE _PARTITIONDATE = CURRENT_DATE()
or using DATE_SUB(CURRENT_DATE(), 3)
but those don't satisfy because data load breakages are unpredictable and only dynamically selecting the latest would work here. Is that possible?
Since the original question was working with dates, the right datatype conversion was the missing piece.
In the end I figured that the conversion to the right datatype needs to be done within jinja and not with SQL for the queries to accept right variables. Also, {{ max_date }}
needed quotes.
The final solution that I got working was this:
{%- call statement('max_partition_date_query', True) -%}
SELECT MAX(_PARTITIONDATE) as max_partition_date FROM {{ ref('model') }}
{%- endcall -%}
{%- set max_timestamp = load_result('max_partition_date_query')['data'][0][0] -%}
{%- set max_date = max_timestamp.strftime('%Y-%m-%d') -%}
select * FROM {{ ref('model') }}
WHERE _PARTITIONDATE = '{{ max_date }}'