I am trying to get a rough estimate of how my ETL pipeline is performing per execution as well as on a daily basis. I am querying the BQ job history and doing some minor aggregation in order to get my values for estimated cost, average execution run time, GB billed, slot time, etc.
Because each execution is comprised of several steps (i.e., I have some pre operations such as variable declarations, temp tables created, etc.), when I look through the BQ job history, each execution appears in several steps with different statement_types
and their respective sub-queries, but then also as a whole SCRIPT
for which the whole query appears. In these two cases, the query linked to the SCRIPT
statement type is the same one as the one you'd get if you put together all of the individual sub-queries with their respective statement_types. The metrics I obtain for the SCRIPT type are virtually the same as the one that is broken up in different steps with a few minor differences.
Below is a query which gives me the calculated cost for one single execution
WITH main AS (
SELECT
job_id,
start_time,
DATE(start_time) AS execution_date,
statement_type,
query,
CASE WHEN statement_type LIKE 'SCRIPT' THEN 'whole_job' ELSE 'steps' END AS type,
TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds,
ROUND(total_bytes_processed/1073741824, 2) AS gigabytes_processed,
ROUND(total_bytes_billed/1073741824, 2) AS gigabytes_billed,
ROUND(total_slot_ms/3600000, 2) AS total_slot_hours,
ROUND(0.04*(total_slot_ms/3600000), 4) AS query_cost_usd
FROM
`region-eu`.INFORMATION_SCHEMA.JOBS
WHERE
DATE(start_time) = '2023-04-29'
AND
user_email = 'dataform@<some_project>.iam.gserviceaccount.com'
AND
query NOT LIKE '%<some_query>%'
ORDER BY
start_time
LIMIT 14
)
SELECT
execution_date,
type,
ROUND(AVG(duration_seconds),3) AS avg_execution_duration_daily,
ROUND(SUM(gigabytes_processed),3) AS total_execution_gb_billed,
ROUND(SUM(total_slot_hours),3) AS total_execution_slot_hrs,
ROUND(SUM(query_cost_usd),2) AS total_execution_cost,
FROM
main
WHERE
execution_date = '2023-04-29'
GROUP BY
1,2
ORDER BY
1 ASC
My question is, do I treat the two results as two separate components that make up one execution, or they representing the same one? Please let me know if my question doesn't make sense, I'm happy to clarify.