Search code examples
sqlgoogle-bigqueryetlpipelinedata-analysis

SQL BigQuery: Querying job execution logs


I am trying to get a rough estimate of how my ETL pipeline is performing per execution as well as on a daily basis. I am querying the BQ job history and doing some minor aggregation in order to get my values for estimated cost, average execution run time, GB billed, slot time, etc.

Because each execution is comprised of several steps (i.e., I have some pre operations such as variable declarations, temp tables created, etc.), when I look through the BQ job history, each execution appears in several steps with different statement_types and their respective sub-queries, but then also as a whole SCRIPT for which the whole query appears. In these two cases, the query linked to the SCRIPT statement type is the same one as the one you'd get if you put together all of the individual sub-queries with their respective statement_types. The metrics I obtain for the SCRIPT type are virtually the same as the one that is broken up in different steps with a few minor differences.

Below is a query which gives me the calculated cost for one single execution

WITH main AS (


   SELECT 
      job_id,
      start_time,
      DATE(start_time) AS execution_date,
      statement_type,
      query,
      CASE WHEN statement_type LIKE 'SCRIPT' THEN 'whole_job' ELSE 'steps' END AS type,
      TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds, 
      ROUND(total_bytes_processed/1073741824, 2) AS gigabytes_processed,
      ROUND(total_bytes_billed/1073741824, 2) AS gigabytes_billed,
      ROUND(total_slot_ms/3600000, 2) AS total_slot_hours,
      ROUND(0.04*(total_slot_ms/3600000), 4) AS query_cost_usd
  FROM 
    `region-eu`.INFORMATION_SCHEMA.JOBS
  WHERE 
    DATE(start_time) = '2023-04-29'
  AND 
    user_email = 'dataform@<some_project>.iam.gserviceaccount.com'
  AND 
    query NOT LIKE '%<some_query>%'
  ORDER BY 
    start_time
  LIMIT 14
)

SELECT 
    execution_date,
    type,
    ROUND(AVG(duration_seconds),3) AS avg_execution_duration_daily,
    ROUND(SUM(gigabytes_processed),3) AS total_execution_gb_billed,
    ROUND(SUM(total_slot_hours),3) AS total_execution_slot_hrs,
    ROUND(SUM(query_cost_usd),2) AS total_execution_cost,
  FROM 
    main
  WHERE 
    execution_date = '2023-04-29'
  GROUP BY 
    1,2
  ORDER BY 
    1 ASC

Here are the results: enter image description here

My question is, do I treat the two results as two separate components that make up one execution, or they representing the same one? Please let me know if my question doesn't make sense, I'm happy to clarify.


Solution

  • Since the script contains the collection of the steps, it can be considered the same as the one represented by whole job. And hence you can see equal slots and execution cost consumed for both the script and the sub-queries.