Search code examples
google-cloud-platformgoogle-bigquery

Getting cost value mismatch while using UNNEST in BigQuery


I am using BigQuery to get GCP usage data using queries. I have a single query to get all details from GCP exported Bill data to BigQuery.

select  sum(cost) as cost, sum(credits.amount) as credit FROM `TABLENAME`
        LEFT JOIN UNNEST(credits) AS credits
        WHERE  invoice.month="202404"

AND

select  sum(cost) as cost FROM `TABLENAME`
        WHERE  invoice.month="202404"

in both queries, I am getting different cost values. Why I am getting like this?

Above is a sample query.

Below one is my actual query. Here I am aggregating hourly data as daily data by filtering zero cost and credit amounts. in this scenario, I am getting multiple duplicate rows in result. It's Bigquery cost export data.


SELECT
              billing_account_id,
              service.id AS service_id,
              service.description AS service_des,
              sku.id AS sku_id,
              sku.description AS sku_des,
        FORMAT_DATETIME('%Y-%m-%d', usage_start_time) AS usage_date,
              location.location AS location,
              location.region AS region,
              location.country AS country,
              location.zone AS zone,
              invoice.month AS invoice_mon,
              currency,
              AVG(currency_conversion_rate) as currency_conversion_rate,
              cost_type,
        resource.name AS resource_name,
              resource.global_name AS res_global_name,
              SUM(usage.amount_in_pricing_units) AS usage_amount_pricing_units,
              usage.pricing_unit AS pricing_unit,
              SUM(cost) AS cost,
              SUM(cost_at_list) AS cost_at_list,
              transaction_type,
              seller_name,
              adjustment_info.id AS adjustment_info_id,
              adjustment_info.description AS adjustment_des,
              adjustment_info.type AS adjustment_type,
              adjustment_info.mode AS adjustment_mode,
              SUM(price.effective_price) AS effective_price,
              SUM(price.pricing_unit_quantity) AS pricing_unit_quantity,
              project.id AS project_id,
              project.number AS project_number,
              project.name AS project_name,
              SUM(IFNULL(credits.amount, 0)) AS credit_amount,
              credits.type  AS credit_type,
              credits.id AS credit_id,
              credits.name as credit_name,
              credits.full_name AS credit_full_name,
              TO_JSON_STRING(labels) AS labels,
              TO_JSON_STRING(system_labels) AS system_labels
        FROM `TABLENAME`
        LEFT JOIN  UNNEST(credits) AS credits
        WHERE (cost!=0 OR credits.amount!=0)  and invoice.month="202404"

        GROUP BY
              billing_account_id,
              service_id,
              service_des,
              sku_id,
              sku_des,
              usage_date,
              location,  
              region,
              country,
              zone,
              invoice_mon,
              currency,
              cost_type,
              resource_name,
              res_global_name,
              usage.pricing_unit,
              transaction_type,
              seller_name,
              adjustment_info_id,
              adjustment_des,
              adjustment_type,
              adjustment_mode,
              project_id,
              project_number,
              project_name,
              credit_type,
              credit_full_name,
              credit_id,
              credit_name,
              labels,
              system_labels   
        HAVING 
            (cost!=0 OR credit_amount!=0)
        ORDER BY
              usage_date

Edit:::


SELECT
              billing_account_id,
              service.id AS service_id,
              service.description AS service_des,
              sku.id AS sku_id,
              sku.description AS sku_des,
        FORMAT_DATETIME('%Y-%m-%d', usage_start_time) AS usage_date,
              location.location AS location,
              location.region AS region,
              location.country AS country,
              location.zone AS zone,
              invoice.month AS invoice_mon,
              currency,
              AVG(currency_conversion_rate) as currency_conversion_rate,
              cost_type,
        resource.name AS resource_name,
              resource.global_name AS res_global_name,
              SUM(usage.amount_in_pricing_units) AS usage_amount_pricing_units,
              usage.pricing_unit AS pricing_unit,
              SUM(IF(IFNULL(credits_offset, 0) = 0, cost, 0)) AS cost,
              SUM(cost_at_list) AS cost_at_list,
              transaction_type,
              seller_name,
              adjustment_info.id AS adjustment_info_id,
              adjustment_info.description AS adjustment_des,
              adjustment_info.type AS adjustment_type,
              adjustment_info.mode AS adjustment_mode,
              SUM(price.effective_price) AS effective_price,
              SUM(price.pricing_unit_quantity) AS pricing_unit_quantity,
              project.id AS project_id,
              project.number AS project_number,
              project.name AS project_name,
              SUM(IFNULL(credits.amount, 0)) AS credit_amount,
              credits.type  AS credit_type,
              credits.id AS credit_id,
              credits.name as credit_name,
              credits.full_name AS credit_full_name,
              TO_JSON_STRING(labels) AS labels,
              TO_JSON_STRING(system_labels) AS system_labels
        FROM `TABLENAME`
    LEFT JOIN UNNEST(credits) credits WITH OFFSET AS credits_offset
        WHERE (cost!=0 OR credits.amount!=0) 

        GROUP BY
              billing_account_id,
              service_id,
              service_des,
              sku_id,
              sku_des,
              usage_date,
              location,  
              region,
              country,
              zone,
              invoice_mon,
              currency,
              cost_type,
              resource_name,
              res_global_name,
              usage.pricing_unit,
              transaction_type,
              seller_name,
              adjustment_info_id,
              adjustment_des,
              adjustment_type,
              adjustment_mode,
              project_id,
              project_number,
              project_name,
              credit_type,
              credit_full_name,
              credit_id,
              credit_name,
              labels,
              system_labels   
        HAVING 
            (cost!=0 OR credit_amount!=0)
        ORDER BY
              usage_date

Solution

  • it's because of your LEFT JOIN creating a 1:N relationship, thus duplicating the data in cost column.

    Your table essentially looks like this:

    cost credits
    123 ARRAY[{amount: 1}, {amount: 5}]
    345 ARRAY[{amount: 3}]

    and the cost before unnesting is 468.

    When you unnest the array, you're left with flat table, which looks like this

    cost credits
    123 1
    123 5
    345 3

    and when you sum both, you get inflated numbers for cost, because of the duplication.

    I suggest unnesting the credits array with offset and only summing the cost with offset 0, eg. something like this:

    WITH cost_data AS (
                          SELECT
                              'compute'                                    AS sku
                            , 123                                          AS cost
                            , [STRUCT (1 AS amount), STRUCT (5 AS amount)] AS credits
                          UNION ALL
                          SELECT
                              'storage' AS sku
                            , 456       AS cost
                            , [STRUCT (3 AS amount)]
                      )
    SELECT sku, SUM(IF(credits_offset = 0, cost, 0)) AS cost, SUM(unn.amount) AS credit_amount
    FROM cost_data
    LEFT JOIN UNNEST(credits) unn WITH OFFSET AS credits_offset
    GROUP BY sku;
    

    EDIT: As pointed out in the below comment, this would only return rows where any credits were used, since I was filtering based on the offset and offset is present only if an unnested array is. In that case, just use SUM(IF(IFNULL(credits_offset, 0) = 0, cost, 0)) instead of SUM(IF(credits_offset = 0, cost, 0)), to include all rows.