Tags: sql, google-cloud-dataproc

Can I derive the number of compute hours Dataproc clusters have accrued from the billing data?


I would like to know if it's possible to get the total number of compute hours spent on Dataproc instances by looking at the billing data.

N.B. Just to reiterate: I'm not interested in the number of hours that clusters existed for; I want the total compute hours.

We export our billing data to BigQuery and I've run this query:

select cost_grouping,
       cast(sum(hours) as int64) as hours
from (
  select case
           when sku_description like 'Licensing Fee for Google Cloud Dataproc%' then sku_description
           else 'vm_compute'
         end as cost_grouping,
         hours
  from (
    select sku.description as sku_description,
           usage.amount_in_pricing_units as hours
    from `billing.gcp_billing_export`
    -- we have a workload label on our dataproc clusters that we can interrogate to get all of our dataproc costs
    where REGEXP_EXTRACT(TO_JSON_STRING(labels), r'"key":"workload","value":"([^,:]+)"') like 'dataproc%'
      and usage.pricing_unit = 'hour'
  )
)
group by cost_grouping

Which gives me this result:

+----------------------------------------------------+-----------+
|                   cost_grouping                    |   hours   |
+----------------------------------------------------+-----------+
| Licensing Fee for Google Cloud Dataproc (GPU cost) |         1 |
| Licensing Fee for Google Cloud Dataproc (CPU cost) |   8231009 |
| vm_compute                                         |   8230779 |
+----------------------------------------------------+-----------+

To clarify: the vm_compute hours are essentially core hours (for example, a 10-node cluster of 4-vCPU machines running for one hour accrues 40 core hours).

It's gratifying to see that

(Licensing Fee for Google Cloud Dataproc (GPU cost) hours) + (Licensing Fee for Google Cloud Dataproc (CPU cost) hours) - (vm_compute hours) = 1 + 8231009 - 8230779 = 231

That result is close enough to zero that I'm not going to lose sleep over it not being exactly zero.

I'm assuming, then, that summing the hours for Licensing Fee for Google Cloud Dataproc (GPU cost) and Licensing Fee for Google Cloud Dataproc (CPU cost) is an accurate reflection of the total number of compute hours spent by Dataproc. Can anyone confirm that this is the case? Are there any SKUs that don't show up in our data yet, but might in the future, that I should account for?
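
As an aside, since labels in the billing export is a repeated key/value field, the same filter can also be written with UNNEST instead of regexing the JSON string. This is only a sketch, reusing the table and workload label from the query above:

select sku.description as sku_description,
       usage.amount_in_pricing_units as hours
from `billing.gcp_billing_export`
where exists (
  -- labels is ARRAY<STRUCT<key STRING, value STRING>> in the billing export
  select 1
  from unnest(labels) as label
  where label.key = 'workload'
    and label.value like 'dataproc%'
)
and usage.pricing_unit = 'hour'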


Solution

  • The "GPU cost" item is an orthogonal, overlapping line item, and it is actually priced at $0 on the current price sheet, since Dataproc doesn't currently charge any Dataproc-specific premium on top of the underlying GPU usage. So if you're interested in the compute hours that are subject to actual Dataproc pricing, you should look only at "CPU cost". Other than that, you're right: Licensing Fee for Google Cloud Dataproc (CPU cost) should be an accurate count of the compute hours spent by Dataproc.

    One caveat to keep in mind: if this is based on flattened labels records, then matching a regex on the "key" can leave you with duplicate counts of the same underlying usage item; see Understanding GCP Dataproc billing and how it is affected by labels. For example, if you were to sum up rows whose label key matches %dataproc%, you'd end up triple- or quadruple-counting Dataproc hours, because Dataproc attaches several goog-dataproc-* labels to each cluster.

    In your case, as long as you're filtering on the single, unique "workload" key, the number should be correct; the sketch below illustrates both cases.
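
To make that caveat concrete, here is a hedged sketch of both queries. The table name is the one from the question; the goog-dataproc-* keys are the labels Dataproc applies automatically, per the answer linked above.

-- Hazard: flattening labels and matching broadly on the key returns the
-- same usage row once per matching label, inflating the total roughly 3-4x.
select sku.description, sum(usage.amount_in_pricing_units) as hours
from `billing.gcp_billing_export`, unnest(labels) as label
where label.key like '%dataproc%'   -- matches goog-dataproc-cluster-name,
                                    -- goog-dataproc-cluster-uuid, goog-dataproc-location, ...
  and usage.pricing_unit = 'hour'
group by 1

-- Correct: filter on exactly one key, so each usage row is counted once.
-- Adding the CPU-cost SKU filter (per the first point) isolates the hours
-- that are actually subject to Dataproc pricing.
select sku.description, sum(usage.amount_in_pricing_units) as hours
from `billing.gcp_billing_export`, unnest(labels) as label
where label.key = 'workload'
  and label.value like 'dataproc%'
  and usage.pricing_unit = 'hour'
  and sku.description = 'Licensing Fee for Google Cloud Dataproc (CPU cost)'
group by 1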