google-cloud-platform, google-bigquery, query-optimization

What's the effect of the "Bytes Shuffled" metric from BigQuery on cost?


I'm optimizing a query in BigQuery and managed to reduce all performance metrics by a good margin, except for the "Bytes Shuffled" metric, which increased from 3 GB to 3.56 GB.

I would like to know whether the Bytes Shuffled metric has an impact on cost and, if so, by how much.


Solution

  • To understand this, you need to keep the BigQuery architecture in mind. It is more or less a MapReduce architecture.

    Map operations (filter, transform, ...) can be done on a single node. Reduce operations (join, subtract, ...) require communication between nodes.

    Of course, map operations are much more efficient than reduce operations (in memory only, no network communication, no synchronisation or waiting, ...).


    Bytes shuffled are the bytes exchanged between the nodes.
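
To make the map/reduce distinction concrete, here is a toy sketch in plain Python. It is not how BigQuery is implemented; the two "nodes", the sample rows, and the helper names (`map_stage`, `shuffle_stage`, `reduce_stage`) are all hypothetical. It only illustrates that a filter runs locally, while a group-by needs rows with the same key to meet on one node, and the rows that have to move are the shuffle.

```python
from collections import defaultdict

# Hypothetical data: each inner list is the partition held by one "node".
partitions = [
    [("FR", 10), ("US", 5), ("FR", 7)],
    [("US", 3), ("DE", 8), ("FR", 1)],
]


def map_stage(rows, min_amount):
    """Map-style step: filter rows locally; nothing leaves the node."""
    return [row for row in rows if row[1] >= min_amount]


def shuffle_stage(mapped_partitions, num_nodes):
    """Route each row to the node that owns its key and count rows that move."""
    buckets = defaultdict(list)
    moved = 0
    for source, rows in enumerate(mapped_partitions):
        for key, amount in rows:
            target = sum(key.encode()) % num_nodes  # deterministic toy hash
            if target != source:
                moved += 1  # this row crosses the "network"
            buckets[target].append((key, amount))
    return buckets, moved


def reduce_stage(buckets):
    """Reduce-style step: sum amounts per key once the rows are co-located."""
    totals = defaultdict(int)
    for rows in buckets.values():
        for key, amount in rows:
            totals[key] += amount
    return dict(totals)


mapped = [map_stage(rows, min_amount=3) for rows in partitions]
buckets, moved_rows = shuffle_stage(mapped, num_nodes=2)
print("rows shuffled:", moved_rows)
print("totals:", reduce_stage(buckets))
```

In this sketch the filter never increases the shuffle count; only the regrouping by key does, which is the part that "Bytes Shuffled" measures.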


    The cost question is not simple to answer. If you pay as you go (on-demand, no slot reservation), there is no extra cost: the same volume of data is processed, so the only impact is a slower query.

    If you have reserved slots (nodes and slots are similar), there is no extra cost either. But you hold the slots longer (the query is slower, so slot usage lasts longer), and if you share the slots with other users, queries, or projects, this can affect overall performance and, possibly, the overall cost of your projects.

    So there is no direct cost, but you should keep the overall impact on duration in mind.
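
The reasoning above can be sketched as a small calculation. This is a rough illustration under stated assumptions, not official pricing logic: the price per TiB, the bytes billed, and the slot-millisecond figures are placeholders (check the current BigQuery pricing page for real rates), and `QueryStats` is a hypothetical container. Only the 3 GB and 3.56 GB shuffle figures come from the question.

```python
from dataclasses import dataclass

TIB = 2 ** 40
GIB = 2 ** 30

# ASSUMPTION: placeholder on-demand rate in USD per TiB billed; not authoritative.
ASSUMED_PRICE_PER_TIB = 6.25


@dataclass
class QueryStats:
    """Hypothetical summary of one query run."""
    bytes_billed: int    # drives the on-demand charge
    bytes_shuffled: int  # data exchanged between nodes; not billed directly
    total_slot_ms: int   # slot time consumed; matters when slots are reserved


def on_demand_cost(stats, price_per_tib=ASSUMED_PRICE_PER_TIB):
    """On-demand estimate: a function of bytes billed only."""
    return stats.bytes_billed / TIB * price_per_tib


def slot_hours(stats):
    """Slot time held by the query, in slot-hours."""
    return stats.total_slot_ms / 1000 / 3600


# Shuffle sizes are from the question; every other number is made up.
before = QueryStats(bytes_billed=50 * GIB, bytes_shuffled=int(3.00 * GIB), total_slot_ms=900_000)
after = QueryStats(bytes_billed=50 * GIB, bytes_shuffled=int(3.56 * GIB), total_slot_ms=1_080_000)

print("on-demand cost before/after:", on_demand_cost(before), on_demand_cost(after))
print("slot-hours before/after:", slot_hours(before), slot_hours(after))
```

With identical bytes billed, the on-demand estimate is the same for both runs no matter how much was shuffled. Under a reservation the bill does not change either, but a run that holds more slot-hours leaves less capacity for other queries sharing the same slots.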