Search code examples
sqlgoogle-bigquerypartitioningdbtdata-partitioning

Is changing date partitionning granularity a breaking change?


In Bigquery, suppose I create a table and partition it by a date column "mydate" with a "DAY" granularity.

Using DBT, this can be done using :

partition_by = {
            "field":"mydate",
            "granularity":"DAY",
            "data_type":"DATE"
        }

Then suppose I expose my table and it starts to be used by some consumers.

Then one day I decide to change the partition granularity to "MONTH", that is, the table will still be partitionned by "mydate", except that now the granularity used is "MONTH":

In DBT:

partition_by = {
            "field":"mydate",
            "granularity":"MONTH",
            "data_type":"DATE"
        }

Is this a breaking change (like when we remove a column) ? What would be the impacts of this granularity change ?

This would probably cause some performance drop on consumers side (if for example in their queries they scan specific dates, because after the granularity change their queries will have to scan at least one entire month). But I may be missing some more critical consequences.


Solution

  • When you partition a BigQuery table, all data associated with the partition is internally stored in the same partition storage. When you partitioned by day, think (logically) that the data for a given day would all be "together" in one "file". When you submit a query that references the partitioning column, BigQuery can then determine which files to examine and which can be ignored (prunining). If the table were repartitioned to be by month, the previously working queries would still work (partitioning is merely an optimization) ... however, now when a query is executed, it is likely more data will have to be examined. There is a limit to the number of partitions you can have (currently, 10,000 used to be 4000).

    However, I do not believe that you can change a partition range on an existing table. I have always seen the table partition settings set when the table is created. It is likely that you would have to recreate the table with the new partitioning. This is because BigQuery has to re-arrange the partition files ... in your example, instead of having a file by day (as it is today) you would have a file per month. BigQuery would have to re-build those tables from your current day partitioning.