Tags: google-cloud-platform, google-bigquery

What would be the best strategy if you need to partition a table by a field, but the resulting number of partitions exceeds the limit?


Ok, so I have the following case:

I am using an incremental strategy (with dbt and Airflow) to feed data into a BigQuery table. BigQuery has a limit of 10,000 partitions per table, so I have the table partitioned by date and clustered by account.

The thing is that this performs well for incrementally adding data several times a day, but if, for example, I need to backfill historic data for a certain account, I have to re-create the whole table. I think this would work better if I could partition the table by account, but there are more than 50,000 accounts.

Is there a sharding or merging strategy or something else that I could follow given this scenario? Are there best practices when it comes to cases like this in BigQuery?


Solution

  • At load time, synthesize a column that holds a partial account id, so that the number of distinct values fits within the partition limit.

    For example: if account_id is a number in the range 100,000 to 150,000, divide it to get ~10K distinct values, e.g. DIV(account_id, 5) AS account_set. That gives you values in the range 20,000 to 30,000.
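
    As a rough sketch (the source table name staging_account_data is assumed for illustration), the derived column can be computed in the load query itself using BigQuery's integer division, DIV():

    -- Derive account_set from account_id at load time.
    -- account_ids in 100,000 to 150,000 map to account_set values 20,000 to 30,000.
    INSERT INTO account_data (account_id, account_set)  -- plus any other columns
    SELECT
      account_id,
      DIV(account_id, 5) AS account_set
    FROM staging_account_data;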

    Note: if you have non-numeric account_ids, this will require a deterministic mapping to a numeric value for the account_set.
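
    One deterministic option for string ids (a sketch, not part of the original answer) is to hash them into a fixed number of buckets with FARM_FINGERPRINT; note that the partition range would then be 0 to 9,999 rather than 20,000 to 30,000:

    -- Hash a STRING account_id into ~10K deterministic buckets.
    -- ABS(MOD(...)) keeps the result in the range 0 to 9,999.
    SELECT
      account_id,
      ABS(MOD(FARM_FINGERPRINT(account_id), 10000)) AS account_set
    FROM staging_account_data;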

    Create your table as:

    -- Integer columns must be partitioned with RANGE_BUCKET in BigQuery
    CREATE TABLE account_data (..., account_id INT64, account_set INT64, ...)
    PARTITION BY RANGE_BUCKET(account_set, GENERATE_ARRAY(20000, 30000, 1))
    CLUSTER BY account_set, account_id
    

    When you query, specify both account_set and account_id in the predicate. BigQuery will prune to the right partition using account_set, then use the (account_set, account_id) clustering to select the subset of records within that partition.
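
    For example, a point lookup might look like this (123456 is a made-up account_id; DIV(123456, 5) = 24691):

    -- The constant account_set filter lets BigQuery prune to a single partition;
    -- clustering on (account_set, account_id) narrows the scan within it.
    SELECT *
    FROM account_data
    WHERE account_set = 24691
      AND account_id = 123456;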

    One caveat is that partition_expiration_days will affect all accounts that fall in the same partition. You'll need a different approach if you need to expire data.