Ok, so I have the following case:
I am using an incremental strategy (with dbt and Airflow) to feed data into a BigQuery table. Since BigQuery has a limit of 10,000 partitions per table, I have the table partitioned by date and clustered by account.
This performs well for incrementally appending data several times a day, but if, for example, I need to backfill historic data for a certain account, I have to re-create the whole table. I think this could work better if I could partition the table by account, but there are more than 50k accounts.
Is there a sharding or merging strategy, or something else, that I could follow in this scenario? Are there best practices for cases like this in BigQuery?
At load time, synthesize a column that holds a bucketed account id, so that the number of distinct values fits within the partition limit.
For example: if account_id is a number in the range 100,000 to 150,000, divide it to produce roughly 10K distinct values, e.g. DIV(account_id, 5) AS account_set. That will give you values in the range 20,000 to 30,000. Note that if you have non-numeric account_ids, you will need a deterministic mapping to a numeric value for account_set.
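A minimal sketch of that load-time projection, assuming a staging table named raw_account_data (a placeholder name) and a numeric account_id; the commented-out line shows one deterministic option for string ids:

SELECT
  account_id,
  DIV(account_id, 5) AS account_set  -- ids 100,000-150,000 map to ~10K buckets (20,000-30,000)
  -- for non-numeric ids, hash into ~10K buckets instead:
  -- MOD(ABS(FARM_FINGERPRINT(account_id)), 10000) AS account_set
FROM raw_account_data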
Create your table with integer-range partitioning (BigQuery only supports partitioning on an integer column via RANGE_BUCKET):
CREATE TABLE account_data (..., account_id INT64, account_set INT64, ...)
PARTITION BY RANGE_BUCKET(account_set, GENERATE_ARRAY(20000, 30000, 1))
CLUSTER BY account_set, account_id
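With GENERATE_ARRAY(20000, 30000, 1) each account_set value gets its own partition, about 10,000 ranges, which sits right at BigQuery's partition limit. Rows whose account_set falls outside the declared range land in the __UNPARTITIONED__ partition, so size the bounds (or use a coarser interval) with some headroom.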
When you query, specify both account_set and account_id in the predicate. BigQuery will select the partition using account_set and then use the account_set, account_id clustering to select the subset of records within that partition.
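For instance, a query like the following, assuming the table above and an illustrative numeric id of 123456:

SELECT *
FROM account_data
WHERE account_set = 24691   -- = DIV(123456, 5); constant filter prunes to a single partition
  AND account_id = 123456   -- clustering narrows the scan within that partition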
One caveat is that partition_expiration_days will affect all accounts that fall in that partition, so you'll need a different approach if you need to expire data.
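One possible alternative is a scheduled delete; this sketch assumes the table keeps a load_date column and a 90-day retention window, both of which are assumptions here:

DELETE FROM account_data
WHERE load_date < DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)  -- run periodically, e.g. as a scheduled query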