
When to use partitioning in BigQuery


I was hoping to draw on your experience with BigQuery to decide the best structure for my data. I have 5 years of experience with relational databases, especially Oracle DB, and I am familiar with the best practices for building a data warehouse on a relational database, but when it comes to cloud solutions I'm still new.

My question is about partitioned tables in BigQuery. As far as I know, BigQuery only supports partitioning by day.

A lot of my tables have a low daily row count (between 1K and 12K max), but they contain data covering a long period of time. Is it a good idea to partition such a table by day? If I split the data into one table per year and created a view over all the tables, would that help? Or is there another solution in BigQuery? In your experience, when is partitioning in BQ efficient, and how large should the daily data volume be?

If you have any references or books about best practices for using BQ for a DWH, please share them.


Solution

  • A lot of my data has a low daily row count (between 1K and 12K max), but covers a long period of time; is it a good idea to partition the table by day?

    That depends on how wide a timeframe your queries typically cover.

    Case A.
    If the maximum timeframe of your queries is, say, 1 week, then daily partitioning can be very beneficial in terms of both speed and cost, because the engine doesn't need to scan the whole table: it reads only the 7 relevant daily partitions.
    Example: On the public dataset bigquery-public-data.samples.github_timeline, a query covering 1 day processes hundreds of megabytes because it scans the full table. Running the same 1-day query against a similar table partitioned by day reduces the data processed to ~10 MB.
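    The setup for Case A can be sketched in BigQuery standard SQL. The table and column names here (`mydataset.events`, `created_at`) are hypothetical placeholders, not from the question:

    ```sql
    -- Create a daily-partitioned copy of a hypothetical events table.
    CREATE TABLE mydataset.events_partitioned
    PARTITION BY DATE(created_at)
    AS SELECT * FROM mydataset.events;

    -- With a filter on the partitioning column, the engine prunes to the
    -- 7 partitions in the range instead of scanning the whole table.
    SELECT COUNT(*)
    FROM mydataset.events_partitioned
    WHERE DATE(created_at) BETWEEN '2019-01-01' AND '2019-01-07';
    ```

    Partition pruning only kicks in when the filter is expressed on the partitioning column itself; wrapping it in a function the optimizer cannot unwrap forces a full scan.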

    Case B.
    On the other hand, if your most frequently run queries cover a timeframe of one or two years (i.e., close to the whole span of data in the table), daily partitions don't help much: the engine has to process nearly all of the table anyway. Clustering can even have a detrimental effect on data usage if the amount of daily data is smaller than the minimum cluster size, because the engine ends up reading more disk space than strictly necessary.
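    For the low daily volumes in the question, clustering on a frequently filtered column may be worth testing alongside (or instead of relying on) daily partitions. A minimal sketch, again using hypothetical names (`mydataset.events`, `created_at`, `customer_id`):

    ```sql
    -- Partitioned and clustered table: within each daily partition, rows are
    -- co-located by customer_id, so filters on that column read fewer blocks.
    CREATE TABLE mydataset.events_clustered
    PARTITION BY DATE(created_at)
    CLUSTER BY customer_id
    AS SELECT * FROM mydataset.events;
    ```

    Whether this reduces bytes processed in practice depends on the data volume per partition, which is exactly the minimum-cluster-size concern raised above; the only reliable way to decide is to compare the bytes-processed estimate for your real queries against both table layouts.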

    It would be helpful if these considerations were documented, along with the actual minimum cluster size and the minimum partition size (if any); if those figures are dynamic, the expected ranges would be useful to know.
