Search code examples
google-bigquerydatabase-partitioningclustered-index

How can I improve the amount of data queried with a partitioned+clustered table?


I have a BigQuery table - day partitioned, and clustered. However, it still uses a lot of data when I run queries over it. How is this possible?


Solution

  • Sometimes no partitions, or weekly/monthly/yearly partitions will work way better than having a daily partitioned table + clustering.

    This because each cluster of data in BigQuery has a minimum size. If each day of data in a daily partitioned table has less than that amount of data, you won't see any benefits at all from clustering your table.

    For example, let's create a table with 30+ years of weather. I will partition this table by month (to fit multiple years into one table):

    CREATE TABLE `temp.gsod_partitioned`
    PARTITION BY date_month
    CLUSTER BY name
    AS 
    SELECT *, DATE_TRUNC(date, MONTH) date_month
    FROM `fh-bigquery.weather_gsod.all` 
    

    Now, let's run a query over it - using the clustering field name:

    SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
    FROM `temp.gsod_partitioned`
    WHERE name LIKE 'SAN FRANC%'
    AND date > '1980-01-01'
    GROUP BY 1,2
    ORDER BY active_until DESC 
    # (2.3 sec elapsed, 3.1 GB processed)
    

    Now, let's do this over an identical table - partitioned by a fake date (so no partitioning really), and clustered by the same column:

    SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until
    FROM `fh-bigquery.weather_gsod.all` 
    WHERE name LIKE 'SAN FRANC%'
    AND date > '1980-01-01'
    GROUP BY 1,2
    ORDER BY active_until DESC
    # (1.5 sec elapsed, 62.8 MB processed)
    

    Only 62.8 MB of data (vs 3.1GB) were processed!

    This because clustering without partitions is much more efficient on tables that don't have a lot of GB per day.

    Bonus: Clustered by geo:

    SELECT name, state, ARRAY_AGG(STRUCT(date,temp) ORDER BY temp DESC LIMIT 5) top_hot, MAX(date) active_until  
    FROM `fh-bigquery.weather_gsod.all_geoclustered`  
    WHERE date > '1980-01-01'
    AND ST_DISTANCE(point_gis, ST_GEOGPOINT(-122.465, 37.807)) < 40000
    GROUP BY 1,2
    ORDER BY ST_DISTANCE(ANY_VALUE(point_gis), ST_GEOGPOINT(-122.465, 37.807))
    # (2.1 sec elapsed, 100.7 MB processed)