Tags: sql, google-bigquery, clustering-key

Querying a BigQuery table clustered by geography does not improve performance when intersecting with geographies from other tables


I have a large (250 GB) BigQuery table representing a grid of values, with each grid cell stored as a polygon. The table is clustered by this geography column. I have a second table containing various polygons, and I want to retrieve the grid-cell values that fall within one of them (for example, query some weather data within the boundaries of a specific country).

When I perform a spatial intersection of the grid table with a single polygon built with ST_GEOGFROMGEOJSON, the query cost (bytes processed) is low (about 100 MB). But when I run the same query with the polygon coming from a subquery, the table is fully scanned and the cost is high (250 GB).

Here is a reproducible example. Querying the following public dataset with a literal polygon:

SELECT npp FROM `bigquery-public-data.modis_terra_net_primary_production.MODIS_MOD17A3HGF` 
WHERE year=2020
AND ST_INTERSECTS(geography_polygon, ST_GEOGFROMGEOJSON('{ "type": "Polygon", "coordinates": [ [ [ -101.1, 50 ], [ -101, 50 ], [ -101, 56 ], [ -101.1, 56 ], [ -101.1, 50 ] ] ] }'))

-- Bytes processed -- 785.45 MB -- Bytes billed -- 786 MB

With the same polygon supplied via a subquery on a table:

CREATE OR REPLACE TABLE `tmp.polygon`
AS SELECT ST_GEOGFROMGEOJSON('{ "type": "Polygon", "coordinates": [ [ [ -101.1, 50 ], [ -101, 50 ], [ -101, 56 ], [ -101.1, 56 ], [ -101.1, 50 ] ] ] }') geog;

SELECT npp FROM `bigquery-public-data.modis_terra_net_primary_production.MODIS_MOD17A3HGF` 
WHERE year=2020
AND ST_INTERSECTS(geography_polygon, (SELECT geog FROM `tmp.polygon`))

-- Bytes processed -- 522.15 GB -- Bytes billed -- 522.15 GB

I got the same results (a full scan) using a CROSS JOIN or an INNER JOIN instead of the subquery.
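For completeness, this is a sketch of the join form referred to above (table and column names as in the reproducible example); since the geography still isn't a constant at planning time, it scans the full table just like the subquery version:

```sql
-- CROSS JOIN variant: the polygon comes from another table, so BigQuery
-- cannot prune clustered blocks and bills the full table scan.
SELECT t.npp
FROM `bigquery-public-data.modis_terra_net_primary_production.MODIS_MOD17A3HGF` t
CROSS JOIN `tmp.polygon` p
WHERE t.year = 2020
  AND ST_INTERSECTS(t.geography_polygon, p.geog);
```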

Why doesn't clustering pruning work when the polygon comes from a subquery or join? Is there a way to decrease the query cost without adding custom clustering/partitioning columns?


Solution

  • BigQuery only implements this clustering optimization when the geography in the filter is a constant.

    A workaround is to split the query into two parts using BigQuery procedural language (https://cloud.google.com/bigquery/docs/reference/standard-sql/procedural-language): first read the polygon into a variable, then use the variable, which is a constant by the time the second query runs, in the filter.

    DECLARE poly GEOGRAPHY;
    SET poly = (SELECT geog FROM `tmp.polygon`);
    
    SELECT npp FROM `bigquery-public-data.modis_terra_net_primary_production.MODIS_MOD17A3HGF` 
    WHERE year=2020
    AND ST_INTERSECTS(geography_polygon, poly);
    

    This bills 786 MB.

    Update, for the case when the polygon is too complex and triggers a query-evaluation limit error: store only its bounding box in the variable (to get cluster pruning), and post-filter with the actual polygon for exact results:

    DECLARE box STRUCT<minx FLOAT64, miny FLOAT64, maxx FLOAT64, maxy FLOAT64>;
    SET box = (SELECT ST_BoundingBox(geog) FROM `tmp.polygon`);
    
    SELECT npp FROM `bigquery-public-data.modis_terra_net_primary_production.MODIS_MOD17A3HGF` 
    WHERE year=2020
    AND ST_INTERSECTS(geography_polygon, (SELECT geog FROM `tmp.polygon`))
    AND ST_INTERSECTSBOX(geography_polygon, box.minx, box.miny, box.maxx, box.maxy);
    

    This also bills 786 MB, although in rare cases it might bill more, because the bounding box is a laxer filter than the polygon itself; the exact ST_INTERSECTS check still guarantees correct results.