google-bigquery, query-optimization, geospatial

Why do I get a different execution plan when I write the polygon as text compared to getting it from a table? (BigQuery ST_INTERSECTS)


Using geospatial data in BigQuery, I'm trying to reduce the cost of my queries.

I have two queries, one where I filter on a polygon that I get from a table:

with my_polygon as (select polygon from `data-lake.extractions.temp_polygon`)
SELECT
    aid,
    start_time,
    point
  FROM `data-lake.mobility.mobility_agg` AS ma, my_polygon
  WHERE
    (date >= "2023-03-03" AND date <= "2023-03-03")
    AND dwell_time > 0
    AND st_intersects(ma.point, my_polygon.polygon)

And one where I write the same polygon as plain text:

with my_polygon as (select st_geogfromtext('POLYGON((2.3470047676083 48.8598609925218, 2.34673730167323 48.8598501955807, 2.34647817705671 48.8598052761174, 2.34623735224696 48.8597279604503, 2.34602408234192 48.8596212199183, 2.34584656332996 48.8594891566731, 2.34571161710046 48.8593368460093, 2.34562442929261 48.8591701412964, 2.3455883500551 48.8589954490133, 2.34560476536727 48.8588194825343, 2.34567304385666 48.8586490041307, 2.34579056114475 48.8584905651035, 2.34595280077252 48.8583502540342, 2.34615352781658 48.8582334628257, 2.34638502851587 48.8581446795206, 2.34663840669705 48.8580873158555, 2.34690392560853 48.8580635761734, 2.34717138203294 48.8580743727291, 2.34743049831106 48.858119290639, 2.34767131722413 48.8581966038215, 2.34788458457191 48.8583033413155, 2.34806210475537 48.8584354014321, 2.34819705570677 48.8585877093529, 2.34828425106805 48.8587544121261, 2.34832033954058 48.858929103568, 2.34830393373907 48.8591050704323, 2.34823566358807 48.8592755503892, 2.34811815219666 48.8594339919012, 2.34795591512614 48.8595743060084, 2.34775518691058 48.8596911003455, 2.3475236814894 48.8597798863936, 2.34727029575479 48.8598372519985, 2.3470047676083 48.8598609925218))') as polygon)
SELECT
    aid,
    start_time,
    point
  FROM `data-lake.mobility.mobility_agg` AS ma, my_polygon
  WHERE
    (date >= "2023-03-03" AND date <= "2023-03-03")
    AND dwell_time > 0
    AND st_intersects(ma.point, my_polygon.polygon)

(mobility_agg is clustered by date and point; temp_polygon is clustered by polygon.)

The two queries are equivalent but have different execution plans. For the first one, BigQuery first reads the whole mobility_agg table and only then joins against the polygon.

In the second one, BigQuery filters on the polygon while reading the mobility_agg table, which substantially reduces the data scanned during this step.

I've tried different ways of writing the first query, but in none of them could I push the polygon filter into the initial read step of the execution plan (to reduce the data scanned) when the polygon is stored in a table.

Does anyone know why this is the case and how I could change my query or table to reduce the volume scanned while using a polygon stored in a table?


Solution

  • In the first version, BigQuery does not know in advance that there is only a single polygon, so it cannot prune data while reading the table. The way to make the first version cheap is to split the query in two using BigQuery scripting:

    DECLARE my_polygon GEOGRAPHY;
    SET my_polygon = (select polygon from `data-lake.extractions.temp_polygon` );
    
    SELECT
        aid,
        start_time,
        point
    FROM `data-lake.mobility.mobility_agg` AS ma
    WHERE
        (date >= "2023-03-03" AND date <= "2023-03-03")
        AND dwell_time > 0
        AND st_intersects(ma.point, my_polygon)
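
    One note on the SET statement: a scalar subquery raises an error at runtime if it returns more than one row, so this script assumes temp_polygon holds exactly one polygon. If the table could ever contain several rows, a sketch of a guard (LIMIT 1 picks an arbitrary row; ST_UNION_AGG instead merges all polygons into one geography):

    DECLARE my_polygon GEOGRAPHY;
    -- Either pick one row ...
    SET my_polygon = (SELECT polygon FROM `data-lake.extractions.temp_polygon` LIMIT 1);
    -- ... or merge all rows into a single geography:
    -- SET my_polygon = (SELECT ST_UNION_AGG(polygon) FROM `data-lake.extractions.temp_polygon`);

    Either way, my_polygon is a plain scalar by the time the main SELECT runs, so BigQuery can again filter while reading mobility_agg.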