Search code examples
google-bigquerydistancegeo

Optimize Distance Calculation in BigQuery


I am trying to optimize the below query in BigQuery.

  Table1 has ~400K rows
  Table2 has 34M rows

I have to map each ID in Table1 with the closest Zipcode in Table2.

Both Table1 and Table2 Have Latitude and Longitude Data

WITH
tmp1 AS (
SELECT
 ID, latitude, longitude 
 FROM `Table1`),

tmp2 AS (
SELECT
  CAST(ZipCode AS string) AS ZipCode ,lat,lon
  FROM `Table2` )

 SELECT
 AS VALUE ARRAY_AGG(STRUCT<ID STRING,ZipCode STRING, distance int64>(ID,
  ZipCode,
  CAST(ST_DISTANCE(tmp1.point,
      tmp2.point) AS int64))
  ORDER BY
  ST_DISTANCE(tmp1.point,
  tmp2.point)
  LIMIT
   1)[
  OFFSET
  (0)]
  FROM 
  (
  SELECT
 ID,ST_GEOGPOINT(longitude,latitude) point
  FROM tmp1) tmp1
  CROSS JOIN (
 SELECT
 ZipCode, ST_GEOGPOINT(lon, lat) point
  FROM tmp2) tmp2

Will appreciate any help on this!


Solution

  • BigQuery can very efficiently do spatial join, matching items across two tables that are within specific distance of each other. But you need to know that specific distance, or try several ones until you update all the points.

    This post discusses it in more details: https://medium.com/@mentin/nearest-neighbor-in-bigquery-gis-7d50ebd5d63

    You can automate it using BigQuery scripting, here is an idea, although it discussed a slightly different problem, a nearest geometry to a single point: https://medium.com/@mentin/nearest-neighbor-using-bq-scripting-373241f5b2f5