I am trying to optimize the below query in BigQuery.
Table1 has ~400K rows
Table2 has 34M rows
I have to map each ID in Table1 with the closest Zipcode in Table2.
Both Table1 and Table2 Have Latitude and Longitude Data
WITH
tmp1 AS (
SELECT
ID, latitude, longitude
FROM `Table1`),
tmp2 AS (
SELECT
CAST(ZipCode AS string) AS ZipCode ,lat,lon
FROM `Table2` )
SELECT
AS VALUE ARRAY_AGG(STRUCT<ID STRING,ZipCode STRING, distance int64>(ID,
ZipCode,
CAST(ST_DISTANCE(tmp1.point,
tmp2.point) AS int64))
ORDER BY
ST_DISTANCE(tmp1.point,
tmp2.point)
LIMIT
1)[
OFFSET
(0)]
FROM
(
SELECT
ID,ST_GEOGPOINT(longitude,latitude) point
FROM tmp1) tmp1
CROSS JOIN (
SELECT
ZipCode, ST_GEOGPOINT(lon, lat) point
FROM tmp2) tmp2
Will appreciate any help on this!
BigQuery can very efficiently do spatial join, matching items across two tables that are within specific distance of each other. But you need to know that specific distance, or try several ones until you update all the points.
This post discusses it in more details: https://medium.com/@mentin/nearest-neighbor-in-bigquery-gis-7d50ebd5d63
You can automate it using BigQuery scripting, here is an idea, although it discussed a slightly different problem, a nearest geometry to a single point: https://medium.com/@mentin/nearest-neighbor-using-bq-scripting-373241f5b2f5