Search code examples
sqlgoogle-bigquerygeospatialouter-joinspatial-query

How to do an OUTER join with a spatial join in BigQuery to be able to get a count value of 0 where there is no intersection?


I'm doing a fairly conventional point/polygon spatial join in order to determine the number of points that fall in each polygon. The result (from the left join) includes all of the polygons + the count where there is an intersection. But it doesn't return the polygon + 0 or null where there is no intersection. I'd like to still have the polygon geometry as it looks better on a map (vs. just a missing area).

 WITH
  geoidandcount AS(
  SELECT
    a.geo_id,
    COUNT(b.latitude) AS count
  FROM
    `polygon_dataset` a
  LEFT OUTER JOIN -- ERROR
    `point_dataset` b
  ON
    st_contains(a.the_geom,
      ST_GEOGPOINT(b.longitude,
        b.latitude))
  GROUP BY
    a.geo_id )
SELECT
  a.*,
  b.the_geom,
  a.count,
FROM
  geoidandcount a
JOIN
  `polygon_dataset` b
USING
  (geo_id)

I had to do the CTE as you can't group by geography.

enter image description here


Solution

  • Yes, BigQuery cannot yet optimize geospatial OUTER JOIN, which is why you get this error. The workaround is to do geospatial INNER JOIN followed by OUTER JOIN on some unique ID in the OUTER table.

    To demonstrate on simpler example, let's take this part of the query:

      SELECT
        a.geo_id,
        b.latitude
      FROM
        `polygon_dataset` a
      LEFT OUTER JOIN -- ERROR
        `point_dataset` b
      ON
        st_contains(a.the_geom, ST_GEOGPOINT(b.longitude, b.latitude))
    

    Assuming geo_id is unique in the left table, this can be replaced with

      SELECT
        a.geo_id,
        c.latitude
      FROM
        `polygon_dataset` a
      LEFT OUTER JOIN
      (
        SELECT 
          a.geo_id
          b.latitude
        FROM
          `polygon_dataset` a
        JOIN
          `point_dataset` b
        ON
          st_contains(a.the_geom, ST_GEOGPOINT(b.longitude, b.latitude))
      ) c
      ON a.geo_id = c.geo_id
    

    Here the count would be NULL when nothing is matched on the right, so let's use COALESCE(count, 0) to get 0.

    The full query will be something like

      SELECT
        a.geo_id,
        a.the_geom,
        COALESCE(c.count, 0) AS count
      FROM
        `polygon_dataset` a
      LEFT OUTER JOIN
      (
        SELECT 
          a.geo_id
          COUNT(b.latitude) as count
        FROM
          `polygon_dataset` a
        JOIN
          `point_dataset` b
        ON
          st_contains(a.the_geom, ST_GEOGPOINT(b.longitude, b.latitude))
        GROUP BY
          geo_id
      ) c
      ON a.geo_id = c.geo_id