I'm doing a fairly conventional point/polygon spatial join in order to determine the number of points that fall in each polygon. The result (from the left join) includes all of the polygons + the count where there is an intersection. But it doesn't return the polygon + 0 or null where there is no intersection. I'd like to still have the polygon geometry as it looks better on a map (vs. just a missing area).
WITH
geoidandcount AS(
SELECT
a.geo_id,
COUNT(b.latitude) AS count
FROM
`polygon_dataset` a
LEFT OUTER JOIN -- ERROR
`point_dataset` b
ON
st_contains(a.the_geom,
ST_GEOGPOINT(b.longitude,
b.latitude))
GROUP BY
a.geo_id )
SELECT
a.*,
b.the_geom,
a.count,
FROM
geoidandcount a
JOIN
`polygon_dataset` b
USING
(geo_id)
I had to do the CTE as you can't group by geography.
Yes, BigQuery cannot yet optimize geospatial OUTER JOIN, which is why you get this error. The workaround is to do geospatial INNER JOIN followed by OUTER JOIN on some unique ID in the OUTER table.
To demonstrate on simpler example, let's take this part of the query:
SELECT
a.geo_id,
b.latitude
FROM
`polygon_dataset` a
LEFT OUTER JOIN -- ERROR
`point_dataset` b
ON
st_contains(a.the_geom, ST_GEOGPOINT(b.longitude, b.latitude))
Assuming geo_id
is unique in the left table, this can be replaced with
SELECT
a.geo_id,
c.latitude
FROM
`polygon_dataset` a
LEFT OUTER JOIN
(
SELECT
a.geo_id
b.latitude
FROM
`polygon_dataset` a
JOIN
`point_dataset` b
ON
st_contains(a.the_geom, ST_GEOGPOINT(b.longitude, b.latitude))
) c
ON a.geo_id = c.geo_id
Here the count would be NULL when nothing is matched on the right, so let's use COALESCE(count, 0)
to get 0.
The full query will be something like
SELECT
a.geo_id,
a.the_geom,
COALESCE(c.count, 0) AS count
FROM
`polygon_dataset` a
LEFT OUTER JOIN
(
SELECT
a.geo_id
COUNT(b.latitude) as count
FROM
`polygon_dataset` a
JOIN
`point_dataset` b
ON
st_contains(a.the_geom, ST_GEOGPOINT(b.longitude, b.latitude))
GROUP BY
geo_id
) c
ON a.geo_id = c.geo_id