I work with PostGIS and quite large data of around 1.5 billion rows, I want to know which update is faster based on matching strings of one column with another column compared to spatial join (ST_Intersects)?
The query I mean is:
UPDATE table_grid AS gd
SET col_grid = col_adm
FROM table_adm de
WHERE ST_Intersects(de.geom, gd.geom);
versus:
UPDATE table_grid AS gd
SET col_grid = col_adm
FROM table_adm de
WHERE gd.col_match1 = de.col_match1
AND gd.col_match2 = de.col_match2
AND gd.col_match3 = de.col_match3;
Please give reasons why one query is faster than the other
As already underlined by the comments, without additional details that's a very abstract question. Based on your previous thread, I'll assume that:
In that case, the three-string match will be typically (much) faster. You can adjust the demo from the old thread and see the reason for yourself: demo
join
on 3 text
values is clear, precise and specific - Postgres knows the matched row based on the index alone. Text
-to-text
comparison in the Index Cond
is fast and there's no Filter
stage required.ST_Intersects()
, the index only helps the first stage where it narrows down match candidates using a less precise bounding box intersection operator &&
, but after that it has to perform a fairly expensive, vertex-by-vertex intersection check for all candidates on the Filter
stage. &&
descending down the R-tree is quick, but the final comparison has to do some planar/spherical geometry math, which will always be more expensive than a (hash) equality check it'd do if it just compared strings.On 380 admin areas and 20k grid cells, 3-text match takes 50ms
, st_intersects()
takes 300ms
.
I'm speeding up both updates by creating the table_grid
with fillfactor=50
- that way all pages of the table have 50% empty space so that when update
comes, the new version of the tuple can be saved right behind the old version.
Although in the demo I made these functionally equivalent by establishing the 3-text match based on the st_intersects()
in the first place, keep in mind the latter is undeterministic. As established in the old thread, without additional match conditions, whenever a grid cell intersects more than one admin area, Postgres will match it unpredictably. A deterministic version of that intersection-based match will be even slower.