I'm using PostgreSQL (v. 12.14) and PostGIS to model GPS tracks that pass through "areas". I've created a materialized view that maintains an array of areas-passed-through per track, and the update performance is godawful. I'm trying to understand how the (fairly naive) query works, and how I can improve it.
Here are the relevant parts of the schema:
create table track (
    start timestamp,
    "end" timestamp,   -- "end" and "user" are reserved words, so they need quoting
    "user" text
);

create table gps_point (
    create_time timestamp,
    point geometry(point, 4326),
    "user" text
);

create table area (
    name text,
    polygon geometry(polygon, 4326)
);
Note there are no foreign keys between these tables -- I did not create this schema, and would have at least added a foreign key from gps_point to track.
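Something like the sketch below, though it is purely hypothetical here: the schema as given has no key column on track to reference, so the id and track_id names are made up for illustration.

-- Hypothetical: give track a surrogate key, then reference it from gps_point.
alter table track add column id bigint generated always as identity primary key;
alter table gps_point add column track_id bigint references track (id);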
The guts of the view are generated as:
select track.user, track.start, track.end, array_agg(distinct area.name)
from track
join gps_point on (gps_point.create_time between track.start and track.end
                   and gps_point.user = track.user)
join area on st_covers(area.polygon, gps_point.point)
group by track.user, track.start, track.end
So effectively I'm using a temporal between to achieve one half of the join, and a spatial st_covers to do the other half.
This is ungodly slow (I have indexes on basically all of the columns shown above except for area.name). In principle I understand that the query is piling up far, far more rows than it needs (hence the distinct area.name), and that there might be a way to "short circuit": there's an enormous amount of data in the gps_point table, but all I need is a single ping within an area to know that the area should be included. That feels like an "EXISTS", but I can't figure out how to get an "EXISTS" in there.
Here's the output of explain analyze; obviously the outer nested loop is where the whole thing is falling apart, but I don't know what work that represents:
GroupAggregate  (cost=28495760.61..28768028.29 rows=3768 width=108) (actual time=51901.377..53247.436 rows=3055 loops=1)
  Group Key: track.user, track.start, track.end
  ->  Sort  (cost=28495760.61..28550204.73 rows=21777646 width=80) (actual time=51900.803..52624.699 rows=689231 loops=1)
        Sort Key: track.user, track.start, track.end
        Sort Method: external merge  Disk: 63488kB
        ->  Nested Loop  (cost=0.70..22938476.09 rows=21777646 width=80) (actual time=17.638..48055.263 rows=689231 loops=1)
              ->  Nested Loop  (cost=0.41..8420387.25 rows=46063701 width=72) (actual time=7.599..36250.753 rows=843071 loops=1)
                    ->  Seq Scan on area  (cost=0.00..6.75 rows=75 width=135) (actual time=0.028..0.892 rows=109 loops=1)
                    ->  Index Scan using point_idx on gps_point g  (cost=0.41..112267.42 rows=432 width=100) (actual time=3.387..320.934 rows=7735 loops=109)
                          Index Cond: (point @ area.polygon)
                          Filter: st_covers(area.polygon, point)
                          Rows Removed by Filter: 2048
              ->  Index Scan using track_user_start_idx on track gr  (cost=0.28..0.31 rows=1 width=76) (actual time=0.010..0.011 rows=1 loops=843071)
                    Index Cond: (((user)::text = (g.user)::text) AND (start <= g.create_time))
                    Filter: (g.create_time <= "end")
                    Rows Removed by Filter: 2
Planning Time: 12.674 ms
Execution Time: 53259.611 ms
Lastly, some statistics about the tables:
Rows in track table: 4427
Average GPS points per track: 1000
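(That works out to roughly 4427 × 1000 ≈ 4.4 million rows in gps_point.)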
How can I improve this?
In order to make use of an EXISTS subquery, you would need to do a cross join between track and area. That sounds like a pretty horrible idea, but in this case I would say it is actually worth a try.
select track.start, track.end, area.name
from track
cross join area
where exists (
    select 1
    from gps_point
    where gps_point.create_time between track.start and track.end
      and gps_point.user = track.user
      and st_covers(area.polygon, gps_point.point)
)
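If you need the same shape as the original view (one row per track with an array of area names), you can aggregate on top of the same pattern -- a sketch along those lines, not tuned or tested against your data:

select track.user, track.start, track.end,
       array_agg(area.name) as areas   -- distinct is unnecessary here: each (track, area) pair appears at most once
from track
cross join area
where exists (
    select 1
    from gps_point
    where gps_point.create_time between track.start and track.end
      and gps_point.user = track.user
      and st_covers(area.polygon, gps_point.point)
)
group by track.user, track.start, track.end;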
This could be supported by a multi-column GiST index, something like:
create index on gps_point using gist (point, "user", create_time)
You will probably need the btree_gist extension to include the scalar types in the index. It also isn't obvious which order of columns in the index would be optimal; you could try different orders and see which one works best.
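For example (creating the extension requires the appropriate privileges; the second statement is just one alternative column order to compare):

create extension if not exists btree_gist;

-- alternative ordering, leading with the equality-matched column
create index on gps_point using gist ("user", create_time, point);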