Tags: sql, postgresql, indexing, postgis

Efficiency of SQL query that performs joins using temporal/spatial bounds


I'm using PostgreSQL (v. 12.14) and PostGIS to model GPS tracks that pass through "areas". I've created a materialized view that maintains an array of areas-passed-through per track, and the update performance is godawful. I'm trying to understand how the (fairly naive) query works, and how I can improve it.

Here are the relevant parts of the schema:

create table track (
    start timestamp,
    "end" timestamp,
    "user" text
);

create table gps_point (
    create_time timestamp,
    point geometry(point, 4326),
    "user" text
);

create table area (
    name text,
    polygon geometry(polygon, 4326)
);

Note there are no foreign keys between these tables -- I did not create this schema, and would have at least added a foreign key from gps_point to track.
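
For what it's worth, the link I have in mind would look something like this (purely hypothetical -- the column and constraint names are mine, and the schema isn't mine to change):

alter table track add column id bigint generated always as identity primary key;

-- gps_point would then reference its parent track directly,
-- backfilled by matching "user" against the start/"end" window
alter table gps_point add column track_id bigint references track (id);
create index on gps_point (track_id);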

The guts of the view are generated as:

select track."user", track.start, track."end", array_agg(distinct area.name)
  from track
    join gps_point on (gps_point.create_time between track.start and track."end"
                       and gps_point."user" = track."user")
    join area on st_covers(area.polygon, gps_point.point)
  group by track."user", track.start, track."end";

So effectively I'm using a temporal between to achieve one half of the join, and a spatial st_covers to do the other half.

This is ungodly slow (I have indexes on basically all the columns shown above except for area.name). In principle I understand that the query is piling up far, far more rows than it needs (hence the distinct area.name), and that there might be a way to "short circuit": there's an enormous amount of data in the gps_point table, but all I need is a single ping within an area to know that it should be included. That feels like an "EXISTS", but I can't figure out how to get an "EXISTS" in there.

Here's the output of explain analyze; obviously the outer nested loop is where the whole thing is falling apart, but I don't know what work that represents:

 GroupAggregate  (cost=28495760.61..28768028.29 rows=3768 width=108) (actual time=51901.377..53247.436 rows=3055 loops=1)
   Group Key: track.user, track.start, track.end
   ->  Sort  (cost=28495760.61..28550204.73 rows=21777646 width=80) (actual time=51900.803..52624.699 rows=689231 loops=1)
         Sort Key: track.user, track.start, track.end
         Sort Method: external merge  Disk: 63488kB
         ->  Nested Loop  (cost=0.70..22938476.09 rows=21777646 width=80) (actual time=17.638..48055.263 rows=689231 loops=1)
               ->  Nested Loop  (cost=0.41..8420387.25 rows=46063701 width=72) (actual time=7.599..36250.753 rows=843071 loops=1)
                     ->  Seq Scan on area  (cost=0.00..6.75 rows=75 width=135) (actual time=0.028..0.892 rows=109 loops=1)
                     ->  Index Scan using point_idx on gps_point g  (cost=0.41..112267.42 rows=432 width=100) (actual time=3.387..320.934 rows=7735 loops=109)
                           Index Cond: (point @ area.polygon)
                           Filter: st_covers(area.polygon, point)
                           Rows Removed by Filter: 2048
               ->  Index Scan using track_user_start_idx on golfround gr  (cost=0.28..0.31 rows=1 width=76) (actual time=0.010..0.011 rows=1 loops=843071)
                     Index Cond: (((user)::text = (g.user)::text) AND (start <= g.create_time))
                     Filter: (g.create_time <= end)
                     Rows Removed by Filter: 2
 Planning Time: 12.674 ms
 Execution Time: 53259.611 ms

Lastly, some statistics about the tables:

Rows in track table: 4427
Average GPS points per track: 1000
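
(So gps_point holds on the order of 4,427 × 1,000 ≈ 4.4 million rows.)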

How can I improve this?


Solution

  • In order to make use of an EXISTS subquery, you would need to do a cross join between track and area. That sounds like a pretty horrible idea, but in this case I would say it is actually worth a try.

    select track.start, track."end", area.name
      from track cross join area
      where exists (select 1 from gps_point
          where gps_point.create_time between track.start and track."end"
          and gps_point."user" = track."user"
          and st_covers(area.polygon, gps_point.point)
      );
    
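    If you still need the original view's shape (one row per track with an array of area names), the same EXISTS test can be wrapped in an aggregate. An untested sketch along those lines:

    select track."user", track.start, track."end", array_agg(area.name)
      from track cross join area
      where exists (select 1 from gps_point
          where gps_point.create_time between track.start and track."end"
          and gps_point."user" = track."user"
          and st_covers(area.polygon, gps_point.point)
      )
      group by track."user", track.start, track."end";

    No distinct is needed inside array_agg here, because the cross join produces each (track, area) pair at most once.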

    Either query could be supported by a multi-column GiST index, something like:

    create index on gps_point using gist (point, "user", create_time)
    

    You will probably need the btree_gist extension to include the scalar columns in the index. It also isn't obvious which column order would be optimal; you could try different orders and see which one works best.
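
    Enabling the extension is a one-liner, and here is one alternative column order to benchmark (same table and columns as above; which order wins will depend on your data):

    create extension if not exists btree_gist;

    -- alternative ordering: scalar filter columns first, geometry last
    create index on gps_point using gist ("user", create_time, point);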