Search code examples
postgresqlpostgisknnspatial-indexlateral-join

KNN Across categories in postgis using indexing


I have a dataset of points of different types. For every point in the dataset I want to find the closest point in every category. I can achieve this but the compute time is very long and I'm struggling to get the query to use a spatial index for the KNN in tandem with the type information in an effiecient way.

Sample data generation

CREATE TYPE point_type AS ENUM ('1','2','3','4','5');

CREATE TABLE points AS
  SELECT ST_MakePoint(
    1000*random(),
    1000*random()
    )::geometry(Point) AS geom,
     ((random()*3)::int+1)::text::point_type  point_type,
         pk
  FROM generate_series(1,6000) pk;
update points
set point_type='5' where pk=999;

Index creation

create index points_geom_idx
    on points using gist (geom);

CREATE INDEX points_dual ON points (point_type, geom);

Query that works but is very slow but works

Because of distances KNN's being pulled first, then filtered by the type constraint after?

explain analyse
with types as (
select column1::point_type point_type from (
values
('1'), ('2'), ('3'), ('4'),('5')
       )
)
SELECT c1.point_type,
       c1.pk AS main_id,
       b.pk  AS secondary_id,
       c1.secondary_point_type,
       b.secondary_point_type,
       b.distance
FROM (SELECT c.point_type,
             c.pk,
             c.geom,
             types.point_type secondary_point_type
      FROM  points c
          join types on true
          ) c1

         LEFT JOIN LATERAL ( SELECT c2.point_type,
                                    c2.geom,
                                    c2.pk,
                                    c2.point_type secondary_point_type,
                                    c1.geom <->c2.geom AS distance
                             FROM points c2

         where c1.pk <> c2.pk          and c1.secondary_point_type=c2.point_type

                             ORDER BY distance
                             LIMIT 1)  b on true;

Query that is very fast but doesn't provide correct results I believe this is because it's just getting the closest point, and if that point isn't of the correct type, the join ultimately fails, so no data is joined, leaving nulls for most results

explain analyse
with types as (
select column1::point_type point_type from (
values
('1'), ('2'), ('3'), ('4'),('5')
       )
)
SELECT c1.point_type,
       c1.pk AS main_id,
       b.pk  AS secondary_id,
       c1.secondary_point_type,
       b.secondary_point_type,
       b.distance
FROM (SELECT c.point_type,
             c.pk,
             c.geom,
             types.point_type secondary_point_type
      FROM  points c
          join types on true
          ) c1

         LEFT JOIN LATERAL ( SELECT c2.point_type,
                                    c2.geom,
                                    c2.pk,
                                    c2.point_type secondary_point_type,
                                    c1.geom <->c2.geom AS distance
                             FROM points c2

         where c1.pk <> c2.pk
                             ORDER BY distance
                             LIMIT 1)  b on c1.secondary_point_type=b.secondary_point_type ;

I'm trying to achieve this query quickly, using the spatial index for all knn measures across all types. Thanks!

outputs for analyze first query:

QUERY PLAN
Sort  (cost=575654.98..575729.98 rows=30000 width=28) (actual time=237669.083..237681.135 rows=30000 loops=1)
"  Output: c.point_type, c.pk, c2.pk, ((""*VALUES*"".column1)::point_type), c2.point_type, ((c.geom <-> c2.geom))"
  Sort Key: c2.point_type DESC
  Sort Method: quicksort  Memory: 2409kB
  Buffers: shared hit=20818076 read=32
  ->  Nested Loop Left Join  (cost=0.15..573424.08 rows=30000 width=28) (actual time=3755.511..237506.242 rows=30000 loops=1)
"        Output: c.point_type, c.pk, c2.pk, (""*VALUES*"".column1)::point_type, c2.point_type, ((c.geom <-> c2.geom))"
        Buffers: shared hit=20818076 read=32
        ->  Nested Loop  (cost=0.00..499.07 rows=30000 width=72) (actual time=3750.773..3885.627 rows=30000 loops=1)
"              Output: c.point_type, c.pk, c.geom, ""*VALUES*"".column1"
              Buffers: shared hit=64
              ->  Seq Scan on public.points c  (cost=0.00..124.00 rows=6000 width=40) (actual time=0.125..21.764 rows=6000 loops=1)
                    Output: c.geom, c.point_type, c.pk
                    Buffers: shared hit=64
              ->  Materialize  (cost=0.00..0.09 rows=5 width=32) (actual time=0.002..0.009 rows=5 loops=6000)
"                    Output: ""*VALUES*"".column1"
"                    ->  Values Scan on ""*VALUES*""  (cost=0.00..0.06 rows=5 width=32) (actual time=0.011..0.036 rows=5 loops=1)"
"                          Output: ""*VALUES*"".column1"
        ->  Limit  (cost=0.15..19.07 rows=1 width=52) (actual time=7.779..7.780 rows=1 loops=30000)
              Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, ((c.geom <-> c2.geom))
              Buffers: shared hit=20818012 read=32
              ->  Index Scan using points_geom_idx on public.points c2  (cost=0.15..567.90 rows=30 width=52) (actual time=7.768..7.768 rows=1 loops=30000)
                    Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, (c.geom <-> c2.geom)
                    Order By: (c2.geom <-> c.geom)
"                    Filter: ((c.pk <> c2.pk) AND ((""*VALUES*"".column1)::point_type = c2.point_type))"
                    Rows Removed by Filter: 698
                    Buffers: shared hit=20818012 read=32
Settings: search_path = 'public, topology, tiger'
Planning:
  Buffers: shared hit=29 read=1
Planning Time: 8.699 ms
JIT:
  Functions: 12
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 38.104 ms, Inlining 761.660 ms, Optimization 1720.310 ms, Emission 1255.443 ms, Total 3775.517 ms
Execution Time: 238241.981 ms

Second query:

QUERY PLAN
Nested Loop  (cost=0.88..1197.38 rows=30000 width=28) (actual time=3.535..4538.832 rows=30000 loops=1)
"  Output: c.point_type, c.pk, b.pk, (""*VALUES*"".column1)::point_type, b.secondary_point_type, b.distance"
  Buffers: shared hit=36251
  ->  Seq Scan on public.points c  (cost=0.00..124.00 rows=6000 width=40) (actual time=0.095..4.897 rows=6000 loops=1)
        Output: c.geom, c.point_type, c.pk
        Buffers: shared hit=64
  ->  Hash Left Join  (cost=0.88..0.98 rows=5 width=48) (actual time=0.726..0.743 rows=5 loops=6000)
"        Output: ""*VALUES*"".column1, b.pk, b.secondary_point_type, b.distance"
"        Hash Cond: ((""*VALUES*"".column1)::point_type = b.secondary_point_type)"
        Buffers: shared hit=36187
"        ->  Values Scan on ""*VALUES*""  (cost=0.00..0.06 rows=5 width=32) (actual time=0.001..0.008 rows=5 loops=6000)"
"              Output: ""*VALUES*"".column1"
        ->  Hash  (cost=0.87..0.87 rows=1 width=16) (actual time=0.707..0.707 rows=1 loops=6000)
              Output: b.pk, b.secondary_point_type, b.distance
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              Buffers: shared hit=36187
              ->  Subquery Scan on b  (cost=0.15..0.87 rows=1 width=16) (actual time=0.701..0.703 rows=1 loops=6000)
                    Output: b.pk, b.secondary_point_type, b.distance
                    Buffers: shared hit=36187
                    ->  Limit  (cost=0.15..0.86 rows=1 width=52) (actual time=0.700..0.700 rows=1 loops=6000)
                          Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, ((c.geom <-> c2.geom))
                          Buffers: shared hit=36187
                          ->  Index Scan using points_geom_idx on public.points c2  (cost=0.15..4249.52 rows=5999 width=52) (actual time=0.695..0.695 rows=1 loops=6000)
                                Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, (c.geom <-> c2.geom)
                                Order By: (c2.geom <-> c.geom)
                                Filter: (c.pk <> c2.pk)
                                Rows Removed by Filter: 1
                                Buffers: shared hit=36187
Settings: search_path = 'public, topology, tiger'
Planning Time: 3.206 ms
Execution Time: 4549.481 ms

Answer that sped up by over a factor of 10

Thanks @jjanes and @Frank Heikens

CREATE INDEX points_dual2 ON points using gist (point_type, geom);
QUERY PLAN
Sort  (cost=131479.98..131554.98 rows=30000 width=28) (actual time=19459.793..19477.593 rows=30000 loops=1)
"  Output: c.point_type, c.pk, c2.pk, ((""*VALUES*"".column1)::point_type), c2.point_type, ((c.geom <-> c2.geom))"
  Sort Key: c2.point_type DESC
  Sort Method: quicksort  Memory: 2315kB
  Buffers: shared hit=123553
  ->  Nested Loop Left Join  (cost=0.15..129249.07 rows=30000 width=28) (actual time=146.494..19302.907 rows=30000 loops=1)
"        Output: c.point_type, c.pk, c2.pk, (""*VALUES*"".column1)::point_type, c2.point_type, ((c.geom <-> c2.geom))"
        Buffers: shared hit=123553
        ->  Nested Loop  (cost=0.00..499.07 rows=30000 width=72) (actual time=145.338..304.580 rows=30000 loops=1)
"              Output: c.point_type, c.pk, c.geom, ""*VALUES*"".column1"
              Buffers: shared hit=64
              ->  Seq Scan on public.points c  (cost=0.00..124.00 rows=6000 width=40) (actual time=0.069..15.665 rows=6000 loops=1)
                    Output: c.geom, c.point_type, c.pk
                    Buffers: shared hit=64
              ->  Materialize  (cost=0.00..0.09 rows=5 width=32) (actual time=0.001..0.009 rows=5 loops=6000)
"                    Output: ""*VALUES*"".column1"
"                    ->  Values Scan on ""*VALUES*""  (cost=0.00..0.06 rows=5 width=32) (actual time=0.015..0.040 rows=5 loops=1)"
"                          Output: ""*VALUES*"".column1"
        ->  Limit  (cost=0.15..4.27 rows=1 width=52) (actual time=0.621..0.622 rows=1 loops=30000)
              Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, ((c.geom <-> c2.geom))
              Buffers: shared hit=123489
              ->  Index Scan using points_dual2 on public.points c2  (cost=0.15..123.58 rows=30 width=52) (actual time=0.605..0.605 rows=1 loops=30000)
                    Output: NULL::point_type, NULL::geometry(Point), c2.pk, c2.point_type, (c.geom <-> c2.geom)
"                    Index Cond: (c2.point_type = (""*VALUES*"".column1)::point_type)"
                    Order By: (c2.geom <-> c.geom)
                    Filter: (c.pk <> c2.pk)
                    Rows Removed by Filter: 0
                    Buffers: shared hit=123489
Settings: search_path = 'public, topology, tiger'
Planning:
  Buffers: shared hit=24
Planning Time: 4.363 ms
JIT:
  Functions: 15
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 33.664 ms, Inlining 0.000 ms, Optimization 2.451 ms, Emission 141.230 ms, Total 177.345 ms
Execution Time: 19535.144 ms

Solution

  • Your 2nd index needs to be of type GiST, like your first one is. To do that, you need the help of the extension btree_gist.

    CREATE EXTENSION btree_gist;
    CREATE INDEX points_dual ON points using gist (point_type, geom);