postgresql, indexing, group-by, query-optimization, aggregate-functions

PostgreSQL multi-column group by not using index when selecting minimum


When selecting MIN on a column in PostgreSQL (11, 12, 13) after a GROUP BY on multiple columns, an index created on the grouped columns is not used: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=30e0f341940f4c1fa6013677643a0baf

CREATE TABLE tags (id serial, series int, index int, page int);
CREATE INDEX ON tags (page, series, index);

INSERT INTO tags (series, index, page)
SELECT
    ceil(random() * 10),
    ceil(random() * 100),
    ceil(random() * 1000)
FROM generate_series(1, 100000);

EXPLAIN ANALYZE
SELECT tags.page, tags.series, MIN(tags.index)
FROM tags GROUP BY tags.page, tags.series;

HashAggregate  (cost=2291.00..2391.00 rows=10000 width=12) (actual time=108.968..133.153 rows=9999 loops=1)
  Group Key: page, series
  Batches: 1  Memory Usage: 1425kB
  ->  Seq Scan on tags  (cost=0.00..1541.00 rows=100000 width=12) (actual time=0.015..55.240 rows=100000 loops=1)
Planning Time: 0.257 ms
Execution Time: 133.771 ms

Theoretically, the index should allow the database to seek in steps of (tags.page, tags.series) instead of performing a full scan (a so-called loose index scan or skip scan). This would result in 10,000 processed rows for the above dataset instead of 100,000. This link describes the method with no grouped columns.
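
PostgreSQL (through version 13) does not perform such a skip scan natively, but it can be emulated with a recursive CTE. Below is a minimal sketch (mine, not from the question), assuming the (page, series, index) index above: each recursion step jumps to the first row of the next (page, series) group, and because rows within a group are index-ordered by index, that first row already carries the group minimum.

WITH RECURSIVE skip AS (
    -- anchor: first row overall = minimum index of the first (page, series) group
    (SELECT page, series, index FROM tags
     ORDER BY page, series, index LIMIT 1)
    UNION ALL
    -- recursive step: jump to the first row of the next group; the row-wise
    -- comparison (page, series) > (...) can be satisfied from the index
    SELECT n.page, n.series, n.index
    FROM skip s
    CROSS JOIN LATERAL (
        SELECT page, series, index FROM tags
        WHERE (page, series) > (s.page, s.series)
        ORDER BY page, series, index LIMIT 1
    ) n
)
SELECT page, series, index AS min_index FROM skip;

This should touch roughly one index tuple per group (about 10,000) rather than all 100,000 rows.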

This answer (as well as this one) suggests using DISTINCT ON with an ordering instead of GROUP BY.
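
The suggested query would look something like this (reconstructed here, since the linked answers are not quoted):

SELECT DISTINCT ON (tags.page, tags.series)
    tags.page, tags.series, tags.index
FROM tags
ORDER BY tags.page, tags.series, tags.index;

That produces this query plan: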

Unique  (cost=0.42..5680.42 rows=10000 width=12) (actual time=0.066..268.038 rows=9999 loops=1)
  ->  Index Only Scan using tags_page_series_index_idx on tags  (cost=0.42..5180.42 rows=100000 width=12) (actual time=0.064..227.219 rows=100000 loops=1)
        Heap Fetches: 100000
Planning Time: 0.426 ms
Execution Time: 268.712 ms

While the index is now being used, the scan still visits the full set of rows (note the 100,000 heap fetches). When SET enable_seqscan = off is used, the GROUP BY query degrades to the same behaviour.
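
For reference, that experiment is just the following (plan output omitted; as noted, it is the same full pass over the rows, only through the index):

SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT tags.page, tags.series, MIN(tags.index)
FROM tags GROUP BY tags.page, tags.series;

RESET enable_seqscan;  -- restore the default planner behaviour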

How can I encourage PostgreSQL to use the multi-column index?


Solution

  • If you can pull the set of distinct (page, series) pairs from another table, then you can hack it with a lateral join:

    CREATE TABLE pageseries AS
        SELECT DISTINCT page, series FROM tags ORDER BY page, series;

    EXPLAIN ANALYZE
    SELECT p.*, minindex
    FROM pageseries p
    CROSS JOIN LATERAL (
        SELECT index AS minindex FROM tags t
        WHERE t.page = p.page AND t.series = p.series
        ORDER BY page, series, index LIMIT 1
    ) x;
     Nested Loop  (cost=0.42..8720.00 rows=10000 width=12) (actual time=0.039..56.013 rows=10000 loops=1)
       ->  Seq Scan on pageseries p  (cost=0.00..145.00 rows=10000 width=8) (actual time=0.012..1.872 rows=10000 loops=1)
       ->  Limit  (cost=0.42..0.84 rows=1 width=12) (actual time=0.005..0.005 rows=1 loops=10000)
             ->  Index Only Scan using tags_page_series_index_idx on tags t  (cost=0.42..4.62 rows=10 width=12) (actual time=0.004..0.004 rows=1 loops=10000)
                   Index Cond: ((page = p.page) AND (series = p.series))
                   Heap Fetches: 0
     Planning Time: 0.168 ms
     Execution Time: 57.077 ms
    

    ...but it is not necessarily faster:

    EXPLAIN ANALYZE
    SELECT tags.page, tags.series, MIN(tags.index)
    FROM tags GROUP BY tags.page, tags.series;
    
     HashAggregate  (cost=2291.00..2391.00 rows=10000 width=12) (actual time=56.177..58.923 rows=10000 loops=1)
       Group Key: page, series
       Batches: 1  Memory Usage: 1425kB
       ->  Seq Scan on tags  (cost=0.00..1541.00 rows=100000 width=12) (actual time=0.010..12.845 rows=100000 loops=1)
     Planning Time: 0.129 ms
     Execution Time: 59.644 ms
    

    It would be massively faster IF the number of iterations in the nested loop were small, in other words if there were only a few distinct (page, series) pairs. I'll try with series alone, since that has only 10 distinct values:

    CREATE TABLE series AS SELECT DISTINCT series FROM tags;

    -- the plan below uses tags_series_index_idx, so an index on
    -- (series, index) must exist too:
    CREATE INDEX ON tags (series, index);

    EXPLAIN ANALYZE
    SELECT p.*, minindex
    FROM series p
    CROSS JOIN LATERAL (
        SELECT index AS minindex FROM tags t
        WHERE t.series = p.series
        ORDER BY series, index LIMIT 1
    ) x;
     Nested Loop  (cost=0.29..886.18 rows=2550 width=8) (actual time=0.081..0.264 rows=10 loops=1)
       ->  Seq Scan on series p  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.007..0.010 rows=10 loops=1)
       ->  Limit  (cost=0.29..0.31 rows=1 width=8) (actual time=0.024..0.024 rows=1 loops=10)
             ->  Index Only Scan using tags_series_index_idx on tags t  (cost=0.29..211.29 rows=10000 width=8) (actual time=0.023..0.023 rows=1 loops=10)
                   Index Cond: (series = p.series)
                   Heap Fetches: 0
     Planning Time: 0.198 ms
     Execution Time: 0.292 ms
    

    In this case it is definitely worth it, because the query hits only 10 of 100,000 rows. The other queries hit 10,000 of 100,000 rows, or 10% of the table, which is above the threshold where an index really helps.

    Note that putting the column with the lower cardinality first results in a smaller index:

    CREATE INDEX ON tags (series, page, index);

    SELECT pg_relation_size('tags_page_series_index_idx');
     4284416
    SELECT pg_relation_size('tags_series_page_index_idx');
     3104768
    

    ...but it doesn't make the query any faster.

    If this type of query is really performance-critical, perhaps try ClickHouse or DolphinDB.
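
    For instance, a minimal ClickHouse sketch (my illustration, not part of the original answer): a MergeTree table stores rows sorted by its ORDER BY key, so the aggregate reads data already clustered by (page, series):

    CREATE TABLE tags (series Int32, `index` Int32, page Int32)
    ENGINE = MergeTree ORDER BY (page, series, `index`);
    -- `index` is backquoted because INDEX is a keyword in ClickHouse DDL

    SELECT page, series, min(`index`) AS min_index
    FROM tags
    GROUP BY page, series;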