I have a new table that looks like this, with 3e6 rows:
CREATE TABLE everything_crowberry (
    id            SERIAL PRIMARY KEY,
    group_id      INTEGER,
    group_type    group_type_name,
    epub_id       TEXT,
    reg_user_id   INTEGER,
    device_id     TEXT,
    campaign_id   INTEGER,
    category_name TEXT,
    instance_name TEXT,
    protobuf      TEXT,
    UNIQUE (group_id, group_type, reg_user_id, category_name, instance_name)
);
This generally makes sense for my context, and most queries are acceptably fast.
But what's not fast are queries like this:
analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=392177.29..392177.30 rows=1 width=4) (actual time=8909.698..8909.699 rows=1 loops=1)
-> Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.461..6347.272 rows=3198583 loops=1)
Planning time: 0.063 ms
Execution time: 8909.730 ms
(4 rows)
Time: 8910.110 ms
analytics_staging=> select count(distinct group_id) from everything_crowberry;
count
-------
481
Time: 8736.364 ms
I did create an index on group_id:

CREATE INDEX everything_crowberry_group_id ON everything_crowberry(group_id);

That index is used when group_id appears in a WHERE clause, but it is not used in the query above, so I conclude I've misunderstood something about how Postgres uses indexes. Note (from the query result above) that there are under 500 distinct group_ids.
Any pointers what I've misunderstood or how to make this particular query go faster?
To help with questions posed in the comments, I've added the suggested changes here. For future readers, I've included the detail to better understand how this gets debugged.
While experimenting, I noticed that most of the time is spent in the initial Aggregate node.
Turning off sequential scans makes this much worse:
analytics_staging=> set enable_seqscan = false;
analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=444062.28..444062.29 rows=1 width=4) (actual time=38927.323..38927.323 rows=1 loops=1)
-> Bitmap Heap Scan on everything_crowberry (cost=51884.99..436065.82 rows=3198583 width=4) (actual time=458.252..36167.789 rows=3198583 loops=1)
Heap Blocks: exact=35734 lossy=316446
-> Bitmap Index Scan on everything_crowberry_group (cost=0.00..51085.35 rows=3198583 width=0) (actual time=448.537..448.537 rows=3198583 loops=1)
Planning time: 0.064 ms
Execution time: 38927.971 ms
Time: 38930.328 ms
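(Note that enable_seqscan sticks for the rest of the session, so after an experiment like this it should be put back, or every later query in the session will also avoid sequential scans:)

SET enable_seqscan = false;
-- ... run the experiment ...
RESET enable_seqscan;  -- restore the planner default for this session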
Restricting the query to a smaller set of group_ids makes it worse still, whereas I might have thought it would be easier to count a smaller set of things.
analytics_staging=> explain analyze select count(distinct group_id) from everything_crowberry WHERE group_id > 380;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=385954.43..385954.44 rows=1 width=4) (actual time=13438.422..13438.422 rows=1 loops=1)
-> Bitmap Heap Scan on everything_crowberry (cost=18742.95..383451.68 rows=1001099 width=4) (actual time=132.571..12673.233 rows=986572 loops=1)
Recheck Cond: (group_id > 380)
Rows Removed by Index Recheck: 70816
Heap Blocks: exact=49632 lossy=79167
-> Bitmap Index Scan on everything_crowberry_group (cost=0.00..18492.67 rows=1001099 width=0) (actual time=120.816..120.816 rows=986572 loops=1)
Index Cond: (group_id > 380)
Planning time: 1.294 ms
Execution time: 13439.017 ms
(9 rows)
Time: 13442.603 ms
analytics_staging=> explain(analyze, buffers) select count(distinct group_id) from everything_crowberry;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=392177.29..392177.30 rows=1 width=4) (actual time=7329.775..7329.775 rows=1 loops=1)
Buffers: shared hit=16283 read=335912, temp read=4693 written=4693
-> Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.224..4615.015 rows=3198583 loops=1)
Buffers: shared hit=16283 read=335912
Planning time: 0.089 ms
Execution time: 7329.818 ms
Time: 7331.084 ms
Increasing work_mem from its default of 4 MB to 10 MB provides a modest improvement, from about 7300 ms to 5500 ms.
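(The temp read=4693 written=4693 buffers in the plan above suggest the aggregate is spilling to disk, which is governed by work_mem. Assuming that is the setting being tuned, a session-level change looks like:)

SET work_mem = '10MB';  -- default is 4MB; affects only this session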
analytics_staging=> EXPLAIN(analyze, buffers) SELECT group_id FROM everything_crowberry GROUP BY group_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=392177.29..392181.56 rows=427 width=4) (actual time=4686.525..4686.612 rows=481 loops=1)
Group Key: group_id
Buffers: shared hit=96 read=352099
-> Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.034..4017.122 rows=3198583 loops=1)
Buffers: shared hit=96 read=352099
Planning time: 0.094 ms
Execution time: 4686.686 ms
Time: 4687.461 ms
analytics_staging=> EXPLAIN(analyze, buffers) SELECT distinct group_id FROM everything_crowberry;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=392177.29..392181.56 rows=427 width=4) (actual time=5536.151..5536.262 rows=481 loops=1)
Group Key: group_id
Buffers: shared hit=128 read=352067
-> Seq Scan on everything_crowberry (cost=0.00..384180.83 rows=3198583 width=4) (actual time=0.030..4946.024 rows=3198583 loops=1)
Buffers: shared hit=128 read=352067
Planning time: 0.074 ms
Execution time: 5536.321 ms
Time: 5537.380 ms
analytics_staging=> SELECT count(*) FROM (SELECT 1 FROM everything_crowberry GROUP BY group_id) ec;
count
-------
481
Time: 4927.671 ms
analytics_production=> CREATE VIEW everything_crowberry_group_view AS select distinct group_id, group_type FROM everything_crowberry;
CREATE VIEW
analytics_production=> EXPLAIN(analyze, buffers) SELECT distinct group_id FROM everything_crowberry_group_view;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.56..357898.89 rows=200 width=4) (actual time=0.046..1976.882 rows=447 loops=1)
Buffers: shared hit=667230 read=109291 dirtied=108 written=988
-> Subquery Scan on everything_crowberry_group_view (cost=0.56..357897.19 rows=680 width=4) (actual time=0.046..1976.616 rows=475 loops=1)
Buffers: shared hit=667230 read=109291 dirtied=108 written=988
-> Unique (cost=0.56..357890.39 rows=680 width=8) (actual time=0.044..1976.378 rows=475 loops=1)
Buffers: shared hit=667230 read=109291 dirtied=108 written=988
-> Index Only Scan using everything_crowberry_group_id_group_type_reg_user_id_catego_key on everything_crowberry (cost=0.56..343330.63 rows=2911953 width=8) (actual time=0.043..1656.409 rows=2912005 loops=1)
Heap Fetches: 290488
Buffers: shared hit=667230 read=109291 dirtied=108 written=988
Planning time: 1.842 ms
Execution time: 1977.086 ms
For relatively few distinct values in group_id (many rows per group) - as in your case, with 3e6 rows but under 500 distinct group_ids - an index skip scan (a.k.a. loose index scan) would make this fast. That's not implemented up to Postgres 12, but you can work around the limitation with a recursive query:
Replace:
select count(distinct group_id) from everything_crowberry;
With:
WITH RECURSIVE cte AS (
   (SELECT group_id FROM everything_crowberry ORDER BY group_id LIMIT 1)
   UNION ALL
   SELECT (SELECT group_id FROM everything_crowberry
           WHERE  group_id > t.group_id ORDER BY group_id LIMIT 1)
   FROM   cte t
   WHERE  t.group_id IS NOT NULL
   )
SELECT count(group_id) FROM cte;
I use count(group_id) instead of the slightly faster count(*) to conveniently eliminate the NULL value produced by the final recursion step, as count(<expression>) only counts nonnull values. For the same reason, it does not matter whether group_id itself can be NULL: a NULL group_id is not counted by your original query anyway.
The query can use the index you already have:
CREATE INDEX everything_crowberry_group_id ON everything_crowberry(group_id);
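To see why the recursive query is so much faster, here is a rough Python emulation of the loose index scan it performs (loose_scan_distinct_count is an invented name, and bisect_right stands in for a B-tree index probe; this is an illustration, not how Postgres is implemented):

```python
import bisect

def loose_scan_distinct_count(sorted_values):
    """Emulate the recursive CTE: start at the smallest value, then
    repeatedly jump to the first position whose value is strictly
    greater than the current one. Each jump is one "index probe", so
    total work scales with the number of distinct values, not rows."""
    count = 0
    i = 0  # position of the first row with the current value
    n = len(sorted_values)
    while i < n:
        count += 1
        # "index probe": first position past all rows with this value
        i = bisect.bisect_right(sorted_values, sorted_values[i], i)
    return count

# Many rows, few distinct values - the shape of this question:
rows = sorted([1, 1, 1, 2, 2, 5, 5, 5, 9])
print(loose_scan_distinct_count(rows))  # -> 4
```

With 3e6 rows and under 500 distinct group_ids, that is roughly 500 cheap index probes instead of one pass over 3e6 rows.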
For relatively many distinct values in group_id (few rows per group) - or for small tables - plain DISTINCT will be faster. It is typically fastest when done in a subquery, as opposed to an added clause in count():

SELECT count(group_id)  -- or just count(*) to include a possible NULL value
FROM  (SELECT DISTINCT group_id FROM everything_crowberry) sub;