I have a Postgres 10.10 database with a table of more than 6 million rows and the following definition:
create table users (
    id bigserial primary key,
    user_id text unique,
    username text,
    first_name text,
    last_name text,
    language_code text,
    gender text,
    first_seen timestamp with time zone,
    last_seen timestamp with time zone,
    search_language text,
    age text
);
create index users_language_code_idx on users (language_code);
create index users_last_seen_idx on users (last_seen);
create index users_first_seen_idx1 on users (first_seen);
create index users_age_idx on users (age);
create index users_last_seen_age_idx on users (last_seen, age);
And I have a query to fetch popular language codes with more than 100 users:
SELECT language_code FROM users
GROUP BY language_code
HAVING count(*) > 100;
At some point this query started to take a huge amount of time to finish (~10 minutes). The B-tree index on language_code didn't help. What else can I do to improve the performance?
Here's the EXPLAIN ANALYZE output: https://explain.depesz.com/s/j2ga
Finalize GroupAggregate  (cost=7539479.67..7539480.34 rows=27 width=3) (actual time=620744.389..620744.458 rows=24 loops=1)
  Group Key: language_code
  Filter: (count(*) > 100)
  Rows Removed by Filter: 60
  ->  Sort  (cost=7539479.67..7539479.80 rows=54 width=11) (actual time=620744.359..620744.372 rows=84 loops=1)
        Sort Key: language_code
        Sort Method: quicksort  Memory: 28kB
        ->  Gather  (cost=7539472.44..7539478.11 rows=54 width=11) (actual time=620744.038..620744.727 rows=84 loops=1)
              Workers Planned: 2
              Workers Launched: 0
              ->  Partial HashAggregate  (cost=7538472.44..7538472.71 rows=27 width=11) (actual time=620743.596..620743.633 rows=84 loops=1)
                    Group Key: language_code
                    ->  Parallel Seq Scan on users  (cost=0.00..7525174.96 rows=2659496 width=3) (actual time=0.377..616632.155 rows=6334894 loops=1)
Planning time: 0.194 ms
Execution time: 620745.276 ms
You can make good use of the index on (language_code) with an emulated index skip scan:
WITH RECURSIVE cte AS (
   SELECT min(language_code) AS language_code   -- start at the smallest code
   FROM   users
   UNION ALL
   SELECT (SELECT language_code                 -- jump to the next bigger code
           FROM   users
           WHERE  language_code > c.language_code
           ORDER  BY language_code
           LIMIT  1)
   FROM   cte c
   WHERE  c.language_code IS NOT NULL
   )
SELECT language_code
FROM   cte c
JOIN   LATERAL (
   SELECT count(*) AS ct
   FROM  (
      SELECT       -- can stay empty
      FROM   users
      WHERE  language_code = c.language_code
      LIMIT  101   -- never read more than 101 rows per code
      ) sub
   ) u ON ct > 100  -- "more than 100"
WHERE  language_code IS NOT NULL;
db<>fiddle here
Given your numbers (6M rows, but only a handful of distinct language codes), this should be faster by orders of magnitude.
The first part, the recursive CTE (rCTE) named cte, produces the set of distinct language_code values in the table (except NULL). A table of distinct language codes could replace that part and be faster still. (It might be a good idea to maintain such a table and enforce referential integrity with an FK constraint to it ...)
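A minimal sketch of that idea (the table name languages and the constraint name are hypothetical, not part of your schema):

-- hypothetical lookup table holding one row per distinct language code
CREATE TABLE languages (
   language_code text PRIMARY KEY
);

INSERT INTO languages
SELECT DISTINCT language_code
FROM   users
WHERE  language_code IS NOT NULL;

-- enforce referential integrity: users may only reference known codes
-- (NULL values still pass the FK check)
ALTER TABLE users
   ADD CONSTRAINT users_language_code_fkey
   FOREIGN KEY (language_code) REFERENCES languages (language_code);

With that in place, the rCTE above could be replaced by a plain SELECT language_code FROM languages.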
The second part only looks at a maximum of 101 rows (your threshold) per language code. This way we avoid the expensive sequential scan over the whole big table.
If your table is "vacuumed" enough, you should see index-only scans exclusively.
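One way to verify, using the index from your schema (a plain VACUUM is what updates the visibility map, which index-only scans depend on):

VACUUM users;   -- update the visibility map

EXPLAIN (ANALYZE)
SELECT language_code
FROM   users
WHERE  language_code > 'en'
ORDER  BY language_code
LIMIT  1;
-- expect an "Index Only Scan using users_language_code_idx"
-- with "Heap Fetches: 0" (or close to it)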
Upgrading to the current version, Postgres 13, should help some more due to the newly introduced B-tree index deduplication, which should make said index substantially smaller (as it is highly duplicative).
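You can measure the effect after the upgrade. An index carried over by pg_upgrade typically only benefits from deduplication once it has been rebuilt:

SELECT pg_size_pretty(pg_relation_size('users_language_code_idx'));  -- size before

REINDEX INDEX users_language_code_idx;  -- rebuild so deduplication kicks in

SELECT pg_size_pretty(pg_relation_size('users_language_code_idx'));  -- size after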
Sadly, automatic index skip scans did not make it into version 13. Maybe Postgres 14. But the above emulation should be almost as good.
Further reading (with a detailed explanation of the above query technique):