Search code examples
sqlpostgresqlindexinggreatest-n-per-grouppostgresql-performance

Get values appearing at least N times in a table quickly


I have a Postgres 10.10 database with a table of more than 6 million rows and the following definition:

create table users (
    id              bigserial primary key,
    user_id         text      unique,
    username        text,
    first_name      text,
    last_name       text,
    language_code   text,
    gender          text,
    first_seen      timestamp with time zone,
    last_seen       timestamp with time zone,
    search_language text,
    age             text
);

create index users_language_code_idx on users (language_code);
create index users_last_seen_idx     on users (last_seen);
create index users_first_seen_idx1   on users (first_seen);
create index users_age_idx           on users (age);
create index users_last_seen_age_idx on users (last_seen, age);

And I have a query to fetch popular language codes with more than 100 users:

SELECT language_code FROM users
GROUP BY language_code
HAVING count(*) > 100;

At some point this query started to take a huge amount of time to finish (~10 minutes). Btree index on language_code didn't help. What else can I do to improve the performance?

Here's explain analyze output:

https://explain.depesz.com/s/j2ga

Finalize GroupAggregate  (cost=7539479.67..7539480.34 rows=27 width=3) (actual time=620744.389..620744.458 rows=24 loops=1)
  Group Key: language_code
  Filter: (count(*) > 100)
  Rows Removed by Filter: 60
  ->  Sort  (cost=7539479.67..7539479.80 rows=54 width=11) (actual time=620744.359..620744.372 rows=84 loops=1)
        Sort Key: language_code
        Sort Method: quicksort  Memory: 28kB
        ->  Gather  (cost=7539472.44..7539478.11 rows=54 width=11) (actual time=620744.038..620744.727 rows=84 loops=1)
              Workers Planned: 2
              Workers Launched: 0
              ->  Partial HashAggregate  (cost=7538472.44..7538472.71 rows=27 width=11) (actual time=620743.596..620743.633 rows=84 loops=1)
                    Group Key: language_code
                    ->  Parallel Seq Scan on users  (cost=0.00..7525174.96 rows=2659496 width=3) (actual time=0.377..616632.155 rows=6334894 loops=1)
Planning time: 0.194 ms
Execution time: 620745.276 ms

Solution

  • You can make good use of the index on (language_code) with an emulated index skip scan:

    WITH RECURSIVE cte AS (
       SELECT min(language_code) AS language_code
       FROM   users
       
       UNION ALL
       SELECT (SELECT language_code
               FROM   users
               WHERE  language_code > c.language_code
               ORDER  BY language_code
               LIMIT  1)
       FROM   cte c
       WHERE  c.language_code IS NOT NULL
       )
    SELECT language_code
    FROM   cte c
    JOIN   LATERAL (
       SELECT count(*) AS ct
       FROM  (
          SELECT -- can stay empty
          FROM   users
          WHERE  language_code = c.language_code 
          LIMIT  101
          ) sub
       ) u ON ct > 100  -- "more than 100"
    WHERE  language_code IS NOT NULL;
    

    db<>fiddle here

    Given your numbers (6M rows, but only a hand full of distinct language codes), this should perform faster by orders of magnitude.

    The first part - the recursive CTE (rCTE) named cte - produces the set of distinct language_code in the table (except NULL). A table with distinct language codes could replace that part to be faster, yet. (It might be a good idea to maintain such a table and enforce referential integrity with a FK constraint to it ...)

    The second part only looks at a maximum of 101 rows (your threshold) per language code. This way we avoid the expensive sequential scan over the whole big table.

    If your table is "vacuumed" enough, you should see index-only scans exclusively.

    Upgrading to the current version Postgres 13 should help some more due to the newly introduced index deduplication that should make said index substantially smaller (as it's highly duplicative).

    Sadly, automatic index skip scans did not make it into version 13. Maybe Postgres 14. But above emulation should be almost as good.

    Further reading (with detailed explanation for above query technique):