Search code examples
sqlpostgresqlindexingquery-optimization

For each array element, count rows whose array column contains it


In table users, there is a column user_ids varchar[] with GIN index. Each user_id of column user_ids in each row is unique.

I want to count the number of rows that contain user_id for each user_id in an input list.

I try using this query but EXPLAIN shows it doesn't use GIN index:

SELECT user_id, count(*) + 1 COUNT
FROM (
    SELECT unnest(user_ids) user_id
    FROM users 
    WHERE user_ids && ARRAY['1', '2', '3']::varchar[]
) sub
WHERE user_id = ANY(ARRAY['1', '2', '3']::varchar[])
GROUP BY user_id;

Is there any way to make use of GIN index here?


Solution

  • For small input arrays and limited overlap between rows, this query should be as fast as it gets:

    SELECT i.user_id, u.*
    FROM   unnest('{1, 2, 3}'::varchar[]) i(user_id)  -- your input here
    CROSS  JOIN LATERAL (
       SELECT count(*) AS count
       FROM   users u
       WHERE  u.user_ids @> ARRAY[i.user_id]  -- !
       ) sub;
    

    Notable difference: my query includes elements of the input with 0 matches, your original does not.

    The array operator @> expects array types as left and right operand (even if it's just a single element to the right like in this case).

    Array operators can use a GIN index on (user_ids). The subquery in your original query also qualifies for index usage!

    See:

    Why does my query plan not use the GIN index?

    Maybe Postgres decides for a different query plan because the filter is not selective enough, or column statistics are outdated?

    To determine whether the index is applicable at all, run a test with this local setting (only in your session!):

    SET enable_seqscan = off;
    
    EXPLAIN ....
    

    This forces a plan without sequential scan if at all possible.