sql postgresql indexing query-optimization

For each array element, count rows whose array column contains it

In table users, there is a column user_ids varchar[] with GIN index. Each user_id of column user_ids in each row is unique.

I want to count the number of rows that contain user_id for each user_id in an input list.

I try using this query but EXPLAIN shows it doesn't use GIN index:

SELECT user_id, count(*) + 1 COUNT
FROM (
    SELECT unnest(user_ids) user_id
    FROM users 
    WHERE user_ids && ARRAY['1', '2', '3']::varchar[]
) sub
WHERE user_id = ANY(ARRAY['1', '2', '3']::varchar[])
GROUP BY user_id;

Is there any way to make use of GIN index here?

Solution

For small input arrays and limited overlap between rows, this query should be as fast as it gets:

SELECT i.user_id, u.*
FROM   unnest('{1, 2, 3}'::varchar[]) i(user_id)  -- your input here
CROSS  JOIN LATERAL (
   SELECT count(*) AS count
   FROM   users u
   WHERE  u.user_ids @> ARRAY[i.user_id]  -- !
   ) sub;

Notable difference: my query includes elements of the input with 0 matches, your original does not.

The array operator @> expects array types as left and right operand (even if it's just a single element to the right like in this case).

Array operators can use a GIN index on (user_ids). The subquery in your original query also qualifies for index usage!

See:

Why does my query plan not use the GIN index?

Maybe Postgres decides for a different query plan because the filter is not selective enough, or column statistics are outdated?

To determine whether the index is applicable at all, run a test with this local setting (only in your session!):

SET enable_seqscan = off;

EXPLAIN ....

This forces a plan without sequential scan if at all possible.