postgresql indexing database-design postgresql-performance

Postgres using two indexes for WHERE clause in parallel

I have table which has a few billion rows. There are two columns in the table:

match_id uuid,
group_id integer

And there are indexes created on both of the above columns:

create index if not exists my_tbl_match_id_idx on my_tbl (match_id);
create index if not exists my_tbl_group_id_idx on my_tbl (group_id);

When I perform following query, both indexes are scanned in parallel. match_id index look up is much faster and returns result back in 13 milliseconds while group_id index look up is slower and takes 1.3 seconds. Because both indexes are looked up in parallel, overall query time = 1.3 seconds i.e. look up time of group_id index.

Is there any way Postgres can just use group_id index after the match_id result set is computed?

SELECT *
FROM my_table
WHERE match_id = 'e089e0af-543b-45d5-abbf-22c6c5ed9a01'
AND (group_id IN (167,1704,1864,2065,2145,3812,3814,3855,7462,11393,11394,11396))

Or do I need to create a composite index for match_id and group_id?

Solution

There are many factors with bearing on the best course of action, most of which are not clear from the question. Consider instructions for postgresql-performance questions here.

Generally speaking, a multicolumn index on (match_id, group_id) (with leading match_id!) performs substantially better for your query than combining two separate indexes in bitmap index scans.

Multicolumn index and performance

In effect, this index implements your requirement exactly:

Is there any way Postgres can just use group_id index after the match_id result set is computed?

A multicolumn index is sorted by leading expressions first. Using this index, Postgres excludes the lion share of irrelevant matches on group_id for your query. Should be particularly impactful for your case since match_id seems to be much more selective.

See:

This index also covers queries on just match_id like you mentioned in your comment. Since match_id is the more selective and bigger column (uuid = 16 bytes, int = 4 bytes), the multicolumn index isn't much bigger than the one on just (match_id), and should work almost as well for the purpose, making the overhead for maintaining a second index questionable. The other index on just (group_id) may still serve its purpose for queries on just group_id.

See:

Is a composite index also good for queries on the first field?

Other advice

Your query can benefit a lot from physically clustering rows on match_id. Use CLUSTER or one of the less blocking community tools. See:

Only reasonable if there isn't too much write activity on the table (or large parts of it).

Also, if you don't really need SELECT *, but only a small subset of columns, there is potential for optimization.