Search code examples
postgresqlindexingpostgresql-9.4postgresql-performance

Postgres multi-column index (integer, boolean, and array)


I have a Postgres 9.4 database with a table like this:

| id | other_id | current | dn_ids                                | rank |
|----|----------|---------|---------------------------------------|------|
| 1  | 5        | F       | {123,234,345,456,111,222,333,444,555} | 1    |
| 2  | 7        | F       | {123,100,200,900,800,700,600,400,323} | 2    |

(update) I already have a couple indexes defined. Here is the CREATE TABLE syntax:

CREATE TABLE mytable (
    id integer NOT NULL,
    other_id integer,
    rank integer,
    current boolean DEFAULT false,
    dn_ids integer[] DEFAULT '{}'::integer[]
);

CREATE SEQUENCE mytable_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1;

ALTER TABLE ONLY mytable ALTER COLUMN id SET DEFAULT nextval('mytable_id_seq'::regclass);
ALTER TABLE ONLY mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);

CREATE INDEX ind_dn_ids ON mytable USING gin (dn_ids);
CREATE INDEX index_mytable_on_current ON mytable USING btree (current);
CREATE INDEX index_mytable_on_other_id ON mytable USING btree (other_id);
CREATE INDEX index_mytable_on_other_id_and_current ON mytable USING btree (other_id, current);

I need to optimize queries like this:

SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = F AND NOT (ARRAY[100,200] && dn_ids)
ORDER BY rank ASC
LIMIT 500 OFFSET 1000

This query works fine, but I'm sure it could be much faster with smart indexing. There are about 250,000 rows in the table and I always have current = F as a predicate. The input array I'm comparing to the stored array will have 1-9 integers, as well. The other_id can vary. But generally, before limiting, the scan will match between 0-25,000 rows.

Here's an example EXPLAIN:

Limit  (cost=36944.53..36945.78 rows=500 width=65)
  ->  Sort  (cost=36942.03..37007.42 rows=26156 width=65)
        Sort Key: rank
        ->  Seq Scan on mytable  (cost=0.00..35431.42 rows=26156 width=65)
              Filter: ((NOT current) AND (NOT ('{-1,35257,35314}'::integer[] && dn_ids)) AND (other_id = 193))

Other answers on this site and the Postgres docs suggest it's possible to add a compound index to improve performance. I already have one on [other_id, current]. I've also read in various places that indexing can improve the performance of the ORDER BY in addition to the WHERE clause.

  1. What's the right type of compound index to use for this query? I don't care about space at all.

  2. Does it matter much how I order the terms in the WHERE clause?


Solution

    1. What's the right type of compound index to use for this query? I don't care about space at all.

    This depends on the complete situation. Either way, the GIN index you already have is most probably superior to a GiST index in your case:

    You can combine either with integer columns once you install the additional module btree_gin (or btree_gist, respectively).

    However, that does not cover the boolean data type, which typically doesn't make sense as index column to begin with. With just two (three incl. NULL) possible values it's not selective enough.

    And a plain btree index is more efficient for integer. While a multicolumn btree index on two integer columns would certainly help, you'll have to test carefully if combining (other_id, dn_ids) in a multicolumn GIN index is worth more than it costs. Probably not. Postgres can combine multiple indexes in a bitmap index scan rather efficiently.

    Finally, while indexes can be used for sorted output, this will probably not pay to apply for a query like you display (unless you select large parts of the table).
    Not applicable to updated question.

    Partial indexes might be an option. Other than that, you already have all the indexes you need.

    I would drop the pointless index on the boolean column current completely, and the index on just rank is probably never used for this query.

    1. Does it matter much how I order the terms in the WHERE clause?

    The order of WHERE conditions is completely irrelevant.

    Addendum after question update

    The utility of indexes is bound to selective criteria. If more than roughly 5 % (depends on various factors) of the table are selected, a sequential scan of the whole table is typically faster than dealing with the overhead on any indexes - except for pre-sorting output, that's the one thing an index is still good for in such cases.

    For a query that fetches 25,000 of 250,000 rows, indexes are mostly just for that - which gets all the more interesting if you attach a LIMIT clause. Postgres can stop fetching rows from an index once the LIMIT is satisfied.

    Be aware that Postgres always needs to read OFFSET + LIMIT rows, so performance deteriorate with the sum of both.

    Even with your added information, much of what's relevant is still in the dark. I am going to assume that:

    1. Your predicate NOT (ARRAY[100,200] && dn_ids) is not very selective. Ruling out 1 to 10 ID values should typically retain the majority of rows unless you have very few distinct elements in dn_ids.
    2. The most selective predicate is other_id = 5.
    3. A substantial part of the rows is eliminated with NOT current.
      Aside: current = F isn't valid syntax in standard Postgres. Must be NOT current or current = FALSE;

    While a GIN index would be great to identify few rows with matching arrays faster than any other index type, this seems hardly relevant for your query. My best guess is this partial, multicolumn btree index:

    CREATE INDEX foo ON mytable (other_id, rank, dn_ids)
    WHERE NOT current;
    

    The array column dn_ids in a btree index cannot support the && operator, I just include it to allow index-only scans and filter rows before accessing the heap (the table). May even be faster without dn_ids in the index:

    CREATE INDEX foo ON mytable (other_id, rank) WHERE NOT current;
    

    GiST indexes may become more interesting in Postgres 9.5 due to this new feature:

    Allow GiST indexes to perform index-only scans (Anastasia Lubennikova, Heikki Linnakangas, Andreas Karlsson)

    Aside: current is a reserved word in standard SQL, even if it's allowed as identifier in Postgres.
    Aside 2: I assume id is an actual serial column with the column default set. Just creating a sequence like you demonstrate, would do nothing.