arrays postgresql indexing postgresql-9.5 any

Index usage with single item ANY(ARRAY[...])

Test table and indexes (PostgreSQL 9.5.3):

CREATE TABLE public.t (id serial, a integer, b integer);

INSERT INTO t(a, b) 
SELECT round(random()*1000), round(random()*1000)
FROM generate_series(1, 1000000);

CREATE INDEX "i_1" ON public.t USING btree (a, b);
CREATE INDEX "i_2" ON public.t USING btree (b);

If "a = 50" in the first query, everything is ok, appropriate index "i_1" is used:

SELECT * FROM t WHERE a = 50 ORDER BY b LIMIT 1

"Limit  (cost=0.42..4.03 rows=1 width=12) (actual time=0.085..0.085 rows=1 loops=1)"
"  Buffers: shared hit=1 read=3"
"  ->  Index Scan using i_1 on t  (cost=0.42..4683.12 rows=1300 width=12) (actual time=0.084..0.084 rows=1 loops=1)"
"        Index Cond: (a = 50)"
"        Buffers: shared hit=1 read=3"
"Planning time: 0.637 ms"
"Execution time: 0.114 ms"

With "a IN (50)" result is the same:

SELECT * FROM t WHERE a IN (50) ORDER BY b LIMIT 1

"Limit  (cost=0.42..4.03 rows=1 width=12) (actual time=0.058..0.058 rows=1 loops=1)"
"  Buffers: shared hit=4"
"  ->  Index Scan using i_1 on t  (cost=0.42..4683.12 rows=1300 width=12) (actual time=0.056..0.056 rows=1 loops=1)"
"        Index Cond: (a = 50)"
"        Buffers: shared hit=4"
"Planning time: 0.287 ms"
"Execution time: 0.105 ms"

The problem is when I try to use "a = ANY(ARRAY[50])". Wrong index "i_2" is used instead of "i_1" and execution time becomes x25 longer:

SELECT * FROM t WHERE a = ANY(ARRAY[50]) ORDER BY b LIMIT 1

"Limit  (cost=0.42..38.00 rows=1 width=12) (actual time=2.591..2.591 rows=1 loops=1)"
"  Buffers: shared hit=491 read=4"
"  ->  Index Scan using i_2 on t  (cost=0.42..48853.65 rows=1300 width=12) (actual time=2.588..2.588 rows=1 loops=1)"
"        Filter: (a = ANY ('{50}'::integer[]))"
"        Rows Removed by Filter: 520"
"        Buffers: shared hit=491 read=4"
"Planning time: 0.251 ms"
"Execution time: 2.627 ms"

You can say: "PostgreSQL can't use index if you use ANY(ARRAY[])", but actually it can. If I remove "ORDER BY" it works again:

SELECT * FROM t WHERE a = ANY(ARRAY[50]) LIMIT 1

"Limit  (cost=0.42..4.03 rows=1 width=12) (actual time=0.034..0.034 rows=1 loops=1)"
"  Buffers: shared hit=4"
"  ->  Index Scan using i_1 on t  (cost=0.42..4683.12 rows=1300 width=12) (actual time=0.033..0.033 rows=1 loops=1)"
"        Index Cond: (a = ANY ('{50}'::integer[]))"
"        Buffers: shared hit=4"
"Planning time: 0.182 ms"
"Execution time: 0.090 ms"

My questions:

If PostgreSQL is smart enough to work well with "IN", what's the problem with ANY(ARRAY[])?
Why does it work with ANY(ARRAY[]) if I remove "ORDER BY" clause?

Solution

PostgreSQL is not smart enough to figure out that a =ANY(ARRAY[50]) is the same as a = 50.
It does not check if the array contains only one element.

Processing of IN lists works like this (see transformAExprIn in src/backend/parser/parse_expr.c):

All the elements of the IN list are checked if they are constants or not (see here).
If there is more than one constant, and they all can be coerced to the same type, an =ANY expression is built (see here).
The expression from 2. (if any) and the remaining elements of the IN list are built into an OR expression (see here).

So for example the expression x IN (1, 2, mycol) will end up as x = ANY ('{2,3}'::integer[])) OR (x = mycol)), and x IN (42) will become x = 42 (there is no OR because the list has only one element).

On the other hand, =ANY expressions are handled in make_scalar_array_op in src/backend/parser/parse_oper.c. There is no attempt made to handle single-element arrays specially.

PostgreSQL can use an index scan for an =ANY condition, but unless the array has only one element, the result will not be ordered by b, so it will have to sort the result. Moreover, since the index scan cannot stop after the first result, it will have to scan many more rows.

To see why such a result will not be ordered, consider the following example:

This is part of the index:

 a  |  b
----+----
... | ...
 49 | 812
 50 |   1
 50 |   2
 50 | 595
 50 | 973
 51 |   5
 52 |  80
 52 | 991
 55 |  27
... | ...

Now if we scan the index for a =ANY(ARRAY[50,51]), the values for b found will be 1, 2, 595, 973, 5, 80 and 991 in that order. If we have to return the results with ORDER BY b, we need an additional sort. The only exception would be if the array contains only a single element, but PostgreSQL does not specifically check for that.

So PostgreSQL chooses to scan the other index, because it estimates that searching the table in sorted order and stopping when it encounters the first row that matches the =ANY condition will be cheaper.

If you remove the ORDER BY clause, PostgreSQL can stop scanning after the first hit in both cases, and using the first index is cheaper.