arrays performance postgresql indexing postgresql-performance

Looking in array with a big set of input values

I have a table that looks like (with an example of number of rows in each to get the kind of ration):

expectedreportsnodes (1 000 000 rows):

 nodejoinkey   | integer  | not null
 nodeid        | text     | not null
 nodeconfigids | text[]   |

nodeconfigids array generally contains 1-50 values.

And a second table:

expectedreports (10 000 rows):

 pkid       | integer  | not null
 nodejoinkey| integer  | not null
 ...

I want to query for all expected reports for which their exists a entry in nodeexpectedreports with a given nodeConfigId. I have potentially large amount of nodeConfigIds (thousands).

What is most efficient way of doing that?

For now, I have:

select E.pkid, E.nodejoinkey from expectedreports E 
inner join (
  select NN.nodejoinkey, NN.nodeid, NN.nodeconfigids from (
    select N.nodejoinkey, N.nodeid, unnest(N.nodeconfigids) as nodeconfigids  
    from expectedreportsnodes N
  ) as NN 
  where NN.nodeconfigids) IN( VALUES ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN')  )
  ) as NNN on E.nodejoinkey = NNN.nodejoinkey;

This seems to give the expected results but takes ages to execute.

What can be done to improve the query?

Updates:

the proposed answer with array overlap and indexes is vastly less efficient on my set-up. I'm not able to say why.
the following version seems to be the quickiest (again, not the least idea why - perhaps because I generally have few values in values in nodeconfigids?):

select E.pkid, E.nodejoinkey from expectedreports E
inner join (
  select NN.nodejoinkey, NN.nodeconfigids
  from (
    select N.nodejoinkey, N.nodeconfigids, 
           generate_subscripts(N.nodeconfigids,1) as v
    from expectedreportsnodes N
  ) as NN
  where NN.nodeconfigids[v] in(values ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN') )
) as NNN
on E.nodejoinkey = NNN.nodejoinkey

Solution

The key to performance is a GIN index on the array column. And work with operators that can use the index.

CREATE INDEX ern_gin_idx ON expectedreportsnodes USING gin (nodeconfigids);

Query:

SELECT e.pkid, nodejoinkey 
FROM   expectedreports e
JOIN   expectedreportsnodes n USING (nodejoinkey)
WHERE  n.nodeconfigids && '{cf1, cf2, ..., cfN}'::text[];

This should work just fine for arrays of text, because the overlap operator && is supported by the default GIN operator class. Per documentation:

Name        Indexed Data Type  Indexable Operators
...
_text_ops   text[]             && <@ = @>
...

Also be sure to have a plain btree index on expectedreports.nodejoinkey:

CREATE INDEX expectedreports_nodejoinkey_idx ON expectedreports (nodejoinkey);

Optimize with multicolumn index

To optimize further for the given query, you could include the otherwise not useful column nodejoinkey to the index to allow index-only scans.

To include the integer column, first install the additional module btree_gin, which provides the necessary GIN operator classes. Run once per database:

CREATE EXTENSION btree_gin;

Then:

CREATE INDEX ern_multi_gin_idx ON expectedreportsnodes
USING gin (nodejoinkey, nodeconfigids);

Same query.
Related answers with more details:

Alternative with `unnest()`

If the GIN index isn't an option (or doesn't perform to your expectations) you can still optimize the query.

It's particularly efficient to unnest long input arrays (or use a VALUES expression like in your example) and then join to the derived table. An IN construct is typically the slowest option.

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT n.nodejoinkey 
   FROM  (SELECT nodejoinkey, unnest(nodeconfigids) AS nodeconfigid
          FROM   expectedreportsnodes) n
   JOIN  (VALUES ('cf1'), ('cf2'), ..., ('cfN')) t(nodeconfigid) USING (nodeconfigid)
   ) n
JOIN   expectedreports e USING (nodejoinkey);

Modern form in Postgres 9.3+ with an implicit JOIN LATERAL:

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT n.nodejoinkey 
   FROM  expectedreportsnodes n
       , unnest(n.nodeconfigids) nodeconfigid
   JOIN  unnest('{cf1, cf2, ..., cfN}'::text[]) t(nodeconfigid) USING (nodeconfigid)
   ) n
JOIN   expectedreports e USING (nodejoinkey);

Your original query can produce duplicate rows in the result. Folded with DISTINCT.
Details for JOIN LATERAL:
- Dynamically execute query using the output of another query

For short input arrays, an `ANY` construct is faster:

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT e.nodejoinkey 
   FROM   expectedreportsnodes e
   JOIN   unnest(e.nodeconfigids) u(nodeconfigid) 
          ON u.nodeconfigid = ANY ('{cf1, cf2, ..., cfN}'::text[])
   ) n
JOIN   expectedreports e USING (nodejoinkey);

Looking in array with a big set of input values

Optimize with multicolumn index

Alternative with unnest()

For short input arrays, an ANY construct is faster:

Alternative with `unnest()`

For short input arrays, an `ANY` construct is faster: