Search code examples
arraysperformancepostgresqlindexingpostgresql-performance

Looking in array with a big set of input values


I have a table that looks like (with an example of number of rows in each to get the kind of ration):

expectedreportsnodes (1 000 000 rows):

 nodejoinkey   | integer  | not null
 nodeid        | text     | not null
 nodeconfigids | text[]   | 

nodeconfigids array generally contains 1-50 values.

And a second table:

expectedreports (10 000 rows):

 pkid       | integer  | not null
 nodejoinkey| integer  | not null
 ...

I want to query for all expected reports for which their exists a entry in nodeexpectedreports with a given nodeConfigId. I have potentially large amount of nodeConfigIds (thousands).

What is most efficient way of doing that?

For now, I have:

select E.pkid, E.nodejoinkey from expectedreports E 
inner join (
  select NN.nodejoinkey, NN.nodeid, NN.nodeconfigids from (
    select N.nodejoinkey, N.nodeid, unnest(N.nodeconfigids) as nodeconfigids  
    from expectedreportsnodes N
  ) as NN 
  where NN.nodeconfigids) IN( VALUES ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN')  )
  ) as NNN on E.nodejoinkey = NNN.nodejoinkey;

This seems to give the expected results but takes ages to execute.

What can be done to improve the query?

Updates:

  • the proposed answer with array overlap and indexes is vastly less efficient on my set-up. I'm not able to say why.
  • the following version seems to be the quickiest (again, not the least idea why - perhaps because I generally have few values in values in nodeconfigids?):

_

select E.pkid, E.nodejoinkey from expectedreports E
inner join (
  select NN.nodejoinkey, NN.nodeconfigids
  from (
    select N.nodejoinkey, N.nodeconfigids, 
           generate_subscripts(N.nodeconfigids,1) as v
    from expectedreportsnodes N
  ) as NN
  where NN.nodeconfigids[v] in(values ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN') )
) as NNN
on E.nodejoinkey = NNN.nodejoinkey

Solution

  • The key to performance is a GIN index on the array column. And work with operators that can use the index.

    CREATE INDEX ern_gin_idx ON expectedreportsnodes USING gin (nodeconfigids);
    

    Query:

    SELECT e.pkid, nodejoinkey 
    FROM   expectedreports e
    JOIN   expectedreportsnodes n USING (nodejoinkey)
    WHERE  n.nodeconfigids && '{cf1, cf2, ..., cfN}'::text[];
    

    This should work just fine for arrays of text, because the overlap operator && is supported by the default GIN operator class. Per documentation:

    Name        Indexed Data Type  Indexable Operators
    ...
    _text_ops   text[]             && <@ = @>
    ...
    

    Also be sure to have a plain btree index on expectedreports.nodejoinkey:

    CREATE INDEX expectedreports_nodejoinkey_idx ON expectedreports (nodejoinkey);
    

    Optimize with multicolumn index

    To optimize further for the given query, you could include the otherwise not useful column nodejoinkey to the index to allow index-only scans.

    To include the integer column, first install the additional module btree_gin, which provides the necessary GIN operator classes. Run once per database:

    CREATE EXTENSION btree_gin;
    

    Then:

    CREATE INDEX ern_multi_gin_idx ON expectedreportsnodes
    USING gin (nodejoinkey, nodeconfigids);
    

    Same query.
    Related answers with more details:

    Alternative with unnest()

    If the GIN index isn't an option (or doesn't perform to your expectations) you can still optimize the query.

    It's particularly efficient to unnest long input arrays (or use a VALUES expression like in your example) and then join to the derived table. An IN construct is typically the slowest option.

    SELECT e.pkid, nodejoinkey
    FROM  (
       SELECT DISTINCT n.nodejoinkey 
       FROM  (SELECT nodejoinkey, unnest(nodeconfigids) AS nodeconfigid
              FROM   expectedreportsnodes) n
       JOIN  (VALUES ('cf1'), ('cf2'), ..., ('cfN')) t(nodeconfigid) USING (nodeconfigid)
       ) n
    JOIN   expectedreports e USING (nodejoinkey);
    

    Modern form in Postgres 9.3+ with an implicit JOIN LATERAL:

    SELECT e.pkid, nodejoinkey
    FROM  (
       SELECT DISTINCT n.nodejoinkey 
       FROM  expectedreportsnodes n
           , unnest(n.nodeconfigids) nodeconfigid
       JOIN  unnest('{cf1, cf2, ..., cfN}'::text[]) t(nodeconfigid) USING (nodeconfigid)
       ) n
    JOIN   expectedreports e USING (nodejoinkey);
    

    For short input arrays, an ANY construct is faster:

    SELECT e.pkid, nodejoinkey
    FROM  (
       SELECT DISTINCT e.nodejoinkey 
       FROM   expectedreportsnodes e
       JOIN   unnest(e.nodeconfigids) u(nodeconfigid) 
              ON u.nodeconfigid = ANY ('{cf1, cf2, ..., cfN}'::text[])
       ) n
    JOIN   expectedreports e USING (nodejoinkey);