I have a table that looks like (with an example of number of rows in each to get the kind of ration):
expectedreportsnodes (1 000 000 rows):
nodejoinkey | integer | not null
nodeid | text | not null
nodeconfigids | text[] |
nodeconfigids
array generally contains 1-50 values.
And a second table:
expectedreports
(10 000 rows):
pkid | integer | not null
nodejoinkey| integer | not null
...
I want to query for all expected reports for which their exists a entry in nodeexpectedreports
with a given nodeConfigId
.
I have potentially large amount of nodeConfigIds
(thousands).
What is most efficient way of doing that?
For now, I have:
select E.pkid, E.nodejoinkey from expectedreports E
inner join (
select NN.nodejoinkey, NN.nodeid, NN.nodeconfigids from (
select N.nodejoinkey, N.nodeid, unnest(N.nodeconfigids) as nodeconfigids
from expectedreportsnodes N
) as NN
where NN.nodeconfigids) IN( VALUES ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN') )
) as NNN on E.nodejoinkey = NNN.nodejoinkey;
This seems to give the expected results but takes ages to execute.
What can be done to improve the query?
Updates:
_
select E.pkid, E.nodejoinkey from expectedreports E
inner join (
select NN.nodejoinkey, NN.nodeconfigids
from (
select N.nodejoinkey, N.nodeconfigids,
generate_subscripts(N.nodeconfigids,1) as v
from expectedreportsnodes N
) as NN
where NN.nodeconfigids[v] in(values ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN') )
) as NNN
on E.nodejoinkey = NNN.nodejoinkey
The key to performance is a GIN index on the array column. And work with operators that can use the index.
CREATE INDEX ern_gin_idx ON expectedreportsnodes USING gin (nodeconfigids);
Query:
SELECT e.pkid, nodejoinkey
FROM expectedreports e
JOIN expectedreportsnodes n USING (nodejoinkey)
WHERE n.nodeconfigids && '{cf1, cf2, ..., cfN}'::text[];
This should work just fine for arrays of text
, because the overlap operator &&
is supported by the default GIN operator class. Per documentation:
Name Indexed Data Type Indexable Operators
...
_text_ops text[] && <@ = @>
...
Also be sure to have a plain btree index on expectedreports.nodejoinkey
:
CREATE INDEX expectedreports_nodejoinkey_idx ON expectedreports (nodejoinkey);
To optimize further for the given query, you could include the otherwise not useful column nodejoinkey
to the index to allow index-only scans.
To include the integer
column, first install the additional module btree_gin
, which provides the necessary GIN operator classes. Run once per database:
CREATE EXTENSION btree_gin;
Then:
CREATE INDEX ern_multi_gin_idx ON expectedreportsnodes
USING gin (nodejoinkey, nodeconfigids);
Same query.
Related answers with more details:
unnest()
If the GIN index isn't an option (or doesn't perform to your expectations) you can still optimize the query.
It's particularly efficient to unnest long input arrays (or use a VALUES
expression like in your example) and then join to the derived table. An IN
construct is typically the slowest option.
SELECT e.pkid, nodejoinkey
FROM (
SELECT DISTINCT n.nodejoinkey
FROM (SELECT nodejoinkey, unnest(nodeconfigids) AS nodeconfigid
FROM expectedreportsnodes) n
JOIN (VALUES ('cf1'), ('cf2'), ..., ('cfN')) t(nodeconfigid) USING (nodeconfigid)
) n
JOIN expectedreports e USING (nodejoinkey);
Modern form in Postgres 9.3+ with an implicit JOIN LATERAL
:
SELECT e.pkid, nodejoinkey
FROM (
SELECT DISTINCT n.nodejoinkey
FROM expectedreportsnodes n
, unnest(n.nodeconfigids) nodeconfigid
JOIN unnest('{cf1, cf2, ..., cfN}'::text[]) t(nodeconfigid) USING (nodeconfigid)
) n
JOIN expectedreports e USING (nodejoinkey);
DISTINCT
.JOIN LATERAL
:
ANY
construct is faster:SELECT e.pkid, nodejoinkey
FROM (
SELECT DISTINCT e.nodejoinkey
FROM expectedreportsnodes e
JOIN unnest(e.nodeconfigids) u(nodeconfigid)
ON u.nodeconfigid = ANY ('{cf1, cf2, ..., cfN}'::text[])
) n
JOIN expectedreports e USING (nodejoinkey);