I have a table notifications
which contain a payload
column of type jsonb
with a gin index on this column. The talbe currently contains 2,742,691 rows
The table looks something like this:
id | payload | created_at |
---|---|---|
1 | {"customer": {"email": "[email protected]", "externalId": 111 } | 2022-06-21 |
2 | {"customer": {"email": "[email protected]", "externalId": 222 } | 2022-06-20 |
3 | {"customer": {"email": "[email protected]", "externalId": 333 } | 2022-06-20 |
4 | {"customer": {"email": "[email protected]", "externalId": 444 } | 2022-04-14 |
5 | {"customer": {"email": "[email protected]", "externalId": 555 } | 2022-04-12 |
6 | {"customer": {"email": "[email protected]", "externalId": 666 } | 2022-06-10 |
7 | {"customer": {"email": "[email protected]", "externalId": 666 } | 2022-06-11 |
I am trying to query a list of email addresses that match the following condition:
email
address existexternalId
than one of the previous onescreated_at
is within the last monthFor the example table contents, this should only return [email protected]
because
[email protected]
only appears once[email protected]
doesnt have a row that was created within the last month[email protected]
has multiple rows but all of them have the same externalId
What I was trying is using a LEFT JOIN LATERAL
like this:
select
n.payload -> 'customer' -> 'email'
from
notifications n
left join lateral (
select
n2.payload -> 'customer' ->> 'externalId' tid
from
notifications n2
where
n2.payload @> jsonb_build_object(
'customer',
jsonb_build_object('email', n.payload -> 'customer' -> 'email')
)
and not (n2.payload @> jsonb_build_object(
'customer',
jsonb_build_object('externalId', n.payload -> 'customer' -> 'externalId')
))
and n2.created_at > NOW() - INTERVAL '1 month'
limit
1
) sub on true
where
n.created_at > NOW() - INTERVAL '1 month'
and sub.tid is not null;
however, this is taking ages to run. The Query plan for this looks like https://explain.depesz.com/s/mriB
QUERY PLAN
Nested Loop (cost=0.17..53386349.38 rows=259398 width=32)
-> Index Scan using index_notifications_created_at on notifications n (cost=0.09..51931.08 rows=259398 width=514)
Index Cond: (created_at > (now() - '1 mon'::interval))
-> Subquery Scan on sub (cost=0.09..205.60 rows=1 width=0)
Filter: (sub.tid IS NOT NULL)
-> Limit (cost=0.09..205.60 rows=1 width=32)
-> Index Scan using index_notifications_created_at on notifications n2 (cost=0.09..53228.33 rows=259 width=32)
Index Cond: (created_at > (now() - '1 mon'::interval))
Filter: ((payload @> jsonb_build_object('customer', jsonb_build_object('email', ((n.payload -> 'customer'::text) -> 'email'::text)))) AND (NOT (payload @> jsonb_build_object('customer', jsonb_build_object('externalId', ((n.payload -> 'customer'::text) -> 'externalId'::text))))))
JIT:
Functions: 13
Options: Inlining true, Optimization true, Expressions true, Deforming true
Any pointers what I'm doing wrong here / how to optimize this?
PostgreSQL's planner has no real insight into the internals of the JSON values, so it doesn't know how many rows from n2 are expected to have the same email as some row from n does. For that matter, it doesn't even know that that is the question being considered, as it doesn't understand how @> interacts with the inner workings of jsonb_build_object. So it just uses some very generic row estimates for the planning, and probably overestimates the number of rows substantially.
Your best bet is probably to pull the email and externalId out of the JSONB and put them into real columns. This will make it easier both to make better plans available, and make better info available so that the planner can choose those better plans.
The better plan would be to use a composite index on (email, created_at)
so it can jump directly to the part of the index that satisfies both the email equality condition and the created_at inequality simultaneously.
If you can't refactor the data, then you could at least use an multi-column expressional index to get much of the benefit, for example on notifications ((payload -> 'customer' ->> 'email'), created_at)
Then you would need to rewrite the first @> condition in your query to look like:
n2.payload->'customer'->>'email' = n.payload -> 'customer' ->> 'email'
You could also include the externalId expression as the 3rd column in the index, in which case the other @> condition would need to be rewritten in an analogous way.
After building an expressional index, you should immediately manually ANALYZE the table, otherwise the planner won't have the expression statistics it might need to make the best choice.