Select rows with jsonb column that have different values in one jsonb attribute and matching values in another jsonb attribute

I have a table notifications which contain a payload column of type jsonb with a gin index on this column. The talbe currently contains 2,742,691 rows

The table looks something like this:

id	payload	created_at
1	{"customer": {"email": "[email protected]", "externalId": 111 }	2022-06-21
2	{"customer": {"email": "[email protected]", "externalId": 222 }	2022-06-20
3	{"customer": {"email": "[email protected]", "externalId": 333 }	2022-06-20
4	{"customer": {"email": "[email protected]", "externalId": 444 }	2022-04-14
5	{"customer": {"email": "[email protected]", "externalId": 555 }	2022-04-12
6	{"customer": {"email": "[email protected]", "externalId": 666 }	2022-06-10
7	{"customer": {"email": "[email protected]", "externalId": 666 }	2022-06-11

I am trying to query a list of email addresses that match the following condition:

multiple rows for the same email address exist
one of those rows does have a different externalId than one of the previous ones
created_at is within the last month

For the example table contents, this should only return [email protected] because

[email protected] only appears once
[email protected] doesnt have a row that was created within the last month
[email protected] has multiple rows but all of them have the same externalId

What I was trying is using a LEFT JOIN LATERAL like this:

select
  n.payload -> 'customer' -> 'email'
from
  notifications n
  left join lateral (
    select
      n2.payload -> 'customer' ->> 'externalId' tid
    from
      notifications n2
    where
      n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('email', n.payload -> 'customer' -> 'email')
      )
      and not (n2.payload @> jsonb_build_object(
        'customer',
        jsonb_build_object('externalId', n.payload -> 'customer' -> 'externalId')
      ))
      and n2.created_at > NOW() - INTERVAL '1 month' 
    limit
      1
  ) sub on true
where
  n.created_at > NOW() - INTERVAL '1 month'
  and sub.tid is not null;

however, this is taking ages to run. The Query plan for this looks like https://explain.depesz.com/s/mriB

QUERY PLAN
Nested Loop  (cost=0.17..53386349.38 rows=259398 width=32)
  ->  Index Scan using index_notifications_created_at on notifications n  (cost=0.09..51931.08 rows=259398 width=514)
        Index Cond: (created_at > (now() - '1 mon'::interval))
  ->  Subquery Scan on sub  (cost=0.09..205.60 rows=1 width=0)
        Filter: (sub.tid IS NOT NULL)
        ->  Limit  (cost=0.09..205.60 rows=1 width=32)
              ->  Index Scan using index_notifications_created_at on notifications n2  (cost=0.09..53228.33 rows=259 width=32)
                    Index Cond: (created_at > (now() - '1 mon'::interval))
                    Filter: ((payload @> jsonb_build_object('customer', jsonb_build_object('email', ((n.payload -> 'customer'::text) -> 'email'::text)))) AND (NOT (payload @> jsonb_build_object('customer', jsonb_build_object('externalId', ((n.payload -> 'customer'::text) -> 'externalId'::text))))))
JIT:
  Functions: 13
  Options: Inlining true, Optimization true, Expressions true, Deforming true

Any pointers what I'm doing wrong here / how to optimize this?

Solution

PostgreSQL's planner has no real insight into the internals of the JSON values, so it doesn't know how many rows from n2 are expected to have the same email as some row from n does. For that matter, it doesn't even know that that is the question being considered, as it doesn't understand how @> interacts with the inner workings of jsonb_build_object. So it just uses some very generic row estimates for the planning, and probably overestimates the number of rows substantially.

Your best bet is probably to pull the email and externalId out of the JSONB and put them into real columns. This will make it easier both to make better plans available, and make better info available so that the planner can choose those better plans.

The better plan would be to use a composite index on (email, created_at) so it can jump directly to the part of the index that satisfies both the email equality condition and the created_at inequality simultaneously.

If you can't refactor the data, then you could at least use an multi-column expressional index to get much of the benefit, for example on notifications ((payload -> 'customer' ->> 'email'), created_at) Then you would need to rewrite the first @> condition in your query to look like:

n2.payload->'customer'->>'email' = n.payload -> 'customer' ->> 'email'

You could also include the externalId expression as the 3rd column in the index, in which case the other @> condition would need to be rewritten in an analogous way.

After building an expressional index, you should immediately manually ANALYZE the table, otherwise the planner won't have the expression statistics it might need to make the best choice.