Search code examples
postgresql

Effective JSON array search in batches, with data extraction


Data model

My database contains a table with unique id and JSON blob. JSON blob contains, among other fields, a list of "identifiers" that I have an index on. Identifiers are of a given type (ISIN, VALOR, etc.).
The identifier is held in a JSON subobject, with possibly a priority class (primary, or linked): {<type>: <identifier value>, <class>: true}.
By having each identifier in its own subobject, the array of identifiers can hold multiple identifiers of the same type, for example:

    {
        "identifiers": [
          {"isin": "XS1", "primary": true},
          {"valor": 1234, "linked": true},
          {"isin": "EXTRA"}
        ],
        "workflows": {},
        "slots": {}
    }

Lookup queries

Initial, single field search

Previously I had a query that provided a single array element, to be searched among all JSON blobs "identifiers" lists and return the Ids of those elements. For example the search criteria: [ {"isin": "XS1"} ] would return true for the aforementioned example.

New search rules

Now I want the priority class to filter more precisely.
Basically the conditions are as follows:

Return a record if it contains an identifier with the given type and value (of the search criteria):

  • and primary = true,
  • or linked = true but the object holds no primary identifier of that type
  • or one with no primary or linked class (either having no class at all, or a different class), and there exists no primary or linked identifier of the searched type in the object

Thus the search criteria [ {"isin": "EXTRA"} ] would return false for below identifiers

'{
        "identifiers": [
          {"isin": "XS1", "primary": true},
          {"valor": 1234},
          {"isin": "EXTRA", "linked":true}
        ],
        "workflows": {},
        "slots": {}
    }'

because isin and EXTRA match but there is already an isin which is primary, thus masking all other isins.

Desired query

Query plan for this single item query was good, because i could easly hardcode certain values like "type" in the query itself. Now I would like to make the query resolve a batch of identifiers the same way, i.e.:

'[
      [{"isin": "XS12"}, {"id": "XS12"}, {"schema" : "isin"}],
      [{"valor": 1234}, {"id": 1234}, {"schema" : "valor"}],
      [{"isin": "EXTRA"}, {"id": "EXTRA"}, {"schema" : "isin"}]
    ]'

The query is: Fiddle

WITH scheme_and_id AS (
    -- Convert the input JSON object into rows of key-value pairs
    SELECT jsonb_array_elements('[
      [{"isin": "XS12"}, {"id": "XS12"}, {"schema" : "isin"}],
      [{"valor": 1234}, {"id": 1234}, {"schema" : "valor"}],
      [{"isin": "EXTRA"}, {"id": "EXTRA"}, {"schema" : "isin"}]
    ]'::jsonb) as batch_json
),
     -- first build the JSON to be used to match the index
     resolve_id as (select -- select the JSON objects
                           "id",
                           (blob -> 'identifiers')    as "ids",
                           (scheme_and_id.batch_json[1] -> 'id')::text as identifier,
                           trim((scheme_and_id.batch_json[2] -> 'schema')::text, '"') as selected_schema
                    from blobstable, scheme_and_id
                    where (blob -> 'identifiers') @>  jsonb_build_array( scheme_and_id.batch_json[0])),
     candidates as (
         SELECT
             "id",
             selected_schema  as "schema",
             identifier as "identifier",
             (single_identifier -> 'linked')::boolean as "linked",
             (single_identifier -> 'primary')::boolean as "primary",
             (single_identifier -> selected_schema)::TEXT as selected_id,
             single_identifier
         FROM resolve_id ,
              LATERAL jsonb_array_elements("ids") single_identifier
         where  (single_identifier -> selected_schema)::TEXT is not null
     )
SELECT *
from candidates
where ("primary" and identifier = selected_id)        -- first rule
   or ("linked" and identifier = selected_id and
       not exists(select "id" from candidates where "primary")) -- second rule
   or (("primary" is Null or False) and ("linked" is Null or False) and identifier = selected_id and
       not exists(select "id" from candidates where "linked")) -- third rule
;

And the explanation: I provide arrays of params like so: [{"isin": "XS12"}, {"id": "XS12"}, {"schema" : "isin"}]

{"isin": "XS12"} is the identifier to be matched in:

where (blob -> 'identifiers') @>  jsonb_build_array( scheme_and_id.batch_json[0]))

{"id": "XS12"} and {"schema" : "isin"} parameters are needed to easily match it in the candidates conditions (extracting this dynamically was challenging because the type name can be isin, valor etc, as seen here: (trim is used because somehow schema was being returned with quotation marks

(scheme_and_id.batch_json[1] -> 'id')::text as identifier,
trim((scheme_and_id.batch_json[2] -> 'schema')::text, '"') 

After those operations "resolve_id" contains the "id" and "identifiers" blob of all arrays that contain identifiers from the "scheme_and_id"

And finally in "candidates" I extract needed fields from the blob for the final "where" logic. Addition of this "candidates" ads another nested loop to the explain plan. I am looking for a way to optimize this query. I was even thinking about making or the "where" logic using @> operator, to basically check if contains "type":"value" primary - true, and not contains others, but I see no way to do it like "not contains "isin": any, primary: true

Working example: Fiddle

Basic create table with test data, expected in the repsonse are all 3 selected ids

CREATE TABLE blobstable 
(
  id integer primary key,
  blob jsonb
);

INSERT INTO blobstable 
VALUES 
(1, '{
    "identifiers": [
      {"isin": "XS1", "primary": true},
      {"valor": 1234},
      {"isin": "EXTRA"}
    ],
    "workflows": {},
    "slots": {}
}'),
(2, '{
     "identifiers": [
      {"isin": "XS3", "primary": true},
      {"valor": 456, "linked": true},
      {"cusip": "TRGYN"}
    ],
    "workflows": {},
    "slots": {}
}'),
  (3, '{
    "identifiers": [
      {"isin": "XS12", "primary": true},
      {"valor": 678},
      {"isin": "XS5"}
    ],
    "workflows": {},
    "slots": {}
}')
;


Solution

  • I went with final query, as seen here: https://dbfiddle.uk/gwzGZX8j

           WITH scheme_and_id as (
                    select     (id_json -> 'identifier') AS queried_id,
                               trim((id_json-> 'scheme')::text, '"')AS scheme,
                               ordinal
                    from jsonb_array_elements('[
          {"identifier": "XS12", "scheme" : "isin"},
          {"identifier": 1234, "scheme" : "valor"},
          {"identifier": "EXTRA", "scheme" : "isin"}
        ]'::jsonb) WITH ORDINALITY as f(id_json, ordinal)),
                     -- first build the JSON to be used to match the index
                     resolve_id as (select -- select the JSON objects
                                           id,
                                           (blob -> 'identifiers')     as "ids",
                                           ordinal,
                                           queried_id,
                                           scheme
                                     from blobstable, scheme_and_id
                                    where (blob -> 'identifiers')  @>  jsonb_build_array( jsonb_build_object(scheme, queried_id))),
                     candidates as (
                         SELECT
                             id,
                             ordinal,
                             scheme,
                             queried_id,
                             (identifier_row -> scheme) as candidate_id,
                             (identifier_row -> 'primary')::boolean as "primary",
                             (identifier_row -> 'linked')::boolean as "linked"
                         FROM resolve_id ,
                              LATERAL jsonb_array_elements("ids") identifier_row
                     )
                SELECT id
                from candidates c
                         right join scheme_and_id on c.ordinal = scheme_and_id.ordinal
                    and ((c."primary" and c.queried_id = c.candidate_id)                                            -- first rule
                    or  (c."linked"   and c.queried_id = c.candidate_id
                        and not exists(select "id" from candidates where c.ordinal = candidates.ordinal and candidate_id is not null and "primary"))  -- second rule
                    or (("primary" is Null or False) and ("linked" is Null or False) and c.queried_id = c.candidate_id
                        and not exists(select "id" from candidates where c.ordinal = candidates.ordinal and candidate_id is not null and "linked")))  -- third rule
                order by scheme_and_id.ordinal;
    

    So I added sorting, since I need to return NULL for not found identifiers. I also improved the way I provide the input into the query to drop the nested arrays and value duplication. I also decided to go with this where clause, because it moves all the conditional logic to the final step, and I believe is easier to read.

           from candidates c
                         right join scheme_and_id on c.ordinal = scheme_and_id.ordinal
                    and ((c."primary" and c.queried_id = c.candidate_id)                                            -- first rule
                    or  (c."linked"   and c.queried_id = c.candidate_id
                        and not exists(select "id" from candidates where c.ordinal = candidates.ordinal and candidate_id is not null and "primary"))  -- second rule
                    or (("primary" is Null or False) and ("linked" is Null or False) and c.queried_id = c.candidate_id
                        and not exists(select "id" from candidates where c.ordinal = candidates.ordinal and candidate_id is not null and "linked")))  -- third rule
                order by scheme_and_id.ordinal;
    

    Thank you for help