postgresql common-table-expression jsonb hierarchical-data nested-object

Search for multiple values within the hierarchy of the same specified folder in a flat representation of a tree structure in a Postgres jsonb table

Data structure: The table has an PK 'id' and a jsonb column 'data'. The 'data' contains an array of objects 'instances'. Each 'instance' has some values and a 'path' array. The 'path' array is a flat representation of a deeply nested tree like hierarchical structure. Each 'path' consists of objects that have a string 'id', which is not unique, and an integer 'index' which is only unique on the same level (the same parent structure), but can repeat on different levels.

Example:

    "instances": [
        {
        "path": [
            {"id": "root", "index": 2},
            {"id": "folder1", "index": 0},
            {"id": "folder2", "index": 0},
            {"id": "folder3", "index": 0}
            ],
        "pdf": "pdf in 1,2,3",
        "info": "some other data"
        },
        ...
    ]

I need to be able to search for multiple specific values within the hierarchy of the same specified folder in a Postgres jsonb table.

For example, search for an item that has both "pdf in 1,2,3" AND "text in 1,2,3" values within the hierarchy of the same folder2 (meaning the folder2 within the same parent structure).

Here's the query I came up with:

    WITH indexed_paths AS ( 
        SELECT id, instance -> 'index' as inst_idx,
        MIN(CASE WHEN path_element @> '{"id": "folder2"}' THEN path_idx END)
            OVER (PARTITION BY id, instance -> 'index') AS searched_index,
        path_idx, path_element, instance
        FROM "flat",
             jsonb_array_elements(data -> 'instances') AS instance,
             jsonb_array_elements(instance -> 'path') WITH ORDINALITY arr(path_element, path_idx)
        WHERE
        instance -> 'path' @> '[{"id": "folder2"}]'
        ORDER BY id, inst_idx, path_idx
    ), combined_paths AS ( 
        SELECT id, jsonb_agg(path_element) as path, instance
        FROM indexed_paths
        WHERE path_idx <= searched_index
        GROUP BY id, inst_idx, instance
    ), combined_instances AS ( 
        SELECT id, path, jsonb_agg(instance) as instances
        FROM combined_paths
        GROUP BY id, path
    )
    SELECT * 
    FROM "flat" f
    WHERE EXISTS (
        SELECT 1
        FROM combined_instances ci
        WHERE
            ci.id = f.id
            AND ci.instances @> '[{"pdf": "pdf in 1,2,3"}, {"jpg": "jpg in 1,2,3"}]'::jsonb
    );

1. indexed_paths CTE expands each row's instances and each path of each instance into a separate row of 'path_element's, enumerating all the path_element's with indexes. If the path_element is the searched one, it takes it's index and writes it to a new column of all the rows with a matching id and instance index. if there are multiple occurrences of the searched path_element within the same id and instance index it takes the smallest one.
2. combined_paths CTE aggregates path_elements grouping them by id's and instance index, checking if the element_path index is <= to the searched index, this way reconstructing element_paths back but only up to the searched path_element.
3. combined_instances CTE aggregates the data of all instances by matching id's and reconstructed paths.
4. final SELECT statement is searching for the specified data inside the combined_instances and joins it with the original table by id's.*

It works exactly the way I want, but it's just too verbose and long. Is there any way to simplify it? Changing the data structure of the jsonb column is an option. Some other algotythm is also welcome. Basically any haelp would be gretly appreciated.

Solution

I'd try to use subqueries and lateral joins more instead of expanding and re-grouping rows in multiple CTEs:

SELECT id, to_jsonb(ancestor_path) AS ancestor_path, instances
FROM "flat" f,
LATERAL (
  SELECT element->>'id' AS ancestor_name, path[0:ancestor.idx] AS ancestor_path, jsonb_agg(instance) AS instances
  FROM jsonb_array_elements(f.data -> 'instances') AS instance,
  jsonb_to_record(instance) AS _i(path jsonb[]),
  unnest(path) WITH ORDINALITY AS ancestor(element, idx)
  GROUP BY path[0:ancestor.idx], element->>'id'
) AS data
WHERE ancestor_name = 'folder2'
  AND instances @> '[{"pdf": "pdf in 1,2,3"}, {"jpg": "jpg in 1,2,3"}]'::jsonb

^{(online demo of this and the below approaches)}

or with EXISTS, if you want only the whole flat row (regardless how many matches there are within its data):

SELECT *
FROM "flat" f
WHERE EXISTS (
  SELECT 1
  FROM jsonb_array_elements(f.data -> 'instances') AS instance,
  jsonb_to_record(instance) AS _i(path jsonb[]),
  unnest(path) WITH ORDINALITY AS ancestor(element, idx)
  WHERE element->>'id' = 'folder2'
  GROUP BY path[0:ancestor.idx]
  HAVING jsonb_agg(instance) @> '[{"pdf": "pdf in 1,2,3"}, {"jpg": "jpg in 1,2,3"}]'::jsonb
)

An alternative to the GROUP BY and searching in the aggregated instances array would be a self-join of a CTE relation:

SELECT *
FROM "flat" f
WHERE EXISTS (
  WITH instances AS (
    SELECT value, element->>'id' AS ancestor_name, path[0:ancestor.idx] AS ancestor_path
    FROM jsonb_array_elements(f.data -> 'instances') AS el(value),
    jsonb_to_record(value) AS _i(path jsonb[]),
    unnest(path) WITH ORDINALITY AS ancestor(element, idx)
  )
  SELECT *
  FROM instances a JOIN instances b USING (ancestor_path, ancestor_name)
  WHERE ancestor_name = 'folder2'
    AND a.value @> '{"pdf": "pdf in 1,2,3"}'
    AND b.value @> '{"jpg": "jpg in 1,2,3"}'
)

Instead of returning ancestor_name as a separate column from the subquery, which makes the GROUP BY or JOIN … USING more ugly, you can also just access the last element of ancestor_path:

SELECT *
FROM "flat" f,
LATERAL (
  SELECT to_jsonb(path[0:ancestor.idx]) AS ancestor_path, jsonb_agg(instance) AS instances
  FROM jsonb_array_elements(f.data -> 'instances') AS instance,
  jsonb_to_record(instance) AS _i(path jsonb[]),
  unnest(path) WITH ORDINALITY AS ancestor(element, idx)
  GROUP BY path[0:ancestor.idx], element->>'id'
) AS data
WHERE ancestor_path->-1->>'id' = 'folder2'
  AND instances @> '[{"pdf": "pdf in 1,2,3"}, {"jpg": "jpg in 1,2,3"}]'::jsonb

I guess the main trick that makes these queries shorter than yours is the use of array slicing to generate the ancestor path(s) and the use of jsonb_to_record for converting to jsonb array to a postgres array. This probably could have been achieved in any number of ways, including your window function (which notably only finds the upper path if there are two nested "folder2"s):

SELECT id, ancestor_path, instances
FROM "flat" f,
LATERAL (
  SELECT (
    SELECT jsonb_agg(element) FILTER (WHERE idx <= searched_index)
    FROM (
      SELECT *, MIN(idx) FILTER (WHERE element @> '{"id": "folder2"}') OVER () AS searched_index
      FROM jsonb_array_elements(instance -> 'path') WITH ORDINALITY anc(element, idx)
    ) AS path
  ) AS ancestor_path, jsonb_agg(instance) AS instances
  FROM jsonb_array_elements(f.data -> 'instances') AS instance
  GROUP BY ancestor_path
) AS data
WHERE instances @> '[{"pdf": "pdf in 1,2,3"}, {"jpg": "jpg in 1,2,3"}]'::jsonb