
Postgres removing jsonb object from nested jsonb array


I have a deeply nested json structure. Each array contains a list of json objects, and I want to remove any of the objects that match on a field within the object.

{
  "name": "John smith",
  "items": {
    "unknown-key-1": {
      "file-array": [
        {
          "file-id": "file-1"
        },
        {
          "file-id": "file-2"
        },
        {
          "file-id": "file-3"
        }
      ]
    },
    "unknown-key-2": {
      "file-array": [
        {
          "file-id": "file-1"
        },
        {
          "file-id": "file-2"
        }
      ]
    }
  }
}

So for example, say I want to remove all objects where the file-id is "file-1". My UPDATE statement would remove all jsonb objects matching that file-id. After the UPDATE my jsonb column would look like this:

{
  "name": "John smith",
  "items": {
    "unknown-key-1": {
      "file-array": [
        {
          "file-id": "file-2"
        },
        {
          "file-id": "file-3"
        }
      ]
    },
    "unknown-key-2": {
      "file-array": [
        {
          "file-id": "file-2"
        }
      ]
    }
  }
}

I can achieve this when the array is at the top level of the json, or when the array is nested inside an object whose key is known. But in this case the keys are dynamically generated (i.e. "unknown-key-1", "unknown-key-2").
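For clarity, this is the intended transformation expressed in plain application code — a Python sketch (the function name `remove_file_objects` is illustrative, not part of the question):

```python
import json

# Sample value, matching the question's jsonb column.
doc = {
    "name": "John smith",
    "items": {
        "unknown-key-1": {"file-array": [{"file-id": "file-1"},
                                         {"file-id": "file-2"},
                                         {"file-id": "file-3"}]},
        "unknown-key-2": {"file-array": [{"file-id": "file-1"},
                                         {"file-id": "file-2"}]},
    },
}

def remove_file_objects(value, file_id):
    """Drop every object whose "file-id" matches, from every
    "file-array", whatever the dynamic keys under "items" are."""
    value = json.loads(json.dumps(value))      # cheap deep copy
    for contents in value["items"].values():   # keys are unknown, so iterate
        contents["file-array"] = [
            obj for obj in contents["file-array"]
            if obj.get("file-id") != file_id
        ]
    return value

result = remove_file_objects(doc, "file-1")
```

The hard part is doing the same thing inside a single SQL UPDATE.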

I understand that under proper circumstances I would normalise the data, as this is an anti-pattern, but I have no choice. Also, I want to do this with an UPDATE statement rather than a Postgres function.


Solution

  • Modifying the jsonb isn't as tricky as applying that as an update:

    select jdata #- array['items',item,'file-array',file_index-1]::text[]
    from test
        ,jsonb_each(jdata->'items') a(item, contents)
        ,jsonb_array_elements(contents->'file-array') with ordinality b(file_obj, file_index)
    where (file_obj->>'file-id') = 'file-1';
    

    For each row, this lists its items, and for each of those, what's under file-array. That produces the item names and file indexes that can be filtered in the where clause and used to subtract the path with #-. Indexes resulting from with ordinality are 1-based, while jsonb arrays are 0-based, hence the -1.
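The same enumeration can be mirrored in Python to see what the lateral joins yield (a sketch; `matching_paths` is an illustrative name, not a Postgres construct):

```python
def matching_paths(value, file_id):
    """Mimic the two lateral joins: iterating the dict plays jsonb_each
    over the dynamic keys under "items"; enumerate(..., start=1) plays
    WITH ORDINALITY; the if is the WHERE clause.  Paths carry
    ordinality - 1 so they are 0-based, as #- expects."""
    paths = []
    for item, contents in value["items"].items():            # jsonb_each
        for ordinality, obj in enumerate(contents["file-array"], start=1):
            if obj.get("file-id") == file_id:                # where clause
                paths.append(["items", item, "file-array", ordinality - 1])
    return paths

value = {"name": "John smith",
         "items": {"unknown-key-1": {"file-array": [{"file-id": "file-1"},
                                                    {"file-id": "file-2"},
                                                    {"file-id": "file-3"}]},
                   "unknown-key-2": {"file-array": [{"file-id": "file-1"},
                                                    {"file-id": "file-2"}]}}}
paths = matching_paths(value, "file-1")
```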

    The problem with using that directly as an update batch is that it produces a single, separate, atomic change for each file to be removed from each value, and each row would apply only one of those small changes. In your example, only the file-1 under unknown-key-2 or under unknown-key-1 would get removed, not both. See the notes in the UPDATE documentation:

    When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.
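The effect described in that quote can be reproduced outside SQL: each join row computes its deletion against the original value, so only one candidate per target row survives. A Python sketch (`path_delete` is an illustrative stand-in for #-, not a real API):

```python
import copy

def path_delete(value, path):
    """Minimal stand-in for jsonb's #- operator on this document shape:
    delete the element at the key/index path, returning a copy."""
    value = copy.deepcopy(value)
    target = value
    for step in path[:-1]:
        target = target[step]
    del target[path[-1]]
    return value

value = {"items": {"unknown-key-1": {"file-array": [{"file-id": "file-1"},
                                                    {"file-id": "file-2"}]},
                   "unknown-key-2": {"file-array": [{"file-id": "file-1"}]}}}
paths = [["items", "unknown-key-1", "file-array", 0],
         ["items", "unknown-key-2", "file-array", 0]]

# What UPDATE ... FROM sees: every candidate result is computed from the
# ORIGINAL value, and only one candidate per target row is kept.
candidates = [path_delete(value, p) for p in paths]

# What is actually wanted: fold every change on top of the previous one.
merged = value
for p in paths:
    merged = path_delete(merged, p)
```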

    You'd have to either keep re-running the same update until it no longer reports affected rows, or squash the updates so that each row gets a single cumulative change to apply: demo at db<>fiddle

    with recursive required_changes as (
        select id,
               jdata,
               item,
               file_index - 1 as file_index,
               row_number() over (partition by id) as update_round
        from test
            ,jsonb_each(jdata->'items') a(item, contents)
            ,jsonb_array_elements(contents->'file-array')
             with ordinality b(file_obj, file_index)
        where (file_obj->>'file-id') = 'file-1')
    ,iteratively_merged_changes as (
        select id,
               jdata #- array['items',item,'file-array',file_index]::text[] as jdata,
               2 as next_round
        from required_changes
        where update_round = 1
        union
        select a.id,
               a.jdata #- array['items',item,'file-array',file_index]::text[],
               a.next_round + 1
        from iteratively_merged_changes a
        join required_changes b
          on a.id = b.id
         and b.update_round = a.next_round)
    ,final_batch as (
        select distinct on (id) id, jdata
        from iteratively_merged_changes
        order by id, next_round desc)
    update test t set jdata = f.jdata
    from final_batch f where t.id = f.id
    returning t.*;
    

    The first CTE finds which path needs to be removed from which row; the second keeps applying those changes iteratively, one on top of another; and the third, with its distinct on, passes only the final value from the latest round, with all changes applied, to the outer update.

    WITH needs to be RECURSIVE for the second CTE to be able to self-reference, even though it's the only one that does.
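Stripped of SQL, the squashing logic is just a per-id fold. A Python sketch of the three CTEs (the names `path_delete` and `squash` are illustrative):

```python
import copy

def path_delete(value, path):
    # Minimal stand-in for jsonb's #- on this document shape.
    value = copy.deepcopy(value)
    target = value
    for step in path[:-1]:
        target = target[step]
    del target[path[-1]]
    return value

def squash(rows):
    """rows: (row_id, original_value, path) triples, i.e. what
    required_changes yields in update_round order.  Round 1 starts from
    the original value; every later round applies its path on top of the
    previous result; the last result per id (the distinct-on step) is
    what feeds the outer UPDATE."""
    final = {}
    for row_id, original, path in rows:
        base = final.get(row_id, original)   # round 1 starts from the original
        final[row_id] = path_delete(base, path)
    return final

value = {"items": {"unknown-key-1": {"file-array": [{"file-id": "file-1"},
                                                    {"file-id": "file-2"}]},
                   "unknown-key-2": {"file-array": [{"file-id": "file-1"}]}}}
rows = [(1, value, ["items", "unknown-key-1", "file-array", 0]),
        (1, value, ["items", "unknown-key-2", "file-array", 0])]
final = squash(rows)
```

One caveat, which applies to the SQL as well: the indexes are computed once, against the original arrays, so if a single file-array held two matching objects, the first deletion would shift the second index. In the question's example each array holds at most one match, so the rounds compose cleanly.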