Tags: sql, json, postgresql, postgresql-json

Recursively find json by key && value and update in-place


I'm trying to find out how I can search for AND update a value inside an array of JSON objects, where the key and value I'm looking for can be at any nesting level.

I've created a dbfiddle with some sample data. To make the JSON data easier to read, here's a gist with a pretty-printed version.

So, let's take this in parts:

  • What am I trying to achieve?

    I'm trying to replace the value of url in the media_items table, and the values of url and originalUrl in the document_revisions table, with new values for a server migration. url will point to my server and originalUrl will point to an AWS S3 bucket.

Let's look at this media_item entry example:

// media_item simplified example...
{
  "key": "2021/1/29/04aaf6f1-bf13-4918-a062-355bdf192d2f.jpeg",
  "url": "https://my.old.url/api/v1/images/4166289-f-542-d-445-b-a-0-d-1-ec-7916-c-0748-e.jpeg?id=MQkvMjAyMi8wMS8yMC80MTY2Mjg5Zi01NDJkLTQ0NWItYTBkMS1lYzc5MTZjMDc0OGUuanBlZwk4OTY%3D",
}

This key is the actual path/URL to the resource on my server and S3 bucket, and url lets me retrieve the "original" value, which is then used in the documents/revisions in document_revisions. The "original" url also contains a base64-encoded id param, which is just the key as a base64 string.
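
Purely for illustration, that id param can be inspected straight from PostgreSQL, assuming it really is plain base64 text (the value below is the sample id from above, URL-decoded so that %3D becomes =):

-- Decode the base64 "id" query parameter back to text (illustrative sketch)
SELECT convert_from(
  decode('MQkvMjAyMi8wMS8yMC80MTY2Mjg5Zi01NDJkLTQ0NWItYTBkMS1lYzc5MTZjMDc0OGUuanBlZwk4OTY=', 'base64'),
  'UTF8'
);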

  • How am I trying to do it?

The current approach I'm taking has me:

  1. read the values of url and key from the media_items table (in the fiddle) into an item record variable (which is then iterated over).

  2. I then load the url value into an old_url variable (to re-use it during "searches")

old_url := 'https://my.old.url/api/v1/images/4166289f-542d-445b-a0d1-ec7916c0748e.jpeg?id=MQkvMjAyMi8wMS8yMC80MTY2Mjg5Zi01NDJkLTQ0NWItYTBkMS1lYzc5MTZjMDc0OGUuanBlZwk4OTY%3D'
  3. create a new_url variable with the new value, which is something like:
new_url := CONCAT('https://my.new.url', '/', item.key)

-- output: https://my.new.url/2021/1/29/04aaf6f1-bf13-4918-a062-355bdf192d2f.jpeg
  4. With these variables, I'll update the media_items.asset->'url' entry to new_url AND look for the value of old_url in ALL revisions in document_revisions:
do $$
-- (...)

    -- Update the media_items asset->'url'
    UPDATE media_items mle
    SET asset = REPLACE(
      (mle.asset)::TEXT,
      CONCAT('"url": ', old_url
      CONCAT('"url": ', '"', new_url, '"')
    )::JSONB;
    
    -- Update the document_revisions 
    UPDATE document_revisions dr
    SET data = REPLACE(
        (dr.data)::TEXT,
        CONCAT('"url": ', old_url),
        CONCAT('"url": ', '"', new_url, '"')
    )::JSONB;

    UPDATE document_revisions dr
    SET data = REPLACE(
        (dr.data)::TEXT,
        CONCAT('"originalUrl": ', old_url),
        CONCAT('"originalUrl": ', '"', new_url, '"')
    )::JSONB;

-- (...)
end; $$;

Now, as you can see, this is not a very good or performant approach, as I'm relying on casting the JSON to text, finding a certain substring (which represents a JSON key: value pair) and then replacing it with the new value.

It is, however, fairly simple to target the url key in the media_items table, so I could simply change to a more directed approach:

-- A much better solution to the media_items update above
UPDATE
  media_items mle
SET 
  asset = JSONB_SET(asset, '{url}', CONCAT('"', new_url, '"')::JSONB)
WHERE
  item.id = mle.id;

On the other hand, the same is NOT true for the revisions in document_revisions, as url and originalUrl can be at any depth of the JSON objects in the document_revisions.data->'data'->'content' item array (look at the sample data in the gist; it's fairly complete, even though the text components have been heavily trimmed for readability). I'm not sure how to get the array index of the JSON object I need to update, nor how to update its property/key.

I did find a couple of possible solutions that use jsonb operators, but I'm not sure how to use them to achieve my intended result:

-- This will return all the revisions which contain an object, at any depth, whose key is `url`.
SELECT dr.*
FROM   document_revisions dr
WHERE  (dr.data->'content')::jsonb @? 'strict $.** ? (exists (@."url"))';

-- This will return all the revisions that contain a JSON value equal to the defined url
SELECT dr.*
FROM   document_revisions dr
WHERE  (dr.data->'content')::JSONB @? 'strict $.** ? (@ == "https://my.old.url/api/v1/images/covid-19-m-rna-vaccines.png?id=MQkvMjAyMS82LzEvZWI2YzQwZjItMjYzMS00MTY4LWFhZGQtNjZjMmVkOTg5ZDQ2LnBuZwk4MTQ=")';

I tried combining the two options (meaning, find by key AND value) with something along the lines of the next example, but that doesn't work, nor have I found a variation of it that does:

SELECT dr.*
FROM   document_revisions dr
WHERE  (dr.data->'content')::jsonb @? 'strict $.** ? (exists (@."url") && @ == "https://server.livingdocs.io/api/v1/images/a.jpg?id=MQkvMjAyMC8xMS8yNS9mNjgyN2E5ZC0xZjViLTQzZGUtOGNiNy1iY2RhNjgyYTFhMjMuanBlZwk4OTY=")';
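
A variation that compares the value of the url key directly (instead of comparing the object itself to a string) might be closer to what I want, but even if it matched, it would still only filter whole rows and wouldn't give me the path I need for an in-place update:

-- Sketch: match objects, at any depth, whose "url" equals a given value
SELECT dr.*
FROM   document_revisions dr
WHERE  (dr.data->'content')::jsonb @? 'strict $.** ? (@."url" == "https://server.livingdocs.io/api/v1/images/a.jpg?id=MQkvMjAyMC8xMS8yNS9mNjgyN2E5ZC0xZjViLTQzZGUtOGNiNy1iY2RhNjgyYTFhMjMuanBlZwk4OTY=")';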

The only simple way I've found so far to do this is string replacement, but it is slow and ineffective. It took me over 2 hours to update 150 media assets, each of which can correspond to anywhere from 1 to N revisions. I currently have 6k assets and 23k revisions, where each revision can use N assets, so this isn't going well...

Now, my questions are:

  • Is there a way to update, in place, a given JSON key/property at any index and level of an array?
  • Is there a way to speed up the string replacement?
  • How would you do this? I've tried creating temporary tables, unlogged tables, creating unindexed copies of the tables, and nothing works... Help please?

Thanks


EDIT:

Based on @Ajax1234's answer (thank you), the final solution ended up being a set of functions that are available in this dbfiddle.


Solution

  • A possible method for handling arbitrarily nested data is to find all the JSON paths in document_revisions.data that point to the value of url or originalUrl keys and then use a function with a loop to repeatedly update document_revisions in place.

    First, update media_items:

    update media_items set asset = asset || jsonb_build_object('new_url', 'https://my.new.url/'||((asset -> 'key')#>>'{}'));
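
    As a quick sanity check (column names as in the fiddle), the generated values can be inspected before going any further:

    -- Compare the old and the newly generated URL side by side (sketch)
    select id, asset ->> 'url' as old_url, asset ->> 'new_url' as new_url
    from media_items
    limit 5;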
    

    Then, create a function wrapping a recursive CTE that returns all paths for every value in each row's data:

    create or replace function document_paths() returns table(id text, path text, arr_js jsonb, obj_js jsonb)
    language sql
    as $$
      with recursive cte(id, path, arr_js, obj_js) as (
         select d.id, '', case when jsonb_typeof(d.data) = 'array' then d.data 
                      else '[null]'::jsonb end, 
                case when jsonb_typeof(d.data) = 'object' then d.data 
                  else '{"null":null}'::jsonb end
         from document_revisions d
         union all
         select c.id, c.path ||(case when c.path != '' then ',' else '' end)||(case when arr_js.elem#>>'{}' is null then obj_js.key else arr_js.i::text end), 
            case when arr_js.elem#>>'{}' is null then 
               (case when jsonb_typeof(obj_js.value) = 'array' then obj_js.value 
                     else '[null]'::jsonb end) else 
                     (case when jsonb_typeof(arr_js.elem) = 'object' then '[null]'::jsonb when jsonb_typeof(arr_js.elem) = 'array' then arr_js.elem else '[]'::jsonb end) end,
             case when obj_js.value#>>'{}' is null then 
               (case when jsonb_typeof(arr_js.elem) = 'object' then arr_js.elem 
                     else '{"null":null}'::jsonb end) else 
                     (case when jsonb_typeof(obj_js.value) = 'object' then obj_js.value 
                           else (case when jsonb_typeof(obj_js.value) = 'array' then '{"null":null}'::jsonb 
                                   else '{}'::jsonb end) end) end
         from cte c 
         cross join lateral (select row_number() over (order by 1) - 1 i, t.value elem 
                     from jsonb_array_elements(c.arr_js) t) arr_js 
         cross join jsonb_each(c.obj_js) obj_js
      )
      select * from cte
    $$;
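
    Before touching any data, the paths the function produces can be previewed, e.g. only those that point at a url or originalUrl key (a sketch using the columns defined above):

    -- Preview the paths that the update loop below will visit
    select id, path
    from document_paths()
    where path ~ 'url' or path ~ 'originalUrl';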
    

    Lastly, create a function (this could also be done as a procedure) to perform the updates:

    create or replace function run_updates()
      returns void as
    $body$
    declare
        c record;
    begin
      for c in select * from document_paths() -- loop over each path
      loop
        if c.obj_js = '{}'::jsonb and (c.path ~ 'url' or c.path ~ 'originalUrl') then -- check if path is pointing to a url or originalUrl
          -- perform update
          update document_revisions set data = jsonb_set(data, ('{'||c.path||'}')::text[], m.asset -> 'new_url') 
          from media_items m where document_revisions.id = c.id and 
            ((data::jsonb#>('{'||c.path||'}')::text[])#>>'{}') like ((m.asset -> 'url')#>>'{}')||'%';
        end if;
      end loop;
    end
    $body$
    language plpgsql;
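
    The whole migration then comes down to a single call:

    select run_updates();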
    

    Explanation of update query:

    update document_revisions set data = jsonb_set(data, ('{'||c.path||'}')::text[], 
          m.asset -> 'new_url') -- using the path to the url in the document, update the url with the new_url in media_items
    from media_items m -- join previously updated media_items to document_revisions
    -- if there is a basic prefix match (you might want to tighten the match condition later on) between the old url in media_items and the url found in document_revisions via the path
    -- then perform the update above
    where document_revisions.id = c.id and 
            ((data::jsonb#>('{'||c.path||'}')::text[])#>>'{}') like ((m.asset -> 'url')#>>'{}')||'%'
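
    For instance, assuming a hypothetical path value of content,0,url, the concatenation builds exactly the text array form that jsonb_set expects:

    select ('{'||'content,0,url'||'}')::text[];  -- result: {content,0,url}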
    

    See fiddle


    A more memory-efficient approach is to find all url and originalUrl values in document_revisions and then perform a series of in-place updates in a function. This way, the data is not repeatedly loaded into memory all at once, as it is in the solution above.

    First, update media_items:

    update media_items set asset = asset || jsonb_build_object('new_url', 
       'https://my.new.url/'||((asset -> 'key')#>>'{}'));
    

    Then, use a function to find all the URLs in document_revisions and perform the replacements:

    create or replace function run_updates()
      returns void as
    $body$
    declare
        c record;
    begin
      for c in (select t.id, url from (select d.id, 
            regexp_matches(d.data::text, '(?:(?<=url"\:)(?:\s+)*"([^"]+))|(?:(?<=Url"\:)(?:\s+)*"([^"]+))', 'g') v 
          from document_revisions d) t cross join unnest(t.v) url
          where url is not null)
      loop
         update document_revisions set data = replace(data::text, c.url, ((m.asset -> 'new_url')#>>'{}'))::jsonb from media_items m where document_revisions.id = c.id and 
            c.url like ((m.asset -> 'url')#>>'{}')||'%';
      end loop;
    end
    $body$
    language plpgsql;
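
    As before, the migration is a single function call; wrapping it in a transaction makes it easy to spot-check the result before committing (sketch):

    begin;
    select run_updates();
    -- inspect a few rewritten revisions, then commit (or rollback)
    select id, data from document_revisions limit 5;
    commit;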
    

    See fiddle

    As with the previous solution, a replacement of url or originalUrl with media_items' new_url only happens when the old URL stored in media_items.asset->'url' is a prefix of the URL found in the revision:

    c.url like ((m.asset -> 'url')#>>'{}')||'%'