I'm trying to find how I can search for AND update a value inside an array of JSON objects, where the key and value I'm looking for can be at any nesting level.
I've created a dbfiddle with some sample data. To make it easier to read the JSON data, here's a gist with a pretty format.
So, let's take this in parts.

I need to replace the values of url in the media_items table, and the values of url and originalUrl in the document_revisions table, with new values for a server migration: url will point to my server and originalUrl will point to an AWS S3 bucket.

Let's look at this media_item entry example:
// media_item simplified example...
{
"key": "2021/1/29/04aaf6f1-bf13-4918-a062-355bdf192d2f.jpeg",
"url": "https://my.old.url/api/v1/images/4166289-f-542-d-445-b-a-0-d-1-ec-7916-c-0748-e.jpeg?id=MQkvMjAyMi8wMS8yMC80MTY2Mjg5Zi01NDJkLTQ0NWItYTBkMS1lYzc5MTZjMDc0OGUuanBlZwk4OTY%3D",
}
This key is the actual path/URL to the resource on my server and S3 bucket, and url allows me to get the "original" value, which is then used in the documents/revisions in document_revisions. The "original" url also contains a base64 encoded id param, which is just the key as a base64 string.
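For illustration, the id param can be unpacked with decode/convert_from (a sketch: the base64 string below is the one from the example above after URL-decoding, so the trailing %3D becomes =; the decoded value carries the key between other tab-separated fields):

```sql
-- Decode the id query parameter; the result contains the key
-- (alongside other tab-separated fields in this data set).
select convert_from(
  decode('MQkvMjAyMi8wMS8yMC80MTY2Mjg5Zi01NDJkLTQ0NWItYTBkMS1lYzc5MTZjMDc0OGUuanBlZwk4OTY=', 'base64'),
  'UTF8'
);
```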
The current approach I'm taking has me:

1. Read the values of url and key from the media_items table (in the fiddle) into an item record variable (which then gets iterated upon).
2. Load the url value into an old_url variable (to re-use it during "searches"). The quotes are part of the value, so the REPLACE below matches the full JSON string:
old_url := '"https://my.old.url/api/v1/images/4166289-f-542-d-445-b-a-0-d-1-ec-7916-c-0748-e.jpeg?id=MQkvMjAyMi8wMS8yMC80MTY2Mjg5Zi01NDJkLTQ0NWItYTBkMS1lYzc5MTZjMDc0OGUuanBlZwk4OTY%3D"';
3. Build a new_url variable with the new value, which is something like:
new_url := CONCAT('https://my.new.url', '/', item.key);
-- output: https://my.new.url/2021/1/29/04aaf6f1-bf13-4918-a062-355bdf192d2f.jpeg
4. Replace the media_items.asset->'url' entry with new_url AND look for the value of old_url in ALL revisions in document_revisions:
do $$
-- (...)
-- Update the media_items asset->'url'
UPDATE media_items mle
SET asset = REPLACE(
(mle.asset)::TEXT,
CONCAT('"url": ', old_url),
CONCAT('"url": ', '"', new_url, '"')
)::JSONB;
-- Update the document_revisions
UPDATE document_revisions dr
SET data = REPLACE(
(dr.data)::TEXT,
CONCAT('"url": ', old_url),
CONCAT('"url": ', '"', new_url, '"')
)::JSONB;
UPDATE document_revisions dr
SET data = REPLACE(
(dr.data)::TEXT,
CONCAT('"originalUrl": ', old_url),
CONCAT('"originalUrl": ', '"', new_url, '"')
)::JSONB;
-- (...)
end; $$;
Now, as you can see, this is not a very good/performant approach: I'm casting JSON to text, finding a certain substring (which represents a JSON key: value pair) and then replacing it with the new value.

It is, however, fairly simple to target the url key in the media_items table, so I could simply change to a more direct approach:
-- A much better solution to the media_items update above
UPDATE
media_items mle
SET
asset = JSONB_SET(asset, '{url}', CONCAT('"', new_url, '"')::JSONB)
WHERE
item.id = mle.id;
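As a side note, to_jsonb produces a quoted JSON string value directly, so the manual quote concatenation can be avoided (a sketch with the same effect as the CONCAT above; new_url and item are the loop variables from earlier):

```sql
UPDATE media_items mle
SET asset = JSONB_SET(asset, '{url}', to_jsonb(new_url))
WHERE mle.id = item.id;
```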
On the other hand, the same is NOT true for the revisions in document_revisions, as url and originalUrl can be at any depth of the JSON objects in the document_revisions.data->'data'->'content' item array (look at the sample data in the gist; it's pretty complete, even though the text components have been heavily trimmed for readability). I'm not sure how to get the array index of the JSON object to update, nor how to update its property/key.
I did find a couple possible solutions that used jsonb operators, but I'm not sure how to use them to achieve my intended result:
-- This will return all the revisions which contain an object, at any depth, whose key is `url`.
SELECT dr.*
FROM document_revisions dr
WHERE (dr.data->'content')::jsonb @? 'strict $.** ? (exists (@."url"))';
-- This will return all the revisions that contain an json value equal to the defined url
SELECT dr.*
FROM document_revisions dr
WHERE (dr.data->'content')::JSONB @? 'strict $.** ? (@ == "https://my.old.url/api/v1/images/covid-19-m-rna-vaccines.png?id=MQkvMjAyMS82LzEvZWI2YzQwZjItMjYzMS00MTY4LWFhZGQtNjZjMmVkOTg5ZDQ2LnBuZwk4MTQ=")';
I tried combining the two options (meaning, find by key AND value) with something like the next example, but that doesn't work, nor have I found a variation of it that works:
SELECT dr.*
FROM document_revisions dr
WHERE (dr.data->'content')::jsonb @? 'strict $.** ? (exists (@."url") && @ == "https://server.livingdocs.io/api/v1/images/a.jpg?id=MQkvMjAyMC8xMS8yNS9mNjgyN2E5ZC0xZjViLTQzZGUtOGNiNy1iY2RhNjgyYTFhMjMuanBlZwk4OTY=")';
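One variation that may behave better (an assumption worth verifying against the real data): instead of combining exists() with @ ==, compare the url member's value directly in the filter, so the predicate is evaluated against each object found by $.**:

```sql
-- Match objects, at any depth, whose url member equals the given value.
SELECT dr.*
FROM document_revisions dr
WHERE (dr.data->'content')::jsonb @? 'strict $.** ? (@."url" == "https://server.livingdocs.io/api/v1/images/a.jpg?id=MQkvMjAyMC8xMS8yNS9mNjgyN2E5ZC0xZjViLTQzZGUtOGNiNy1iY2RhNjgyYTFhMjMuanBlZwk4OTY=")';
```

In strict mode, structural errors inside a filter (e.g. applying the member accessor to a scalar) are suppressed, so non-objects visited by $.** are simply skipped.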
The only simple way I've found so far is string replacement, but it is slow and inefficient: it took me over 2 hours to update 150 media assets, each of which can correspond to anywhere between 1 and N revisions. I currently have 6k assets and 23k revisions, where each revision can use N assets, so this approach doesn't scale.
Now, my questions are:
Thanks
Based on @Ajax1234's answer (thank you!), the final solution ended up being a set of functions that are available in this dbfiddle.
A possible method for handling arbitrarily nested data is to find all the JSON paths in document_revisions.data that point to the value of url or originalUrl keys, and then use a function with a loop to repeatedly update document_revisions in place.
First, update media_items:
update media_items set asset = asset || jsonb_build_object('new_url', 'https://my.new.url/'||((asset -> 'key')#>>'{}'));
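The result can be spot-checked before moving on (a sketch; new_url is the member added by the statement above):

```sql
select asset ->> 'key' as key, asset ->> 'new_url' as new_url
from media_items
limit 5;
```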
Then, create a cte that will return all paths for every value in each data:
create or replace function document_paths() returns table(id text, path text, arr_js jsonb, obj_js jsonb)
language sql
as $$
with recursive cte(id, path, arr_js, obj_js) as (
select d.id, '', case when jsonb_typeof(d.data) = 'array' then d.data
else '[null]'::jsonb end,
case when jsonb_typeof(d.data) = 'object' then d.data
else '{"null":null}'::jsonb end
from document_revisions d
union all
select c.id, c.path ||(case when c.path != '' then ',' else '' end)||(case when arr_js.elem#>>'{}' is null then obj_js.key else arr_js.i::text end),
case when arr_js.elem#>>'{}' is null then
(case when jsonb_typeof(obj_js.value) = 'array' then obj_js.value
else '[null]'::jsonb end) else
(case when jsonb_typeof(arr_js.elem) = 'object' then '[null]'::jsonb when jsonb_typeof(arr_js.elem) = 'array' then arr_js.elem else '[]'::jsonb end) end,
case when obj_js.value#>>'{}' is null then
(case when jsonb_typeof(arr_js.elem) = 'object' then arr_js.elem
else '{"null":null}'::jsonb end) else
(case when jsonb_typeof(obj_js.value) = 'object' then obj_js.value
else (case when jsonb_typeof(obj_js.value) = 'array' then '{"null":null}'::jsonb
else '{}'::jsonb end) end) end
from cte c
cross join lateral (select row_number() over (order by 1) - 1 i, t.value elem
from jsonb_array_elements(c.arr_js) t) arr_js
cross join jsonb_each(c.obj_js) obj_js
)
select * from cte
$$;
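A quick sanity check of the helper (a sketch): the paths come back as comma-separated text, ready for the ('{...}')::text[] cast used by jsonb_set later.

```sql
-- Inspect the paths that end at url / originalUrl values.
select id, path
from document_paths()
where path ~ 'url' or path ~ 'originalUrl';
```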
Lastly, create a function (this can also be done in a procedure) to perform the updates:
create or replace function run_updates()
returns void as
$body$
declare
c record;
begin
for c in select * from document_paths() -- loop over each path
loop
if c.obj_js = '{}'::jsonb and (c.path ~ 'url' or c.path ~ 'originalUrl') then -- check if path is pointing to a url or originalUrl
-- perform update
update document_revisions set data = jsonb_set(data, ('{'||c.path||'}')::text[], m.asset -> 'new_url')
from media_items m where document_revisions.id = c.id and
((data::jsonb#>('{'||c.path||'}')::text[])#>>'{}') like ((m.asset -> 'url')#>>'{}')||'%';
end if;
end loop;
end
$body$
language plpgsql;
Explanation of the update query:
update document_revisions set data = jsonb_set(data, ('{'||c.path||'}')::text[],
m.asset -> 'new_url') -- using the path to the url in the document, update the url with the new_url in media_items
from media_items m -- join previously updated media_items to document_revisions
-- if there is a basic substring match (you might want to change the exact match conditions later on) between the old_url in media_values and the url found in document_revisions via the path
-- then perform the update above
where document_revisions.id = c.id and
((data::jsonb#>('{'||c.path||'}')::text[])#>>'{}') like ((m.asset -> 'url')#>>'{}')||'%'
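The function is then invoked once; wrapping the call in a transaction makes it easy to inspect the result before committing (a usage sketch):

```sql
begin;
select run_updates();
-- spot-check a few revisions before committing
select id, data from document_revisions limit 5;
commit;
```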
A more memory-efficient approach is to find all url and originalUrl values in document_revisions and then perform an in-place series of updates in a function. This way, data is not repeatedly loaded into memory all at once, as it is in the solution above.
First, update media_items:
update media_items set asset = asset || jsonb_build_object('new_url',
'https://my.new.url/'||((asset -> 'key')#>>'{}'));
Then, use a function to find all urls in document_revisions and perform the replacements:
create or replace function run_updates()
returns void as
$body$
declare
c record;
begin
for c in (select t.id, url from (select d.id,
regexp_matches(d.data::text, '(?:(?<=url"\:)(?:\s+)*"([^"]+))|(?:(?<=Url"\:)(?:\s+)*"([^"]+))', 'g') v
from document_revisions d) t cross join unnest(t.v) url
where url is not null)
loop
update document_revisions set data = replace(data::text, c.url, ((m.asset -> 'new_url')#>>'{}'))::jsonb from media_items m where document_revisions.id = c.id and
c.url like ((m.asset -> 'url')#>>'{}')||'%';
end loop;
end
$body$
language plpgsql;
As with the previous solution, a replacement of url or originalUrl with media_items.new_url only occurs when the matching asset's old url is a prefix of the url found in document_revisions:
c.url like ((m.asset -> 'url')#>>'{}')||'%'
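Either way, the migration can be verified afterwards with a leftover count (a sketch, assuming all old URLs share the https://my.old.url prefix used in the question):

```sql
select count(*) as remaining_old_urls
from document_revisions
where data::text like '%https://my.old.url%';
```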