I've got a (100GB+) JSON file which includes the following top-level key:
{
  "data": {
    "<userId1>": {
      "things": {
        "<thingId1>": {
          "subfield1": "blah1",
          "subfield3": "foobar"
        },
        "<thingId2>": {
          "subfield2": "blah2"
        }
      }
    },
    "<userId2>": {
      "things": {
        "<thingId1>": {
          "subfield4": "blah3"
        },
        "<thingId2>": {
          "subfield3": "blah4"
        }
      }
    }
  }
}
I'd like each things object output with its path, one per line (for newline-delimited JSON use further down the pipeline), e.g. with the above it would output:
{ "path": "data/<userId1>/things/<thingId1>", "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": "data/<userId1>/things/<thingId2>", "value": { "subfield2": "blah2" } }
{ "path": "data/<userId2>/things/<thingId1>", "value": { "subfield4": "blah3" } }
{ "path": "data/<userId2>/things/<thingId2>", "value": { "subfield3": "blah4" } }
The separator in path can be "." if needed, but it would be most useful if path were an actual string array of keys.
It also needs to be somewhat generic, as I'd also like to handle the simpler case of:
{
  "users": {
    "<userId1>": {
      "name": "user1",
      "email": "[email protected]"
    },
    "<userId2>": {
      "name": "user2"
    },
    "<userId3>": {
      "email": "[email protected]"
    }
  }
}
To produce the output of:
{ "path": "users/<userId1>", "value": { "name": "user1", "email": "[email protected]" } }
{ "path": "users/<userId2>", "value": { "name": "user2" } }
{ "path": "users/<userId3>", "value": { "email": "[email protected]" } }
For the simpler case, I've managed to get jq to output the key of each object with:
echo '{"users":{"<userId1>":{"name":"user1"},"<userId2>":{"name":"user2"}}}' | jq -rn --stream 'fromstream(1 | truncate_stream(inputs | select(.[0][0] == "users"))) | to_entries | map("\(. | tojson)") | join("\n")'
And while that works, I can't quite work out how to adapt it to output paths for the nested objects in the more complex example.
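For reference, these are the raw events --stream emits for a cut-down version of the simple example; the 1 | truncate_stream(...) step then strips the leading "users" element from each path before fromstream rebuilds the object:

echo '{"users":{"<userId1>":{"name":"user1"}}}' | jq -cn --stream 'inputs'
# [["users","<userId1>","name"],"user1"]
# [["users","<userId1>","name"]]
# [["users","<userId1>"]]
# [["users"]]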
Also, jq-wise, I don't know whether piping the output of fromstream into further filters is realistic memory-wise: we're talking about tens of millions of users/data sub-objects and possibly a similar number of things sub-objects.
You could still --stream the input but write your own interpreter that extracts and processes just the parts needed. Here's an example that collects the content events (those having a value at position 1) into a result object using setpath. Each event's path is split two levels below the deepest occurrence of either things or users (or the event is ignored if neither is present): the lower part is fed to setpath, and the higher part is compared to the higher part of the previously processed item. If they match, the next item is added to the current result object; otherwise a new result object is started while the old, "finished" one is returned (and discarded from memory after the first item of the next result object has been added). The [[]] appended to the foreach source acts as a sentinel so that the last result object is flushed as well.
#!/usr/bin/env -S jq --stream -ncf
foreach (
  ( inputs | select(has(1))
    | (first | [rindex("things", "users")] | max | values) as $p
    | [.[0][$p+2:], (.[0][:$p+2] | join("/")), .[1]]), [[]]
) as [$vpath, $path, $value] ([];
  if [first.path, $path] | unique[1]
  then [{$path}, first] else .[:1] end
  | first.value |= setpath($vpath; $value);
  .[1] | values
)
{"path":"data/<userId1>/things/<thingId1>","value":{"subfield1":"blah1","subfield3":"foobar"}}
{"path":"data/<userId1>/things/<thingId2>","value":{"subfield2":"blah2"}}
{"path":"data/<userId2>/things/<thingId1>","value":{"subfield4":"blah3"}}
{"path":"data/<userId2>/things/<thingId2>","value":{"subfield3":"blah4"}}
{"path":"users/<userId1>","value":{"name":"user1","email":"[email protected]"}}
{"path":"users/<userId2>","value":{"name":"user2"}}
{"path":"users/<userId3>","value":{"email":"[email protected]"}}
Demo (faking the stream with the tostream filter added, as there's no --stream option in jqplay)
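The equivalence that demo relies on: without the --stream option, the tostream filter produces the same event stream from an already-parsed document, so it can stand in for inputs when experimenting on small samples, e.g.:

echo '{"data":{"<userId1>":{"things":{"<thingId1>":{"subfield1":"blah1"}}}}}' | jq -c 'tostream'
# [["data","<userId1>","things","<thingId1>","subfield1"],"blah1"]
# [["data","<userId1>","things","<thingId1>","subfield1"]]
# [["data","<userId1>","things","<thingId1>"]]
# [["data","<userId1>","things"]]
# [["data","<userId1>"]]
# [["data"]]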