Search code examples
jsonpathbigdatajq

Include path of object when streaming out JSON from JQ


I've got the following (100GB+) JSON file which includes the following top level key:

{
   "data": {
      "<userId1>": {
         "things": {
            "<thingId1>": {
               "subfield1": "blah1",
               "subfield3": "foobar",
            },
            "<thingId2>": {
               "subfield2": "blah2"
            }
         }
      },
      "<userId2>": {
         "things": {
            "<thingId1>": {
               "subfield4": "blah3"
            },
            "<thingId2>": {
               "subfield3": "blah4"
            }
         }
      }
   }
}

I'd like each things object outputted with its path on a new line (for later newline delimited JSON usage down the pipeline), e.g. with the above it would output:

{ "path": "data/<userId1>/things/<thingId1>", "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": "data/<userId1>/things/<thingId2>", "value": { "subfield2": "blah2" } }
{ "path": "data/<userId2>/things/<thingId1>", "value": { "subfield4": "blah3" } }
{ "path": "data/<userId2>/things/<thingId2>", "value": { "subfield3": "blah4" } }

The separators in path can be . if needed but would be most useful if it was an actual string array of keys.

It also needs to be a bit generic as I'd also like the simpler case of:

{
   "users": {
      "<userId1>": {
         "name": "user1",
         "email": "[email protected]"
      },
      "<userId2>": {
         "name": "user2"
      },
      "<userId3>": {
         "email": "[email protected]"
      }
   }
}

To produce the output of:

{ "path": "users/<userId1>", "value": { "name": "user1", "email": "[email protected]" } }
{ "path": "users/<userId2>", "value": { "name": "user2" } }
{ "path": "users/<userId3>", "value": { "email": "[email protected]" } }

For the simpler case, I've managed to get JQ to output the key of each object with:

echo '{"users":{"<userId1>":{"name":"user1"},"<userId2>":{"name":"user2"}}}' | jq -rn --stream 'fromstream(1 | truncate_stream(inputs | select(.[0][0] == "users"))) | to_entries | map("\(. | tojson)") | join("\n")'

And while that works, I can't quite work out how to adapt it to output paths for the nested objects in the more complex example.

Also JQ-wise I don't know if piping the output of fromstream to more functions is realistic memory wise? We're talking about 10s of millions of users/data sub objects and possibly a similar amount of things sub objects.


Solution

  • You could still --stream the input but write your own interpreter that extracts and processes the parts needed. Here's an example that collects the content parts (having a value at position 1) into a result object (using setpath). The original path is split at two levels below the deepest occurrence of either things or users (or ignored, if neither was present), with the lower part being used for setpath, and the higher part being compared to the previously processed items's higher part path. If they match, the next item is added, otherwise a new result object is started while the old, "finished" one is returned (and discarded from memory after the first item of the next result object is added).

    #!/usr/bin/env -S jq --stream -ncf
    
    foreach (
    
      ( inputs | select(has(1))
      | (first | [rindex("things", "users")] | max | values) as $p
      | [.[0][$p+2:], (.[0][:$p+2] | join("/")), .[1]]), [[]]
    
    ) as [$vpath, $path, $value] ([];
    
      if [first.path, $path] | unique[1]
      then [{$path}, first] else .[:1] end
      | first.value |= setpath($vpath; $value);
    
      .[1] | values
    )
    
    {"path":"data/<userId1>/things/<thingId1>","value":{"subfield1":"blah1","subfield3":"foobar"}}
    {"path":"data/<userId1>/things/<thingId2>","value":{"subfield2":"blah2"}}
    {"path":"data/<userId2>/things/<thingId1>","value":{"subfield4":"blah3"}}
    {"path":"data/<userId2>/things/<thingId2>","value":{"subfield3":"blah4"}}
    {"path":"users/<userId1>","value":{"name":"user1","email":"[email protected]"}}
    {"path":"users/<userId2>","value":{"name":"user2"}}
    {"path":"users/<userId3>","value":{"email":"[email protected]"}}
    

    Demo (faking the stream with the tostream filter added, as there's no --stream option in jqplay)