
jq: extract one subtree from a huge (10 GB) JSON file via streaming


I have a database dump that consists of one huge JSON tree. I want to extract a single subtree, identified by a known top-level key, that is much smaller than the rest of the file.

{ "key1": { subtree1... }, "key2": { subtree2... }, ... }

How do I extract subtreeN with streaming jq?


Solution

  • In the following, we'll assume $key holds the key of interest.

    The key to efficiency is to stop reading as soon as the event stream produced by the --stream option has moved past the $key entry. To do so, we can define a helper function as follows. Note that it uses inputs, so jq must be invoked with the -n command-line option.

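    # Emit only the [path, leaf] events under the top-level key $key,
    # and break out early once the stream moves past that key.
    def filter($key):
      label $out
      | foreach inputs as $in ( null;
          # the state is null until the first event for $key has been seen
          if . == null
          then if $in[0][0] == $key then $in
               else empty
               end
          # first event for a different key after $key: stop reading
          elif $in[0][0] != $key then break $out
          else $in
          end;
          # keep [path, leaf] events; drop the 1-element closing events
          select(length==2) );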
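
To see why the helper tests $in[0][0] and keeps only events of length 2, it may help to look at the events that --stream emits. The following is a minimal sketch on a tiny made-up document (not the dump from the question); the variable name events is only there for illustration.

```shell
# Print the event stream jq generates for a tiny, made-up document.
events=$(printf '%s' '{"a":{"b":1},"c":[2]}' | jq -c --stream .)
printf '%s\n' "$events"
# [["a","b"],1]   <- [path, leaf] event: length 2
# [["a","b"]]     <- closing event for the object under "a": length 1
# [["c",0],2]
# [["c",0]]       <- closing event for the array under "c"
# [["c"]]         <- closing event for the top-level object
```

The first element of every path is the top-level key, which is what the helper compares against $key; the closing events carry no value and are not needed for the reconstruction.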
    

    The reconstruction of the desired key-value pair can now be accomplished as follows:

    reduce filter($key) as $in ({};
      setpath($in[0]; $in[1]) )
    

    Example input.json

    {
      "key1": {
        "subtree1": {
          "a": {"aa": [1,2,3]}
        }
      },
      "key2": {
        "subtree2": {
          "b1": {"bb": [11,12,13]},
          "b2": {"bb": [11,12,13]}
        }
      },
      "key3": {
        "subtree3": {
          "c": {"cc": [21,22,23]}
        }
      }
    }
    

    Illustration (assuming the two snippets above are saved together in extract.jq)

    jq -n -c --arg key "key2" --stream -f extract.jq input.json
    

    Output

    {"key2":{"subtree2":{"b1":{"bb":[11,12,13]},"b2":{"bb":[11,12,13]}}}}
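
If only the bare subtree is wanted, without the enclosing {"key2": ...} wrapper, one possible variation is to append .[$key] to the reduce step. The script below is a self-contained sketch of that idea, not part of the original answer: the scratch directory, the smaller sample input, and the variable names dir and subtree are assumptions made for illustration.

```shell
# Hypothetical scratch directory and file names, used only for this sketch.
dir=$(mktemp -d)

cat > "$dir/extract.jq" <<'EOF'
def filter($key):
  label $out
  | foreach inputs as $in ( null;
      if . == null
      then if $in[0][0] == $key then $in
           else empty
           end
      elif $in[0][0] != $key then break $out
      else $in
      end;
      select(length==2) );

# Rebuild {($key): subtree}, then unwrap it to get the bare subtree.
reduce filter($key) as $in ({};
  setpath($in[0]; $in[1]) )
| .[$key]
EOF

# Small stand-in for the real dump.
cat > "$dir/input.json" <<'EOF'
{"key1": {"a": [1, 2, 3]},
 "key2": {"b1": {"bb": [11, 12, 13]}, "b2": [14]},
 "key3": {"c": [21, 22, 23]}}
EOF

subtree=$(jq -n -c --arg key "key2" --stream -f "$dir/extract.jq" "$dir/input.json")
printf '%s\n' "$subtree"
# {"b1":{"bb":[11,12,13]},"b2":[14]}

rm -r "$dir"
```

Note that with this variation a key that does not occur in the input yields null rather than an empty object.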