json linux unix jq jtc

Unnest a huge JSON array into individual JSON objects


I have a single JSON object like the one below:

{
    "someOtherArray": [ {} , {} ],
    "a": [
        {
            "item1": "item1_value",
            "item2": "item2_value"
        },
        {
            "item1": "item1_value",
            "item2": "item2_value"
        },
        {
            ....
        },
        
        100 million more objects
    ]
}

I'm trying to turn each element of the array into a separate JSON object, as below:

{ "a": { "item1": "item1_value", "item2": "item2_value" } }
{ "a": { "item1": "item1_value", "item2": "item2_value" } }

The raw file has millions of nested objects in a single JSON array, which I want to split into multiple individual JSON objects.


Solution

  • This is a response to the revised question (i.e., "I just want 'a'").

    You could just tweak the standard answer:

    jq --stream -nc '
      {"a": fromstream(2|truncate_stream(inputs | select(.[0][0]=="a")) )}
    '
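
To see what the filter does, here is a small sketch that runs it on a two-element sample. The file names `sample.json` and `out.ndjson` and the values `x1`, `y1`, etc. are made up for illustration, and it assumes jq 1.6 or later is on the PATH.

```shell
# Hypothetical sample input with the same shape as in the question.
cat > sample.json <<'EOF'
{
  "someOtherArray": [ {}, {} ],
  "a": [
    { "item1": "x1", "item2": "y1" },
    { "item1": "x2", "item2": "y2" }
  ]
}
EOF

# Keep only the stream events under "a", strip the two leading path
# components ("a" and the array index), and rebuild each element.
jq --stream -nc '
  {"a": fromstream(2|truncate_stream(inputs | select(.[0][0]=="a")) )}
' sample.json > out.ndjson

cat out.ndjson
# {"a":{"item1":"x1","item2":"y1"}}
# {"a":{"item1":"x2","item2":"y2"}}
```

Events for `someOtherArray` are discarded by the `select`, and the closing events for the `a` array itself are dropped by `truncate_stream` because their paths are no longer than the truncation depth.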
    

    Footnote: Execution Times

    The jq streaming parser is economical with memory at the expense of execution speed. If the input consists of an array of N small objects, then the execution time should be very roughly linear in N, and the memory requirements should be roughly constant.

    To give some idea of what to expect, I created an array of 10^8 objects similar to those described in the question. The file size was 4GB. On a 3GHz machine, reading the file took about 16 minutes of u+s time, but the "peak memory footprint" was only 1.2MB.
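
The generator used for that test is not shown, so the following is only a scaled-down sketch of how a comparable file could be built and checked. The file name `big.json`, the variable `N`, and the generator itself are assumptions, and jq 1.6 or later is assumed.

```shell
# Hypothetical generator, not the one used for the timings above.
# N is kept small here; the test described above used 10^8 objects.
N=1000
jq -nc --argjson n "$N" '
  {someOtherArray: [{}, {}],
   a: [range(0; $n) | {item1: "item1_value", item2: "item2_value"}]}
' > big.json

# Stream the file and count the emitted objects: one line per array element.
count=$(jq --stream -nc '
  {"a": fromstream(2|truncate_stream(inputs | select(.[0][0]=="a")) )}
' big.json | wc -l | tr -d ' ')
echo "$count"
```

This generator builds the whole array in memory before writing it, so it is only suitable for modest values of `N`. To compare time and memory on your own machine, the streaming command can be prefixed with `/usr/bin/time -v` (GNU) or `/usr/bin/time -l` (BSD/macOS).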

    gojq was slightly slower but required significantly more memory, the "peak memory footprint" being 8.4MB, and I suspect that the required memory grows with N.