python · json · performance · jq

Optimising object retrieval from a large JSON array using jq


I need to retrieve an object at a specific index from a massive JSON array. The array contains 2,000,000 objects and the file size is around 5GB.

I've experimented with various approaches using jq in combination with Python, but performance remains an issue. Here are some of the methods I've tried:

  1. Direct indexing:

    jq -c '.[100000]' Movies.json
    
  2. Slurping and indexing:

    jq --slurp '.[0].[100000]' Movies.json
    
  3. Using nth():

    jq -c 'nth(100000; .[])' Movies.json
    

While these methods seem to work, they are too slow for my requirements. I've also tried using streams, which significantly improves performance:

jq -cn --stream 'nth(100000; fromstream(1|truncate_stream(inputs)))' Movies.json

However, as the index increases, so does the retrieval time, which I suspect is due to how streaming operates.

I understand that one option is to divide the file into chunks, but I'd rather avoid creating additional files by doing so.
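
For context, here is a simplified sketch of how I'm combining jq with Python; my actual code is more involved, but it boils down to a subprocess call around the streaming command above (the filename and index are just placeholders):

    import subprocess

    def get_item(path: str, index: int) -> str:
        """Return the JSON text of the array element at `index`, using jq --stream."""
        jq_filter = f"nth({index}; fromstream(1|truncate_stream(inputs)))"
        result = subprocess.run(
            ["jq", "-cn", "--stream", jq_filter, path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    print(get_item("Movies.json", 100000))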

JSON structure example:

[
    {
        "Item": {
            "Name": "Darkest Legend",
            "Year": 1992,
            "Genre": ["War"],
            "Director": "Sherill Eal Eisenberg",
            "Producer": "Arabella Orth",
            "Screenplay": ["Octavia Delmer"],
            "Cast": ["Johanna Azar", "..."],
            "Runtime": 161,
            "Rate": "9.0",
            "Description": "Robin Northrop Cymbre",
            "Reviews": "Gisela Seumas"
        },
        "Similars": [
            {
                "Name": "Smooth of Edge",
                "Year": 1985,
                "Genre": ["Western"],
                "Director": "Vitoria Eustacia",
                "Producer": "Auguste Jamaal Corry",
                "Screenplay": ["Jaquenette Lance Gibe"],
                "Cast": ["Althea Nicole", "..."],
                "Runtime": 96,
                "Rate": "6.5",
                "Description": "Ashlan Grobe",
                "Reviews": "Annnora Vasquez"
            }
        ]
    },
    ...
]

How could I improve the efficiency of object retrieval from such a large array?


Solution

  • If you don't want to create an intermediary file or database (e.g. along the lines suggested by @pmf) but want something that's faster than jq --stream, then consider the "JSON Machine"-based command-line program named jm (or the similar Python-based program named jm.py). Assuming input.json contains a JSON array and you just want the 1000th item, you could write:

    jm --limit 1000 input.json | sed -n '$p'
    

    or

    jm.py --limit 1000 input.json | sed -n '$p'
    

    The reason these should generally be (at least slightly) faster than using jq --stream is that JSON Machine does precisely what is needed here: it parses the top-level array just enough to "splat" it into its items. (A Python sketch of the same idea, using ijson, is given at the end of this answer.)


    Here is a performance summary (in increasing order of u+s times) for a file, 1e7.json, consisting of an array of 1e7 objects:

                                       u+s (secs)   mrss (bytes)
    jm.py --limit 100000 | sed -n '$p'     1.16          15163392
    jm --limit 100000 | sed -n '$p'        1.9           15532032
    
    gojq -n --stream ...skimpy...          2.54          10375168
    jq -n --stream ...skimpy...            3.3            1826816
    
    gojq -n --stream ...                   4.47         189091840
    jq -n --stream ...                     5.41           1843200
    
    jsonpointer /99999                    17.8         4636860416
    

    where ... is the jq program:

    nth(99999; fromstream( inputs|(.[0] |= .[1:]) | select(. != [[]]) ))
    

    and the skimpy program is:

    nth(99999; fromstream(1|truncate_stream(inputs)))
    

    For reference:

    $ /usr/bin/time -lp jq length 1e7.json
    10000000
    user 44.65
    sys 9.38
              5196390400  maximum resident set size
    

    Disclosure: I am the author of the jm and jm.py scripts.
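
    For illustration only, here is a rough Python sketch of the same idea using the third-party ijson streaming parser (this is not how jm or jm.py are implemented, and nth_item is just an illustrative name): it parses only the top-level array, yields its elements one at a time, and stops reading once the requested index has been produced.

    # Sketch only: assumes `pip install ijson`; jm/jm.py use JSON Machine, not ijson.
    import itertools
    import json
    import sys

    import ijson

    def nth_item(path, n):
        """Return the n-th (0-based) element of the top-level JSON array in `path`."""
        with open(path, "rb") as f:
            # "item" is ijson's prefix for each element of a top-level array;
            # islice stops the scan once element n has been produced.
            return next(itertools.islice(ijson.items(f, "item"), n, n + 1), None)

    if __name__ == "__main__":
        # Usage: python nth_item.py Movies.json 100000
        item = nth_item(sys.argv[1], int(sys.argv[2]))
        json.dump(item, sys.stdout, default=str)  # default=str: ijson yields Decimal for floats
        print()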