I need to retrieve an object at a specific index from a massive JSON array. The array contains 2,000,000 objects and the file size is around 5GB.
I've experimented with various approaches using jq in combination with Python, but performance remains an issue.
Here are some of the methods I've tried:
Direct indexing:
jq -c '.[100000]' Movies.json
Slurping and indexing:
jq --slurp '.[0][100000]' Movies.json
Using nth():
jq -c 'nth(100000; .[])' Movies.json
While these methods seem to work, they are too slow for my requirements. I've also tried using streams, which significantly improves performance:
jq -cn --stream 'nth(100000; fromstream(1|truncate_stream(inputs)))' Movies.json
However, as the index increases, so does the retrieval time, which I suspect is due to how streaming operates.
I understand that one option is to divide the file into chunks, but I'd rather avoid creating additional files by doing so.
JSON structure example:
[
{
"Item": {
"Name": "Darkest Legend",
"Year": 1992,
"Genre": ["War"],
"Director": "Sherill Eal Eisenberg",
"Producer": "Arabella Orth",
"Screenplay": ["Octavia Delmer"],
"Cast": ["Johanna Azar", "..."],
"Runtime": 161,
"Rate": "9.0",
"Description": "Robin Northrop Cymbre",
"Reviews": "Gisela Seumas"
},
"Similars": [
{
"Name": "Smooth of Edge",
"Year": 1985,
"Genre": ["Western"],
"Director": "Vitoria Eustacia",
"Producer": "Auguste Jamaal Corry",
"Screenplay": ["Jaquenette Lance Gibe"],
"Cast": ["Althea Nicole", "..."],
"Runtime": 96,
"Rate": "6.5",
"Description": "Ashlan Grobe",
"Reviews": "Annnora Vasquez"
}
]
},
...
]
How could I improve the efficiency of object retrieval from such a large array?
If you don't want to create an intermediary file or database (e.g. along the lines suggested by @pmf) but want something that's faster than jq --stream, then consider the "JSON Machine"-based command-line program named jm (or the similar Python-based program named jm.py). Assuming input.json contains a JSON array and you just want the 1000th item, you could write:
jm --limit 1000 input.json | sed -n '$p'
or
jm.py --limit 1000 input.json | sed -n '$p'
The reason these should generally be (slightly?) faster than using jq --stream is that JSON Machine does precisely what is needed here: minimally parse the top-level array with a view to "splatting" it.
Here is a performance summary (in increasing order of u+s times) for a file, 1e7.json, consisting of an array of 1e7 objects:
command                               u+s(secs)        mrss
jm.py --limit 100000 | sed -n '$p'         1.16    15163392
jm --limit 100000 | sed -n '$p'            1.9     15532032
gojq -n --stream ...skimpy...              2.54    10375168
jq -n --stream ...skimpy...                3.3      1826816
gojq -n --stream ...                       4.47   189091840
jq -n --stream ...                         5.41     1843200
jsonpointer /99999                        17.8   4636860416
where ... is the jq program:
nth(99999; fromstream( inputs|(.[0] |= .[1:]) | select(. != [[]]) ))
and the skimpy program is:
nth(99999; fromstream(1|truncate_stream(inputs)))
For reference:
$ /usr/bin/time -lp jq length 1e7.json
10000000
user 44.65
sys 9.38
5196390400 maximum resident set size
Disclosure: I am the author of the jm and jm.py scripts.