I need to retrieve an object at a specific index from a massive JSON array. The array contains 2,000,000 objects and the file size is around 5GB.
I've experimented with various approaches using jq in combination with Python, but performance remains an issue.
Here are some of the methods I've tried:
Direct indexing:
jq -c '.[100000]' Movies.json
Slurping and indexing:
jq --slurp '.[0][100000]' Movies.json
Using nth():
jq -c 'nth(100000; .[])' Movies.json
While these methods seem to work, they are too slow for my requirements. I've also tried using streams, which significantly improves performance:
jq -cn --stream 'nth(100000; fromstream(1|truncate_stream(inputs)))' Movies.json
However, as the index increases, so does the retrieval time, which I suspect is due to how streaming operates.
I understand that one option is to divide the file into chunks, but I'd rather avoid creating additional files by doing so.
JSON structure example:
[
{
"Item": {
"Name": "Darkest Legend",
"Year": 1992,
"Genre": ["War"],
"Director": "Sherill Eal Eisenberg",
"Producer": "Arabella Orth",
"Screenplay": ["Octavia Delmer"],
"Cast": ["Johanna Azar", "..."],
"Runtime": 161,
"Rate": "9.0",
"Description": "Robin Northrop Cymbre",
"Reviews": "Gisela Seumas"
},
"Similars": [
{
"Name": "Smooth of Edge",
"Year": 1985,
"Genre": ["Western"],
"Director": "Vitoria Eustacia",
"Producer": "Auguste Jamaal Corry",
"Screenplay": ["Jaquenette Lance Gibe"],
"Cast": ["Althea Nicole", "..."],
"Runtime": 96,
"Rate": "6.5",
"Description": "Ashlan Grobe",
"Reviews": "Annnora Vasquez"
}
]
},
...
]
How could I improve the efficiency of object retrieval from such a large array?
If you don't want to create an intermediary file or database (e.g. along the lines suggested by @pmf) but want something that's faster than jq --stream, then consider the "JSON Machine"-based command-line program named jm (or the similar Python-based program named jm.py). Assuming input.json contains a JSON array and you just want the 1000th item, you could write:
jm --limit 1000 input.json | sed -n '$p'
or
jm.py --limit 1000 input.json | sed -n '$p'
The reason these should generally be (slightly?) faster than using jq --stream is that JSON Machine does precisely what is needed here: minimally parse the top-level array with a view to "splatting" it.
Here is a performance summary (in increasing order of u+s times) for a file, 1e7.json, consisting of an array of 1e7 objects:
command                               u+s(secs)        mrss
jm.py --limit 100000 | sed -n '$p'         1.16    15163392
jm --limit 100000 | sed -n '$p'            1.9     15532032
gojq -n --stream ...skimpy...              2.54    10375168
jq -n --stream ...skimpy...                3.3      1826816
gojq -n --stream ...                       4.47   189091840
jq -n --stream ...                         5.41     1843200
jsonpointer /99999                        17.8   4636860416
where ... is the jq program:
nth(99999; fromstream( inputs|(.[0] |= .[1:]) | select(. != [[]]) ))
and the skimpy program is:
nth(99999; fromstream(1|truncate_stream(inputs)))
For reference:
$ /usr/bin/time -lp jq length 1e7.json
10000000
user 44.65
sys 9.38
5196390400 maximum resident set size
Disclosure: I am the author of the jm and jm.py scripts.