I am working with huge JSON files, ranging in size from 100 to 300 MB. To save disk space (and perhaps computation time?), I converted each JSON file into a .json.gz file and proceeded like this:
import gzip, json

def load_json_gz(json_file):
    with gzip.GzipFile(json_file, 'r') as f:
        return json.loads(f.read().decode('utf-8'))
json.loads didn't cause any issues with memory usage, but I would like to increase the parsing speed, so I tried py-yajl (not to be confused with yajl-py, which I also tried, but which took much longer since I was parsing the streamed JSON), like this:
yajl.loads(f.read().decode('utf-8'))
But despite sites claiming that yajl is faster than the json and simplejson libraries, I couldn't see any improvement in execution time. On the contrary, it took slightly longer than json. Am I missing something here? In what cases is yajl supposed to be faster than json/simplejson? Does the speed also depend on the structure of the JSON file?
My JSON file looks like this:

[
    {
        "bytes_sent": XXX,
        "forwardedfor": "-",
        "hostip": "XXX",
        "hostname": "XXX",
        "https": "on",
        "landscapeName": "XXX"
    },
    ...
]
I am aware that this is a subjective question and is likely to be closed, but I couldn't clear up my doubts anywhere else, and at the same time I would like to understand the differences between these libraries in more detail, hence asking here.
If you are reading the entire structure into memory at once anyway, the external library offers no benefit. The motivation for a tool like that is that it lets you process the data piecemeal, without having to load the whole thing into memory first, or at all. If your JSON is a list of things, process one thing at a time, via the callbacks the library offers.
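For illustration, here is a minimal sketch of that piecemeal approach using ijson, a streaming JSON parser that can use yajl as a backend. The file name and the process() handler are placeholders, not something from the original question:

import gzip
import ijson

# Stream records one at a time from a gzipped top-level JSON array,
# instead of decompressing and parsing the whole file in memory.
with gzip.open('data.json.gz', 'rb') as f:
    for record in ijson.items(f, 'item'):  # 'item' matches each element of the top-level list
        process(record)                    # placeholder: handle one record, then discard it

Because only one record is held in memory at a time, peak memory stays roughly constant regardless of file size; the trade-off is an iterator/callback style instead of a single loads() call over the whole document.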