
Using json.loads vs. yajl.loads for reading a large JSON file in Python


I am working with huge JSON files, with sizes ranging from 100 to 300 MB. To save disk space (and perhaps computation time?), I converted each JSON file into a .json.gz file and proceeded like this:

with gzip.GzipFile(json_file, 'r') as f:
    return json.loads(f.read().decode('utf-8'))

json.loads didn't cause any issues with memory usage, but I would like to increase the speed, so I tried py-yajl (not to be confused with yajl-py, which I tried as well, but it took much longer because I was parsing the streamed JSON), like this:

yajl.loads(f.read().decode('utf-8'))

I have seen sites claiming that yajl is faster than the json and simplejson libraries, yet I couldn't see any improvement in execution time; on the contrary, it took a bit more time than json. Am I missing something here? In what cases is yajl supposed to be faster than json/simplejson? Does the speed depend on the structure of the JSON file as well?
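
For context, here is a minimal sketch along the lines of how I compared the two parsers on the same decoded text (the file name and repeat count are placeholders):

import gzip
import json
import timeit

import yajl  # py-yajl

with gzip.open('logs.json.gz', 'rb') as f:  # placeholder file name
    raw = f.read().decode('utf-8')

# Time each parser on the same decoded string; results vary by machine
# and by the shape of the document.
for name, loads in [('json', json.loads), ('yajl', yajl.loads)]:
    print(name, timeit.timeit(lambda: loads(raw), number=3))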

My JSON file looks like this:

[
    {
        "bytes_sent": XXX,
        "forwardedfor": "-",
        "hostip": "XXX",
        "hostname": "XXX",
        "https": "on",
        "landscapeName": "XXX"
    },
    ...
]

I am aware that this is a subjective question and is likely to be closed, but I couldn't resolve my doubts anywhere else, and I would like to understand the differences between these libraries in more detail, hence asking here.


Solution

  • If you are reading the entire structure into memory at once anyway, an external library offers no benefit. The motivation for a tool like yajl is that it lets you process the document piecemeal, without having to load the entire thing into memory first, or at all. If your JSON is a list of things, process one thing at a time via the callbacks the library offers (see the sketch below).
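
For illustration, here is a minimal sketch of that piecemeal approach using the third-party ijson package (a different streaming parser than yajl-py, but the same idea; the file name and process() are placeholders):

import gzip

import ijson  # third-party streaming JSON parser: pip install ijson

with gzip.open('logs.json.gz', 'rb') as f:  # placeholder file name
    # 'item' matches each element of the top-level JSON array, so only
    # one record is materialized in memory at a time.
    for record in ijson.items(f, 'item'):
        process(record)  # placeholder per-record handler

This keeps peak memory proportional to a single record rather than to the whole 100-300 MB document.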