java json performance parallel-processing

How to improve Performance in JSON parsing


I have a scenario where a user uploads a zip file. The zip file can contain up to 4999 JSON files, and each JSON file can have up to 4999 nodes, which I parse into objects and eventually insert into a database. When I tested this scenario, parsing took 30-50 minutes.

I am looking for suggestions on the following approach:

  1. I want to read the JSON files in parallel: say, for a batch of 100 JSON files, I can have 50 threads running in parallel.

  2. Each thread will be responsible for parsing its JSON files, which might become another performance bottleneck, since each file has up to 4999 nodes to parse. So I was thinking of reading the nodes in batches of 100 as well, which would spawn another 50 child threads.

So in total there will be 2500 threads in the system, but this should help parallelize around 25,000,000 otherwise sequential operations.

Does this approach sound reasonable?


Solution

  • What you described should not take that long (30-50 min to parse); a JSON file with ~5k nodes is relatively small. The bottleneck will be in the database, during the mass insert, especially if you have indexes on the fields.

    So I suggest:

    1. Don't waste time on threading: unpacking and parsing the JSON files should be fast in your case. Focus on batch inserts and do them properly: queue 1000+ rows per batch and commit manually afterwards.
    2. Disable indexes before importing, especially full-text ones, and re-enable them (and reindex) afterwards.
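
The batch-insert suggestion could look roughly like the following with plain JDBC. This is only a minimal sketch under assumptions: the `nodes` table, its `file_name` and `payload` columns, and the `BatchInsertSketch` class are hypothetical names made up for illustration, and each row is simplified to a pair of strings. The batch size (for example 1000) is passed in by the caller.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper, not part of any library.
class BatchInsertSketch {

    // Assumed table and column names, for illustration only.
    static final String INSERT_SQL =
            "INSERT INTO nodes (file_name, payload) VALUES (?, ?)";

    /**
     * Inserts all rows using JDBC batching and a single manual commit.
     * Each row is a two-element array: {fileName, payload}.
     * Returns how many times executeBatch() was called.
     */
    static int insertAll(Connection conn, List<String[]> rows, int batchSize)
            throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // commit manually instead of once per row
        int flushes = 0;
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch(); // queue the row instead of sending it right away
                if (++pending == batchSize) {
                    ps.executeBatch(); // send one full batch to the database
                    flushes++;
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // send the final, partially filled batch
                flushes++;
            }
            conn.commit(); // one commit after all batches
        } catch (SQLException e) {
            conn.rollback(); // discard the uncommitted work on failure
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
        return flushes;
    }
}
```

Calling `BatchInsertSketch.insertAll(conn, rows, 1000)` from a single thread after parsing is probably enough to start with. Whether to commit once at the very end or once per batch depends on how much data you are willing to roll back on failure. How to disable and rebuild indexes is database-specific, so it is not shown here.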