I’m parsing a really huge JSON file of 1.4 TB (it’s a WikiData dump, in case that matters). It’s so big that even simply counting the lines takes forever, even with optimizations like the ones in Number of lines in a file in Java. To speed things up I’m going to split the task and run it both across different SSDs on my main machine (so I probably get some extra disk throughput) and across other computers I have (maybe using Apache Spark).
And the question is: how do I start reading the file from a random position? Skipping lines one by one is obviously not an option :). I would also like to avoid physically splitting the file. It’s actually the easiest and most traffic/disk-space-efficient solution, but I would like to explore alternatives for some corner use cases.
Basically, I do the following:

JsonFactory f = new JsonFactory();
JsonParser jp = f.createParser(new File(inputFile));
while (jp.nextToken() != JsonToken.END_OBJECT) {
    // Fancy stuff
}
Is there a way to quickly jump to line #20,000,000?
Your question assumes that your JSON has line endings, which it most likely won't. Files this large are usually stripped of all unneeded characters, and line endings are certainly not needed in a JSON file.
You're already using the Jackson Streaming API, which is good, because it's your only realistic chance of processing a file this large. While you can't seek to a certain line, you can seek to a certain byte position using RandomAccessFile.seek(long). You need to "guesstimate" the position you want to jump to, based on the total file size.

Since the seek will most likely put you at an arbitrary position (e.g. inside an attribute value), you'll first need some custom parsing rules to find a valid starting point from which the JSON streaming parser can take over. Once you've figured out where exactly in the JSON you are, you can use the parser as usual.
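A minimal sketch of that approach, assuming Jackson 2.x and assuming the dump separates entities with newlines (verify that against your actual file; if there are no line endings, you'd scan for some other recognizable boundary instead). The names parserAt and startOffset are just illustrative:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;

public class ChunkedJsonReader {

    // Returns a parser positioned at the first entity boundary
    // at or after the guesstimated byte offset.
    static JsonParser parserAt(String path, long startOffset) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        raf.seek(startOffset); // jump straight to the byte position
        InputStream in = new BufferedInputStream(
                Channels.newInputStream(raf.getChannel()));
        if (startOffset > 0) {
            // We almost certainly landed mid-entity, so discard bytes
            // until the next delimiter (a newline is ASSUMED here).
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                // skip the rest of the partial entity
            }
        }
        return new JsonFactory().createParser(in);
    }
}

Each worker would then take a byte range of roughly fileSize / workerCount, open its parser at the start of its range, and stop once it has consumed past the end of its range, where the next worker picks up.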