Tags: java, file-io, io, binary-files

Java read huge file (~100GB) efficiently


I would like to read a huge binary file (~100GB) efficiently in Java. I have to process each line of it, and the line processing will happen in separate threads. I don't want to load the whole file into memory. Does reading in chunks work? What would be the optimum buffer size? Is there a formula for that?


Solution

  • If this is a binary file, then reading in "lines" does not make a lot of sense.

    If the file is really binary, then use a BufferedInputStream and read bytes one at a time into a byte[]. When you get to the byte that marks your end of "line", add the byte[] and the count of bytes in the line to a queue for your worker threads to process; see the sketch after the tips below.

    And repeat.

    Tips:

    • Use a bounded queue in case you can read lines faster than the workers can process them.
    • Recycle the byte[] objects to reduce garbage generation.
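
    A minimal sketch of that loop (assuming a newline byte as the record delimiter and a file named huge.bin; both are placeholders). For simplicity it copies each completed line onto the queue rather than recycling buffers, and the worker threads that drain the queue are elided:

    ```java
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BinaryLineReader {
        private static final int DELIMITER = '\n'; // placeholder: whatever marks your end of "line"

        public static void main(String[] args) throws IOException, InterruptedException {
            // Bounded queue: put() blocks when the workers fall behind.
            BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);

            try (BufferedInputStream in =
                    new BufferedInputStream(new FileInputStream("huge.bin"))) {
                byte[] line = new byte[8192];
                int count = 0;
                int b;
                while ((b = in.read()) != -1) {
                    if (b == DELIMITER) {
                        queue.put(Arrays.copyOf(line, count)); // hand off the completed line
                        count = 0;
                    } else {
                        if (count == line.length) {
                            line = Arrays.copyOf(line, line.length * 2); // grow for long lines
                        }
                        line[count++] = (byte) b;
                    }
                }
                if (count > 0) {
                    queue.put(Arrays.copyOf(line, count)); // trailing line with no delimiter
                }
            }
        }
    }
    ```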

    If the file is (really) text, then you could use BufferedReader and the readLine() method instead of calling read().
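
    For example, a sketch of the text variant (assuming a UTF-8 file named huge.txt; both are assumptions), handing each line to the same kind of bounded queue:

    ```java
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class TextLineReader {
        public static void main(String[] args) throws IOException, InterruptedException {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

            // Streams the file line by line; only the buffer and the current line are in memory.
            try (BufferedReader reader =
                    Files.newBufferedReader(Paths.get("huge.txt"), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    queue.put(line); // blocks if the workers fall behind
                }
            }
        }
    }
    ```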


    The above will give you reasonable performance. Depending on how much work has to be done to process each line, it may be good enough that there is no point optimizing the file reading. You can check this by profiling.

    If your profiling tells you that reading is the bottleneck, then consider using NIO with ByteBuffer or CharBuffer. It is more complicated, but potentially faster than read() or readLine().
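
    A rough outline of the NIO approach, reading fixed-size chunks through a FileChannel into a direct ByteBuffer (the file name and buffer size are placeholders, and the delimiter-scanning logic is elided):

    ```java
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class NioChunkReader {
        public static void main(String[] args) throws IOException {
            // A direct buffer avoids an extra copy between the OS and the JVM heap.
            ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);

            try (FileChannel channel =
                    FileChannel.open(Paths.get("huge.bin"), StandardOpenOption.READ)) {
                while (channel.read(buffer) != -1) {
                    buffer.flip(); // switch from filling to draining
                    while (buffer.hasRemaining()) {
                        byte b = buffer.get();
                        // ... scan for your delimiter and dispatch completed lines ...
                    }
                    buffer.clear(); // ready for the next chunk
                }
            }
        }
    }
    ```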


    Does reading in chunks work?

    Yes. Both BufferedReader and BufferedInputStream read in chunks under the covers.

    What will be the optimum buffer size?

    The exact buffer size is probably not that important. I'd make it a few KB or tens of KB.
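
    If you do want to set it explicitly, both buffered classes accept a buffer size in their constructors (the default is 8192 in both cases; the sizes and file names below are just illustrative):

    ```java
    import java.io.BufferedInputStream;
    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileReader;
    import java.io.IOException;

    public class BufferSizeExample {
        public static void main(String[] args) throws IOException {
            // Pass an explicit buffer size as the second constructor argument.
            try (BufferedInputStream in =
                        new BufferedInputStream(new FileInputStream("huge.bin"), 64 * 1024);
                 BufferedReader reader =
                        new BufferedReader(new FileReader("huge.txt"), 64 * 1024)) {
                // ... read as usual ...
            }
        }
    }
    ```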

    Any formula for that?

    No, there isn't a formula for an optimum buffer size. It will depend on variables that you can't quantify.