Search code examples
javaperformanceiofileinputstream

Why is it that FileInputStream read is slower with bigger array


If I read bytes from a file into a byte[] I see that FileInputStream performance worse when the array is around 1 MB compared to 128 KB. On the 2 workstations I have tested it is almost twice as fast with 128 KB. Why is that?

import java.io.*;

public class ReadFileInChuncks 
{
    public static void main(String[] args) throws IOException 
    {
        byte[] buffer1 = new byte[1024*128];
        byte[] buffer2 = new byte[1024*1024];

        String path = "some 1 gb big file";

        readFileInChuncks(path, buffer1, false);

        readFileInChuncks(path, buffer1, true);
        readFileInChuncks(path, buffer2, true);
        readFileInChuncks(path, buffer1, true);
        readFileInChuncks(path, buffer2, true);
    }

    public static void readFileInChuncks(String path, byte[] buffer, boolean report) throws IOException
    {
        long t = System.currentTimeMillis();

        InputStream is = new FileInputStream(path);
        while ((readToArray(is, buffer)) != 0) {}

        if (report)
            System.out.println((System.currentTimeMillis()-t) + " ms");
    }

    public static int readToArray(InputStream is, byte[] buffer) throws IOException
    {
        int index = 0;
        while (index != buffer.length)
        {
            int read = is.read(buffer, index, buffer.length - index);
            if (read == -1)
                break;
            index += read;
        }
        return index;
    }
}

outputs

422 ms 
717 ms 
422 ms 
718 ms

Notice this is a redefinition of an already posted question. The other was polluted with unrelated discussions. I will mark the other for deletion.

Edit: Duplicate, really? I sure could make some better code to proof my point, but this does not answer my question

Edit2: I ran the test with every buffer between 5 KB and 1000 KB on
Win7 / JRE 1.8.0_25 and the bad performance starts at precis 508 KB and all subsequent. Sorry for the bad diagram legions, x is buffer size, y is milliseconds

enter image description here


Solution

  • TL;DR The performance drop is caused by memory allocation, not by file reading issues.

    A typical benchmarking problem: you benchmark one thing, but actually measure another.

    First of all, when I rewrote the sample code using RandomAccessFile, FileChannel and ByteBuffer.allocateDirect, the threshold has disappeared. File reading performance became roughly the same for 128K and 1M buffer.

    Unlike direct ByteBuffer I/O FileInputStream.read cannot load data directly into Java byte array. It needs to get data into some native buffer first, and then copy it to Java using JNI SetByteArrayRegion function.

    So we have to look at the native implementation of FileInputStream.read. It comes down to the following piece of code in io_util.c:

        if (len == 0) {
            return 0;
        } else if (len > BUF_SIZE) {
            buf = malloc(len);
            if (buf == NULL) {
                JNU_ThrowOutOfMemoryError(env, NULL);
                return 0;
            }
        } else {
            buf = stackBuf;
        }
    

    Here BUF_SIZE == 8192. If the buffer is larger than this reserved stack area, a temporary buffer is allocated by malloc. On Windows malloc is usually implemented via HeapAlloc WINAPI call.

    Next, I measured the performance of HeapAlloc + HeapFree calls alone without file I/O. The results were interesting:

         128K:    5 μs
         256K:   10 μs
         384K:   15 μs
         512K:   20 μs
         640K:   25 μs
         768K:   29 μs
         896K:   33 μs
        1024K:  316 μs  <-- almost 10x leap
        1152K:  356 μs
        1280K:  399 μs
        1408K:  436 μs
        1536K:  474 μs
        1664K:  511 μs
        1792K:  553 μs
        1920K:  592 μs
        2048K:  628 μs
    

    As you can see, the performance of OS memory allocation drastically changes at 1MB boundary. This can be explained by different allocation algorithms used for small chunks and for large chunks.

    UPDATE

    The documentation for HeapCreate confirms the idea about specific allocation strategy for blocks larger than 1MB (see dwMaximumSize description).

    Also, the largest memory block that can be allocated from the heap is slightly less than 512 KB for a 32-bit process and slightly less than 1,024 KB for a 64-bit process.

    ...

    Requests to allocate memory blocks larger than the limit for a fixed-size heap do not automatically fail; instead, the system calls the VirtualAlloc function to obtain the memory that is needed for large blocks.