
Measuring performance of java.io.InputStream


I have a 5GB file that I want to read in chunks of, say, 2MB. Using java.io.InputStream works fine, so I measured it as follows:

static final byte[] buffer = new byte[2 * 1024 * 1024];

public static void main(String[] args) throws IOException {
    while (true) {
        long bytesRead = 0;
        int readCurrent;
        long start = System.nanoTime();
        try (InputStream is = new FileInputStream("/tmp/log_test.log")) {
            while ((readCurrent = is.read(buffer)) > 0) {
                bytesRead += readCurrent;
            }
        }
        long end = System.nanoTime();
        System.out.println(
            "Bytes read = " + bytesRead + ". Time elapsed = " + (end - start)
        );
    }
}

RESULT = 2121714428

It can be seen that on average it takes 2121714428 ns. The reason is that the implementation reads into a malloc'ed or stack-allocated buffer and then copies the data into the Java array with (*env)->SetByteArrayRegion(env, bytes, off, nread, (jbyte *)buf);, as shown here. Profiling shows that this memcpy takes a pretty large share of CPU time.
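As an aside, that extra copy can also be avoided without JNI: reading through a FileChannel into a direct (off-heap) ByteBuffer lets the kernel place the data straight into native memory, so no SetByteArrayRegion copy is involved. A minimal sketch, with the file path, file size and buffer size as illustrative placeholders:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectBufferRead {

    // Drains the whole file through a direct buffer; the read path
    // involves no Java-heap array, hence no SetByteArrayRegion copy.
    static long countBytes(Path path, int bufferSize) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(bufferSize);
        long total = 0;
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            int n;
            while ((n = ch.read(buf)) > 0) {
                total += n;
                buf.clear(); // reuse the same buffer for the next chunk
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Small stand-in file instead of the real 5GB log.
        Path tmp = Files.createTempFile("log_test", ".log");
        Files.write(tmp, new byte[5 * 1024 * 1024]);
        long start = System.nanoTime();
        long bytesRead = countBytes(tmp, 2 * 1024 * 1024);
        long end = System.nanoTime();
        System.out.println(
            "Bytes read = " + bytesRead + ". Time elapsed = " + (end - start));
        Files.delete(tmp);
    }
}
```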


Since the JNI spec states that

Inside a critical region, native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread. (For example, the current thread must not call read on a stream being written by another Java thread.)

I don't see any problem with reading from a regular file within a critical section. A read from a regular file blocks only briefly and does not depend on any Java thread. Something like this:

static final byte[] buffer = new byte[2 * 1024 * 1024];

public static void main(String[] args) throws IOException {
    while (true) {
        int fd = open("/tmp/log_test.log");
        long bytesRead = 0;
        int readCurrent;
        long start = System.nanoTime();
        while ((readCurrent = read(fd, buffer)) > 0) {
            bytesRead += readCurrent;
        }
        long end = System.nanoTime();
        close(fd); // release the descriptor before the next iteration
        System.out.println("Bytes read = " + bytesRead + ". Time elapsed = " + (end - start));
    }
}

private static native int open(String path);

private static native int read(int fd, byte[] buf);

private static native void close(int fd);

JNI functions:

JNIEXPORT jint JNICALL Java_com_test_Main_open
  (JNIEnv *env, jclass jc, jstring path){
    const char *native_path = (*env)->GetStringUTFChars(env, path, NULL);
    int fd = open(native_path, O_RDONLY);
    (*env)->ReleaseStringUTFChars(env, path, native_path);
    return fd;
}


JNIEXPORT jint JNICALL Java_com_test_Main_read
  (JNIEnv *env, jclass jc, jint fd, jbyteArray arr){
    size_t java_array_size = (size_t) (*env)->GetArrayLength(env, arr);
    void *buf = (*env)->GetPrimitiveArrayCritical(env, arr, NULL);
    ssize_t bytes_read = read(fd, buf, java_array_size);
    (*env)->ReleasePrimitiveArrayCritical(env, arr, buf, 0);
    return (jint) bytes_read;
}

JNIEXPORT void JNICALL Java_com_test_Main_close
  (JNIEnv *env, jclass jc, jint fd){
    close(fd);
}

RESULT = 1179852225

Running this in a loop takes 1179852225 ns on average, which is almost twice as fast.

Question: What's the actual problem with reading from a regular file within critical section?


Solution

  • A 2MB buffer with FileInputStream is probably not the best choice. See this question for details. Although that was on Windows, I've seen a similar performance issue on Linux. Depending on the OS, allocating a large temporary buffer may result in extra mmap calls and subsequent page faults. Such a large buffer also makes the L1/L2 caches useless.

    Reading from a regular file is blocked only briefly and does not depend on any java thread.

    This is not always true. In your benchmark the file is apparently cached in the OS page cache, so no device I/O happens. Accessing real hardware (especially a spinning disk) can be orders of magnitude slower. The worst-case disk I/O time is not fully predictable; it can be as large as hundreds of milliseconds, depending on the hardware's condition, the length of the I/O queue, the scheduling policy and so on.

    The problem with a JNI critical section is that whenever such a delay happens, it may affect all threads, not only the one doing the I/O. This is not an issue for a single-threaded application, but it may cause undesirable stop-the-world pauses in a multi-threaded app.

    Another reason to avoid JNI critical sections is JVM bugs related to the GCLocker. Sometimes they cause redundant GC cycles or make the JVM ignore certain GC flags. Here are some examples (still not fixed):

    • JDK-8048556 Unnecessary GCLocker-initiated young GCs
    • JDK-8057573 CMSScavengeBeforeRemark ignored if GCLocker is active
    • JDK-8057586 Explicit GC ignored if GCLocker is active

    So, the question is whether you care about throughput or latency. If you only need higher throughput, JNI critical is probably the right way to go. However, if you also care about predictable latency (not the average latency, but, say, the 99.9th percentile) then JNI critical does not seem like a good choice.
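To see the buffer-size effect from the first point for yourself, a quick (and admittedly unscientific) comparison can be sketched as below; the stand-in file size and the two buffer sizes are arbitrary placeholders, and real numbers will vary by OS, hardware and page-cache state:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferSizeCompare {

    // Times one full sequential read of the file with the given buffer size.
    static long timeRead(Path path, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long start = System.nanoTime();
        try (InputStream is = new FileInputStream(path.toFile())) {
            while (is.read(buffer) > 0) { /* just drain the file */ }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        // Small stand-in file; use your real log file for meaningful numbers.
        Path tmp = Files.createTempFile("log_test", ".log");
        Files.write(tmp, new byte[8 * 1024 * 1024]);
        // A 64KB buffer often performs as well as (or better than) 2MB,
        // since it fits in the L2 cache.
        for (int size : new int[]{64 * 1024, 2 * 1024 * 1024}) {
            System.out.println(size + " bytes: " + timeRead(tmp, size) + " ns");
        }
        Files.delete(tmp);
    }
}
```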