Tags: java, performance, nio, mmap, memory-mapped-files

MappedByteBuffer.asFloatBuffer() vs. in-memory float[] performance


Let's say you are doing some computation over a large set of large float vectors, e.g. calculating the average of each:

public static float avg(float[] data, int offset, int length) {
  float sum = 0;
  for (int i = offset; i < offset + length; i++) {
    sum += data[i];
  }
  return sum / length;
}

If you have all your vectors stored in an in-memory float[], you can implement the loop like this:

float[] data; // <-- vectors here
float sum = 0;
for (int i = 0; i < nVectors; i++) {
  sum += avg(data, i * vectorSize, vectorSize);
}

If your vectors are stored in a file instead, memory-mapping it should, in theory, be as fast as the first solution once the OS has cached the whole thing:

RandomAccessFile file; // <-- vectors here
MappedByteBuffer buffer =
    file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4L * data.length);
FloatBuffer floatBuffer = buffer.asFloatBuffer();
buffer.load(); // <-- this forces the OS to cache the file

float[] vector = new float[vectorSize];
float sum = 0;
for (int i = 0; i < nVectors; i++) {
  floatBuffer.get(vector);
  sum += avg(vector, 0, vector.length);
}

However, my tests show that the memory-mapped version is ~5 times slower than the in-memory one. I know that FloatBuffer.get(float[]) is copying memory, and I guess that's the reason for the slowdown. Can it get any faster? Is there a way to avoid any memory copying at all and just get my data from the OS' buffer?

I've uploaded my full benchmark to this gist; if you want to try it, just run:

$ java -Xmx1024m ArrayVsMMap 100 100000 100

Edit:

In the end, the best I have been able to get out of a MappedByteBuffer in this scenario is still ~35% slower than using a regular float[]. The tricks so far, combined in the sketch after the list, are:

  • use the native byte order to avoid conversion: buffer.order(ByteOrder.nativeOrder())
  • wrap the MappedByteBuffer with a FloatBuffer using buffer.asFloatBuffer()
  • use the simple floatBuffer.get(int index) instead of the bulk version; this avoids memory copying.
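
Put together, the three tricks look roughly like this (a minimal sketch reusing file, nVectors and vectorSize from the snippets above; the READ_ONLY map mode is an assumption about the benchmark setup):

MappedByteBuffer buffer = file.getChannel()
    .map(FileChannel.MapMode.READ_ONLY, 0, 4L * nVectors * vectorSize);
buffer.order(ByteOrder.nativeOrder()); // trick 1: must be set before asFloatBuffer()
FloatBuffer floatBuffer = buffer.asFloatBuffer(); // trick 2

float sum = 0;
for (int i = 0; i < nVectors; i++) {
  float vectorSum = 0;
  int base = i * vectorSize;
  for (int j = 0; j < vectorSize; j++) {
    vectorSum += floatBuffer.get(base + j); // trick 3: indexed get, no bulk copy
  }
  sum += vectorSum / vectorSize;
}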

You can see the new benchmark and results at this gist.

A slowdown of 1.35 is much better than one of 5, but it's still far from 1. I'm probably still missing something, or else it's something in the JVM that should be improved.


Solution

  • Your array-based time is ridiculously fast! I get 0.0002 nanoseconds per float. The JVM is probably optimizing the loop away.

    This is the problem:

        void iterate() {
            // calc's result is discarded, so the JIT can prove the loop
            // has no observable effect and eliminate it entirely.
            for (int i = 0; i < nVectors; i++) {
                calc(data, i * vectorSize, vectorSize);
            }
        }
    

    The JVM realizes that calc has no side effects, so iterate doesn't either, and the whole loop can be replaced with a NOP. A simple fix is to accumulate the results from calc and return the total. You also need to do the same with the results of iterate in the timing loop and print the result. That prevents the optimizer from deleting all of your code.
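
    For example, a minimal sketch of the fixed version, assuming calc returns the per-vector result like avg above:

        float iterate() {
            float sum = 0;
            for (int i = 0; i < nVectors; i++) {
                sum += calc(data, i * vectorSize, vectorSize);
            }
            return sum; // the timing loop must consume this, e.g. print it
        }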

    Edit:

    This looks like it is probably just overhead on the Java side: nothing to do with memory mapping itself, just the interface to it. Try the following test, which simply wraps a FloatBuffer around a ByteBuffer around a byte[]:

      private static final class ArrayByteBufferTest extends IterationTest {
        private final FloatBuffer floatBuffer;
        private final int vectorSize;
        private final int nVectors;

        ArrayByteBufferTest(float[] data, int vectorSize, int nVectors) {
          // Copy the float[] into a heap ByteBuffer and view it as floats,
          // so the iteration goes through the same buffer interface.
          ByteBuffer bb = ByteBuffer.wrap(new byte[data.length * 4]);
          for (int i = 0; i < data.length; i++) {
            bb.putFloat(data[i]);
          }
          bb.rewind();
          this.floatBuffer = bb.asFloatBuffer();
          this.vectorSize = vectorSize;
          this.nVectors = nVectors;
        }

        float iterate() {
          float sum = 0;
          floatBuffer.rewind();
          float[] vector = new float[vectorSize];
          for (int i = 0; i < nVectors; i++) {
            floatBuffer.get(vector); // same bulk copy as the mapped version
            sum += calc(vector, 0, vector.length);
          }
          return sum;
        }
      }
    

    Since you're doing so little work on each float (just one add, probably 1 cycle), the cost of reading 4 bytes, assembling a float, and copying it into an array all adds up. I noticed that using fewer, bigger vectors reduces the overhead a bit, at least until the vector gets bigger than the (L1?) cache.
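
    To isolate the copy itself, a further variant worth trying (a sketch, not part of the original benchmark) averages straight over the FloatBuffer with absolute gets; this removes the float[] copy but still pays the per-element buffer access cost:

        static float avgDirect(FloatBuffer buf, int offset, int length) {
            float sum = 0;
            for (int i = offset; i < offset + length; i++) {
                sum += buf.get(i); // absolute get: no intermediate array
            }
            return sum / length;
        }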