JNA Fortran performance tuning

I'm wrapping native code (mostly Fortran 77) using JNA. The output (i.e. the results) of the native function consists of a bunch of nested custom types/structs, which I map to corresponding Structure classes in JNA. These Structures mostly consist of arrays of other Structures (so Structure A holds an array of Structure B, Structure B holds an array of Structure C, etc.).

Using some benchmarking (mainly by logging time differences), I've found that most of the time is spent not in the native code but in JNA's mapping. The Fortran subroutine call takes about 50 ms, but the total time is 250 ms.

I've found that

.setAutoWrite(false) on our Structure reduces overhead by roughly a factor of 2 (total execution time almost halves)
Keeping (statically allocated) arrays as small as possible helps keep JNA overhead low
Changing DOUBLE PRECISION (double) to REAL (float) does not seem to make any difference

Are there any further tricks to optimize JNA performance in our case? I know I could flatten my structures down to a 1D array of primitives and use direct mapping, but I'm trying to avoid that (because it will be a pain to encode/decode these structures).

Solution

As noted in the JNA FAQ, direct mapping would be your best performance increase, but you've excluded that as an option. It also notes that the calling overhead for each native call is another performance hit, which you've partially addressed by changing setAutoWrite().

You also mentioned flattening your structures to an array of primitives, but rejected that due to encoding/decoding complexity. However, moving in this direction is probably the next best choice, and it's possible that the biggest performance issue you're currently facing is a combination of JNA's Structure access using reflection and native reads. Oracle notes:

Because reflection involves types that are dynamically resolved, certain Java virtual machine optimizations can not be performed. Consequently, reflective operations have slower performance than their non-reflective counterparts, and should be avoided in sections of code which are called frequently in performance-sensitive applications.

Since you are here asking a performance-related question and using JNA Structures, I can only assume you're writing a "performance-sensitive application". Internally, the Structure does this:

for (StructField structField : fields().values()) {
    readField(structField);
}

This does a single native read for each field, followed by this call, which ends up using reflection under the hood:

setFieldValue(structField.field, result, true);

The moral of the story is that with Structures, each field generally involves a native read plus a reflection write, or a reflection read plus a native write.
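To make that per-field cost concrete, here is a simplified stand-alone model of the pattern. It is not JNA's actual code: the Sample class, the ReflectiveSync helper, and the use of a ByteBuffer in place of native memory are all assumptions made purely for illustration.

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical two-field record, used only for this illustration. */
class Sample {
    public double x;
    public double y;
}

final class ReflectiveSync {
    // Explicit field order: reflection does not guarantee declaration order.
    static final String[] FIELD_ORDER = {"x", "y"};

    /** Copies consecutive doubles from buf into target, one reflective write per field. */
    static void readAll(Sample target, ByteBuffer buf) {
        int offset = 0;
        try {
            for (String name : FIELD_ORDER) {
                Field field = Sample.class.getField(name); // reflective lookup
                double value = buf.getDouble(offset);       // stands in for the per-field native read
                field.setDouble(target, value);             // reflective write
                offset += Double.BYTES;
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The ByteBuffer plays the role of the native memory behind a structure.
        ByteBuffer buf = ByteBuffer.allocate(2 * Double.BYTES).order(ByteOrder.nativeOrder());
        buf.putDouble(0, 1.5).putDouble(Double.BYTES, -2.0);
        Sample s = new Sample();
        readAll(s, buf);
        System.out.println(s.x + " " + s.y);
    }
}
```

With nested arrays of structures, this loop effectively runs for every field of every element, which is consistent with your observation that smaller arrays mean less overhead.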

The first step you can take without making any other changes is to call setAutoSynch(false) on the structure. (You've already done half of this with the "write" version; this disables both automatic reads and writes.) From the docs:

For extremely large or complex structures where you only need to access a small number of fields, you may see a significant performance benefit by avoiding automatic structure reads and writes. If auto-read and -write are disabled, it is up to you to ensure that the Java fields of interest are synched before and after native function calls via readField(String) and writeField(String,Object). This is typically most effective when a native call populates a large structure and you only need a few fields out of it. After the native call you can call readField(String) on only the fields of interest.

To really go all out, flattening may help a little more by getting rid of the reflection overhead. The trick is making the offset conversions easy.

Some directions to go, balancing complexity vs. performance:

To write to native memory, allocate and clear a buffer of bytes (mem = new Memory(size); mem.clear(); or just new byte[size]), and write specific fields to the byte offset you determine using the value from Structure.fieldOffset(name). Note that fieldOffset is a protected method, so you would call it from inside your Structure subclass. This does use reflection, but you could do this once for each structure and store a map of name to offset for later use.
For reading from native memory, make all your native read calls using a flat buffer to reduce the native overhead to a single read/write. You can then map that buffer onto a Structure when you read it (incurring reflection once for each field) or read specific byte offsets per the above strategy.
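As a rough sketch of the "map of name to offset" idea, the following uses only java.nio rather than JNA. The FlatRecords class, its field names (count, pressure, temperature), and the record layout with 4 bytes of padding are hypothetical; in your case the offsets would come from your real structures (for example, collected once via fieldOffset) and the layout must match what the Fortran side actually writes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical flat view over an array of records: one int followed by two doubles. */
final class FlatRecords {
    // Offsets are computed once and cached, so no reflection happens per access.
    private static final Map<String, Integer> OFFSETS = new HashMap<>();
    static final int RECORD_SIZE;
    static {
        int off = 0;
        OFFSETS.put("count", off);       off += Integer.BYTES;
        off += 4; // assumed padding so the doubles are 8-byte aligned
        OFFSETS.put("pressure", off);    off += Double.BYTES;
        OFFSETS.put("temperature", off); off += Double.BYTES;
        RECORD_SIZE = off;
    }

    private final ByteBuffer buf;

    FlatRecords(int recordCount) {
        // Direct buffers are zero-filled and use off-heap memory; native byte order
        // matches what native code on the same machine would write.
        buf = ByteBuffer.allocateDirect(recordCount * RECORD_SIZE).order(ByteOrder.nativeOrder());
    }

    // Byte position of a field in the record at the given index.
    // Throws NullPointerException for an unknown field name.
    private int at(int index, String field) {
        return index * RECORD_SIZE + OFFSETS.get(field);
    }

    int getInt(int index, String field)                { return buf.getInt(at(index, field)); }
    void putInt(int index, String field, int v)        { buf.putInt(at(index, field), v); }
    double getDouble(int index, String field)          { return buf.getDouble(at(index, field)); }
    void putDouble(int index, String field, double v)  { buf.putDouble(at(index, field), v); }

    /** The single buffer that would be handed to the native call. */
    ByteBuffer buffer() { return buf; }
}
```

Usage would look like new FlatRecords(n), one native call that fills buffer(), and then getDouble(i, "temperature") only for the values you need. Nested arrays can be handled the same way by adding the inner element's stride to the offset computation, which keeps the encode/decode logic in one place.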