Tags: java, c, performance, jvm, benchmarking

Why is this JOML (JVM) code so much faster than the equivalent GSL (C)?


I am attempting to optimize a small library for doing arithmetic on vectors.

To roughly check my progress, I decided to benchmark the performance of two popular vector arithmetic libraries written in two different languages, the GNU Scientific Library (GSL, C), and the Java OpenGL Math Library (JOML, JVM). I expected GSL, as a large project written in C and compiled ahead of time, to be significantly faster than JOML, with extra baggage from object management, method calls, and conforming to the Java specifications.

Surprisingly, JOML (JVM) instead ended up being around 4X faster than GSL (C). I would like to understand why this is the case.

The benchmark I performed was to compute 4,000,000 iterations of Leibniz's formula for Pi, in chunks of 4 at a time via 4-dimensional vectors. The exact algorithm doesn't matter and doesn't have to make sense; it was just the first and simplest thing I thought of that would let me use multiple vector operations per iteration.
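Concretely, the series being chunked is Leibniz's formula, with each vector iteration i covering terms 4i through 4i+3:

```latex
\frac{\pi}{4} \;=\; \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}
\;=\; \sum_{i=0}^{\infty} \left( \frac{1}{8i+1} - \frac{1}{8i+3} + \frac{1}{8i+5} - \frac{1}{8i+7} \right)
```

which is why both implementations below start from the vector (1, 3, 5, 7), add 8i each iteration, take reciprocals, and apply the alternating weights (1, -1, 1, -1).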

This is the C code in question:

#include <stdio.h>
#include <time.h>
#include <gsl/gsl_vector.h>
#include <unistd.h>
#include <math.h>
#include <string.h>

#define IT 1000000

double pibench_inplace(int it) {
    gsl_vector* d = gsl_vector_calloc(4);
    gsl_vector* w = gsl_vector_calloc(4);
    for (int i=0; i<4; i++) {
        gsl_vector_set(d, i, (double)i*2+1);
        gsl_vector_set(w, i, (i%2==0) ? 1 : -1);
    }
    gsl_vector* b = gsl_vector_calloc(4);
    double pi = 0.0;
    for (int i=0; i<it; i++) {
        gsl_vector_memcpy(b, d);
        gsl_vector_add_constant(b, (double)i*8);
        for (int j=0; j<4; j++) {
            gsl_vector_set(b, j, pow(gsl_vector_get(b, j), -1.));
        }
        gsl_vector_mul(b, w);
        pi += gsl_vector_sum(b);
    }
    return pi*4;
}

double pibench_fast(int it) {
    double pi = 0;
    int eq_it = it * 4;
    for (int i=0; i<eq_it; i++) {
        pi += (1 / ((double)i * 2 + 1) * ((i%2==0) ? 1 : -1));
    }
    return pi*4;
}

int main(int argc, char* argv[]) {
    if (argc < 2) {
        printf("Please specify a run mode.\n");
        return 1;
    }
    double pi;
    struct timespec start = {0,0}, end={0,0};
    clock_gettime(CLOCK_MONOTONIC, &start);
    if (strcmp(argv[1], "inplace") == 0) {
        pi = pibench_inplace(IT);
    } else if (strcmp(argv[1], "fast") == 0) {
        pi = pibench_fast(IT);
    } else {
        sleep(1);
        printf("Please specify a valid run mode.\n");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("Pi: %f\n", pi);
    printf("Time: %f\n", ((double)end.tv_sec + 1.0e-9*end.tv_nsec) - ((double)start.tv_sec + 1.0e-9*start.tv_nsec));
    return 0;
}

This is how I built and ran the C code:

$ gcc GSL_pi.c -O3 -march=native -static $(gsl-config --cflags --libs) -o GSL_pi && ./GSL_pi inplace

Pi: 3.141592
Time: 0.061561

This is the JVM-platform code in question (written in Kotlin):

package joml_pi

import org.joml.Vector4d
import kotlin.time.measureTimedValue
import kotlin.time.DurationUnit


fun pibench(count: Int=1000000): Double {
    val d = Vector4d(1.0, 3.0, 5.0, 7.0)
    val w = Vector4d(1.0, -1.0, 1.0, -1.0)
    val c = Vector4d(1.0, 1.0, 1.0, 1.0)
    val scratchpad = Vector4d()
    var pi = 0.0
    for (i in 0..count) {
        scratchpad.set(i*8.0)
        scratchpad.add(d)
        c.div(scratchpad, scratchpad)
        scratchpad.mul(w)
        pi += scratchpad.x + scratchpad.y + scratchpad.z + scratchpad.w
    }
    return pi * 4.0
}

@kotlin.time.ExperimentalTime
fun <T> benchmark(func: () -> T, name: String="", count: Int=5) {
    val times = mutableListOf<Double>()
    val results = mutableListOf<T>()
    for (i in 0..count) {
        val result = measureTimedValue<T>( { func() } )
        results.add(result.value)
        times.add(result.duration.toDouble(DurationUnit.SECONDS))
    }
    println(listOf<String>(
            "",
            name,
            "Results:",
            results.joinToString(", "),
            "Times:",
            times.joinToString(", ")
    ).joinToString("\n"))
}

@kotlin.time.ExperimentalTime
fun main(args: Array<String>) {
    benchmark<Double>(::pibench, "pibench")
}

This is how I built and ran the JVM-platform code:

$ kotlinc -classpath joml-1.10.5.jar JOML_pi.kt && kotlin -classpath joml-1.10.5.jar:. joml_pi/JOML_piKt.class

pibench
Results:
3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464, 3.1415924035900464
Times:
0.026850784, 0.014998012, 0.013095291, 0.012805373, 0.012977388, 0.012948186

There are multiple possibilities I have considered for why this operation runs several times faster in the JVM than in the equivalent C code. I do not find any of them particularly compelling:

  • I'm doing different iteration counts by an order of magnitude in the two languages. — It's possible I'm grossly misreading the code, but I'm pretty sure this isn't the case.
  • I've fudged up the algorithm and am doing vastly different things in each case. — Again maybe I've misread it, but I don't think that's happening, and both cases do produce numerically correct results.
  • The timing mechanism I use for C introduces a lot of overhead. — I also tested simpler and no-op functions. They completed and were measured as expected in much less time.
  • The JVM code is parallelized across multiple processor cores — With many more iterations, I watched my CPU use over a longer period and it did not exceed one core.
  • The JVM code makes better use of SIMD/vectorization. — I compiled the C with -O3 and -march=native, statically linking against libraries from Debian packages. In another case I even tried the -floop/-ftree parallelization flags. Either way performance didn't really change.
  • GSL has extra features that add overhead in this particular test. — I also have another version, with the vector class implemented and used through Cython, that does only the basics (iterating over a pointer), and performs roughly equivalently to GSL (with slightly more overhead, as expected). So that seems to be the limit for native code.
  • JOML is actually using native code. — The README says it makes no JNI calls, I'm importing directly from a multi-platform .jar file that I've checked and contains only .class files, and the JNI adds ~20 Java ops of overhead to every call so even if it had magical native code that shouldn't help anyway at such a granular level.
  • The JVM has different specifics for floating point arithmetic. — The JOML class I used accepts and returns "doubles" just as the C code. In any case, having to emulate a specification that deviates from hardware capabilities probably shouldn't improve performance like this.
  • The exponential reciprocal step in my GSL code is less efficient than the division reciprocal step in my JOML code. — While commenting that out does reduce total execution time by around 25% (to ~0.045s), that still leaves a massive 3X gap with the JVM code (~0.015s).

The only remaining explanation I can think of is that most of the time spent in C is overhead from function calls. That would be consistent with the C and Cython implementations performing similarly. The performance advantage of the Java/Kotlin/JVM implementation would then come from its JIT optimizing away the function calls by effectively inlining everything in the loop. However, given the reputation of JIT compilers as being at best only slightly faster than native code under favourable conditions, that still seems like a massive speedup to get just from having a JIT.
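One way I could probe this hypothesis (a sketch of my own, not part of the benchmark above; `pibench_array` is a hypothetical name) would be to write the same computation over a plain array, so that every operation is visible to the compiler at the call site with nothing hidden behind a library call:

```c
/* Same chunked Leibniz computation as pibench_inplace, but with the
   vector operations written out over plain doubles. There are no
   opaque calls here, so the compiler is free to unroll and vectorize
   the inner loop. If the GSL version is slow mainly because of call
   overhead, this version should land close to pibench_fast. */
double pibench_array(int it) {
    const double d[4] = { 1.0, 3.0, 5.0, 7.0 };   /* odd denominators */
    const double w[4] = { 1.0, -1.0, 1.0, -1.0 }; /* alternating signs */
    double pi = 0.0;
    for (int i = 0; i < it; i++) {
        for (int j = 0; j < 4; j++)
            pi += w[j] / (d[j] + (double)i * 8.0);
    }
    return pi * 4.0;
}
```

Dropping this into the `main()` dispatcher above as a third run mode and timing it the same way would show whether the gap closes once the calls are gone.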

I suppose if that is the case, then a follow-up question would be whether I could realistically or reliably expect these performance characteristics to carry over outside of a synthetic toy benchmark, in applications that may have much more scattered numeric calls rather than a single million-iteration loop.


Solution

  • First, a disclaimer: I am the author of JOML.

    Now, you are probably not comparing apples with apples here. GSL is a general-purpose linear algebra library supporting many different linear algebra algorithms and data structures.

    JOML, on the other hand, is not a general-purpose linear algebra library, but a special-purpose library covering only the use cases of computer graphics, so it contains only concrete classes for 2-, 3- and 4-dimensional vectors and for 2x2, 3x3 and 4x4 (and non-square variant) matrices. In other words, even if you wanted to allocate a 5-dimensional vector, you couldn't with JOML.

    Therefore, all the algorithms and data structures in JOML are designed around classes with explicit x, y, z and w fields, without any loops. So a 4-dimensional vector add is literally just:

    dest.x = this.x + v.x;
    dest.y = this.y + v.y;
    dest.z = this.z + v.z;
    dest.w = this.w + v.w;
    

    And there isn't even any SIMD involved in that, because as of now there is no JVM JIT that can auto-vectorize over different fields of a class. So a vector add (or multiply, or any other lane-wise operation) will currently produce exactly these scalar operations.

    Next, you say:

    JOML is actually using native code. — The README says it makes no JNI calls, I'm importing directly from a multi-platform .jar file that I've checked and contains only .class files, and the JNI adds ~20 Java ops of overhead to every call so even if it had magical native code that shouldn't help anyway at such a granular level.

    JOML itself does not define or use native code via the JNI interface. Of course, the operators and JRE methods that JOML uses internally will get compiled to native code, but not via JNI. Rather, methods such as Math.fma() get intrinsified directly into their machine-code equivalents at JIT compilation time.

    Now, as pointed out by others in the comments to your question: you are using a linked library (as opposed to a headers-only library like GLM, which would probably be a better fit for your C/C++ code). A C/C++ compiler therefore cannot "see through" your call site into the callee and apply optimizations there based on the static information available at the call site (such as you calling gsl_vector_calloc with the argument 4). So every runtime check/branch on the argument that GSL needs to do will still have to happen at runtime. This is quite different from using a headers-only library like GLM, where any half-decent C/C++ compiler will certainly optimize all of that away based on static knowledge of your calls/code. And I would assume that, yes, an equivalent C/C++ program would then beat a Java/Scala/Kotlin/JVM program in speed.
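    To make the headers-only point concrete, here is a sketch (my own illustration, not GLM's actual API) of what a fixed-size, inline 4D vector type looks like in C. Because every function body is visible at the call site, the compiler can inline, unroll and constant-fold the whole loop body, much as the JIT does with JOML's field-based classes:

```c
/* Hypothetical headers-only 4D vector in the spirit of GLM: a
   fixed-size struct and static inline operations. Nothing is hidden
   behind a library boundary, so the optimizer sees everything. */
typedef struct { double x, y, z, w; } vec4;

/* Add a scalar to every component. */
static inline vec4 vec4_add_scalar(vec4 a, double s) {
    return (vec4){ a.x + s, a.y + s, a.z + s, a.w + s };
}

/* Component-wise reciprocal. */
static inline vec4 vec4_recip(vec4 a) {
    return (vec4){ 1.0 / a.x, 1.0 / a.y, 1.0 / a.z, 1.0 / a.w };
}

/* Component-wise multiply. */
static inline vec4 vec4_mul(vec4 a, vec4 b) {
    return (vec4){ a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w };
}

/* Horizontal sum of all components. */
static inline double vec4_sum(vec4 a) {
    return a.x + a.y + a.z + a.w;
}

/* The question's benchmark, expressed with the inline vector type. */
double pibench_headeronly(int it) {
    const vec4 d = { 1.0, 3.0, 5.0, 7.0 };
    const vec4 w = { 1.0, -1.0, 1.0, -1.0 };
    double pi = 0.0;
    for (int i = 0; i < it; i++)
        pi += vec4_sum(vec4_mul(vec4_recip(
                  vec4_add_scalar(d, (double)i * 8.0)), w));
    return pi * 4.0;
}
```

    Compiled with -O3, a loop like this should behave much more like the question's pibench_fast than like the GSL version, precisely because no call crosses a library boundary.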

    So, your comparison of GSL and JOML is somewhat like comparing the performance of Microsoft Excel evaluating a cell containing = 1 + 2 with writing C code that effectively outputs printf("%f\n", 1.0 + 2.0);. The former (Microsoft Excel, here standing in for GSL) is much more general and versatile, while the latter (JOML) is highly specialized.

    It just so happens that this specialization fits your exact use case, which is what makes it possible to use JOML for it at all.