Tags: java, c++, real-time, compiler-optimization, microbenchmark

When comparing Java with C++ for speed should I compile the C++ code with -O3 or -O2?


I am writing a variety of equivalent programs in Java and C++ to compare the two languages for speed. Those programs employ heavy mathematical computations in a loop.

Interestingly enough I find that C++ beats Java when I use -O3. When I use -O2 Java beats C++.

Which g++ optimization level should I use to draw a conclusion from my comparisons?

I know the answer is not as simple as it sounds, but I would like some insight into latency/speed comparisons between Java and C++.


Solution

  • Interestingly enough I find that C++ beats Java when I use -O3. When I use -O2 Java beats C++.

    -O3 will certainly beat -O2 in microbenchmarks, but when you benchmark a more realistic application (such as a FIX engine), you will see that -O2 beats -O3 in terms of performance.

    As far as I know, -O3 does a very good job compiling small, math-heavy pieces of code, but for larger, more realistic applications it can actually be slower than -O2. By trying to optimize everything aggressively (inlining, vectorization, and so on), the compiler produces huge binaries that lead to CPU cache misses, especially instruction cache misses. That's one of the reasons the HotSpot JIT chooses not to optimize big methods and/or non-hot methods.
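
    If you want to see those decisions in practice, a HotSpot JVM can log them with the standard -XX:+PrintCompilation flag. The class below is only an illustration (the class name, method names and loop counts are mine, not from your code): the small, frequently called method should show up in the compilation log, while the method called only once normally will not.

    // Illustration only: compile with `javac JitLogDemo.java`, then run with
    //   java -XX:+PrintCompilation JitLogDemo
    // to watch which methods HotSpot decides to compile.
    public class JitLogDemo {

        // Small and called very often: a typical JIT compilation candidate.
        static long hotMath(int i) {
            return (long) (i % 4) * (i % 8) - (long) (i % 16) * (i % 32);
        }

        // Called only once: HotSpot will usually leave it interpreted.
        static long coldSetup(int n) {
            long sum = 0;
            for (int i = 0; i < n; i++) {
                sum += i;
            }
            return sum;
        }

        public static void main(String[] args) {
            long x = coldSetup(1_000);
            for (int i = 0; i < 1_000_000; i++) {
                x += hotMath(i);
            }
            // Print the result so the work cannot be optimized away entirely.
            System.out.println(x);
        }
    }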

    One important thing to notice is that the JIT treats each method as an independent unit eligible for optimization. In your previous questions, you posted the following code:

    int iterations = stoi(argv[1]);
    int load = stoi(argv[2]);

    long long x = 0;

    for (int i = 0; i < iterations; i++) {
        long start = get_nano_ts(); // START clock

        for (int j = 0; j < load; j++) {
            if (i % 4 == 0) {
                x += (i % 4) * (i % 8);
            } else {
                x -= (i % 16) * (i % 32);
            }
        }

        long end = get_nano_ts(); // STOP clock

        // (omitted for clarity)
    }

    cout << "My result: " << x << endl;

    But this code is JIT-unfriendly because the hot block of code is not in its own method. For major JIT gains, you should have placed the block of code inside the loop in its own method. Right now your method executes a hot block of code instead of being a hot method. The method that contains the for loop is probably called only once, so the JIT will not do anything about it.
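
    Here is a rough sketch of how the Java version could be restructured so that the hot block lives in its own small method. The name crunch and the use of System.nanoTime() are my own choices for illustration, not your original code:

    public class Benchmark {

        // Hot, small method: an independent unit the JIT can aggressively optimize.
        static long crunch(int i, int load) {
            long x = 0;
            for (int j = 0; j < load; j++) {
                if (i % 4 == 0) {
                    x += (i % 4) * (i % 8);
                } else {
                    x -= (i % 16) * (i % 32);
                }
            }
            return x;
        }

        public static void main(String[] args) {
            int iterations = Integer.parseInt(args[0]);
            int load = Integer.parseInt(args[1]);

            long x = 0;
            for (int i = 0; i < iterations; i++) {
                long start = System.nanoTime(); // START clock
                x += crunch(i, load);
                long end = System.nanoTime();   // STOP clock
                // (per-iteration latency: end - start)
            }
            System.out.println("My result: " + x);
        }
    }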

    When comparing Java with C++ for speed should I compile the C++ code with -O3 or -O2?

    Well, if you use -O3 for microbenchmarks you will get amazingly fast results that are unrealistic for larger and more complex applications. That's why I think the judges use -O2 instead of -O3. For example, our garbage-free Java FIX engine is faster than C++ FIX engines, and I have no idea whether they are compiled with -O0, -O1, -O2, -O3 or a mix of them through executable linking.

    In theory, it is possible to selectively compartmentalize an entire C++ application into separately compiled pieces, choose which ones are compiled with -O2 and which ones with -O3, and then link everything into an ideal binary executable. But in reality, how feasible is that?

    The approach HotSpot takes is much simpler. It says:

    Listen, I am going to treat each method as an independent unit of execution, rather than any arbitrary block of code. If that method is hot enough (i.e. called often enough) and small enough, I will try to aggressively optimize it.

    That of course has the drawback of requiring code warmup, but it is much simpler and produces the best results most of the time for realistic/large/complex applications.
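
    As a minimal illustration of that warmup requirement (again just a sketch; the hot method and the iteration counts are arbitrary), you call the hot path enough times for HotSpot to profile and compile it before you start trusting the measurements:

    public class WarmupDemo {

        // Arbitrary small hot method, used only to illustrate warmup.
        static long hot(int i) {
            return (long) (i % 16) * (i % 32);
        }

        public static void main(String[] args) {
            long sink = 0;

            // Warmup phase: give HotSpot enough calls to profile and compile hot().
            for (int i = 0; i < 100_000; i++) {
                sink += hot(i);
            }

            // Measured phase: by now hot() should be running as compiled code.
            long start = System.nanoTime();
            for (int i = 0; i < 1_000_000; i++) {
                sink += hot(i);
            }
            long end = System.nanoTime();

            System.out.println("elapsed ns: " + (end - start) + " (sink=" + sink + ")");
        }
    }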

    And last but not least, you should probably consider this question if you want to compile your entire application with -O3: When can I confidently compile program with -O3?