I wrote a small benchmark in which the program creates 10^8 two-dimensional vector structures of {float, float} (stored in a std::vector), and then sums up the squares of their lengths.
Here is the C++ code:
#include <iostream>
#include <chrono>
#include <vector>
#include <array>
#include <cmath>
using namespace std;
using namespace std::chrono;
const int COUNT = pow(10, 8);
class Vec {
public:
    float x, y;

    Vec() {}
    Vec(float x, float y) : x(x), y(y) {}

    float len() {
        return x * x + y * y;
    }
};

int main() {
    vector<Vec> vecs;
    for (int i = 0; i < COUNT; ++i) {
        vecs.emplace_back(i / 3, i / 5);
    }

    auto start = high_resolution_clock::now();

    // This loop is timed
    float sum = 0;
    for (int i = 0; i < COUNT; ++i) {
        sum += vecs[i].len();
    }

    auto stop = high_resolution_clock::now();

    cout << "finished in " << duration_cast<milliseconds>(stop - start).count()
         << " milliseconds" << endl;
    cout << "result: " << sum << endl;
    return 0;
}
I compiled it with this makefile (g++ version 7.5.0):
build:
	g++ -std=c++17 -O3 main.cpp -o program #-ffast-math

run: build
	clear
	./program
Here is my Java code:
public class MainClass {
    static final int COUNT = (int) Math.pow(10, 8);

    static class Vec {
        float x, y;

        Vec(float x, float y) {
            this.x = x;
            this.y = y;
        }

        float len() {
            return x * x + y * y;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Vec[] vecs = new Vec[COUNT];
        for (int i = 0; i < COUNT; ++i) {
            vecs[i] = new Vec(i / 3, i / 5);
        }

        long start = System.nanoTime();

        // This loop is timed
        float sum = 0;
        for (int i = 0; i < COUNT; ++i) {
            sum += vecs[i].len();
        }

        long duration = System.nanoTime() - start;
        System.out.println("finished in " + duration / 1000000 + " milliseconds");
        System.out.println("result: " + sum);
    }
}
This was compiled and run using Java 11.0.4.
And here are the results (the average of a few runs, run on Ubuntu 18.04, 64-bit):
C++: 262 ms
Java: 230 ms
Here are a few things I have tried in order to make the C++ code faster:
- using std::array instead of std::vector
- iterating over the std::vector with a range-based for loop

However, none of the above resulted in any improvement.
I have noticed a few interesting things:
- When timing the whole main() function (allocation + computation), C++ is much better. However, this might be due to the warm-up time of the JVM.
- -ffast-math makes the C++ program a few times faster than Java, however the result of the computation is slightly different. Furthermore, I read in a few posts that using this flag is unsafe.

Can I somehow modify my C++ code and make it as fast as or faster than Java in this case?
Try this:
// requires #include <numeric> and #include <execution>
float sum = std::transform_reduce(
    std::execution::par_unseq,
    begin(vecs), end(vecs),
    0.f,
    std::plus<>{},
    [](auto&& x) {
        return x.len();
    }
);
This explicitly tells the C++ compiler what you are doing: that you are OK with using extra threads, that each loop iteration doesn't depend on the others, and that you want to do the work in floats.
It does mean that the addition may occur out of order compared to what you asked for, so the output value may not be exactly the same.
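As a toy illustration of why reordering changes the output (my example, not from the original answer): float addition isn't associative, because every intermediate sum is rounded.

#include <iostream>

int main() {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    // a + b cancels exactly, so adding c afterwards keeps the 1.
    std::cout << ((a + b) + c) << "\n"; // prints 1
    // b + c rounds back to -1e8f (1 is below float's precision near 1e8),
    // so the 1 is lost entirely.
    std::cout << (a + (b + c)) << "\n"; // prints 0
}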
Live example with the raw loop on one side, and permission to out-of-order add on the other.
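One practical note (not from the original answer): libstdc++ only ships these parallel algorithms from GCC 9 onward, and they are implemented on top of Intel TBB, so the g++ 7.5.0 from the question won't build this as-is. A sketch of the build rule under those assumptions, extending the makefile above:

build:
	g++ -std=c++17 -O3 main.cpp -ltbb -o program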
Further investigation:
So I spun up a godbolt.
In it, I compared this with and without forced vectorization and -ffast-math. Forcing vectorization and -ffast-math resulted in the same assembly code.
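The original post doesn't show how vectorization was forced; one common way to do it (an assumption on my part, not necessarily what was used) is OpenMP's simd pragma, compiled with -fopenmp-simd:

// The reduction clause explicitly permits reassociating the partial sums,
// which is exactly the permission the vectorizer needs.
#pragma omp simd reduction(+:sum)
for (int i = 0; i < COUNT; ++i) {
    sum += vecs[i].len();
}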
The issue is the accumulator. Adding things one at a time into the sum, with all of the IEEE rounding that entails, gives you a different value than accumulating them N at a time in higher-precision floating-point values and then storing the result back into the float in bulk.
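A scalar sketch of that reassociation (my illustration of the idea, not the actual generated code; it assumes COUNT is divisible by 4, which holds for 10^8):

// Four independent accumulators stand in for four SIMD lanes.
float partial[4] = {0.f, 0.f, 0.f, 0.f};
for (int i = 0; i < COUNT; i += 4) {
    partial[0] += vecs[i + 0].len();
    partial[1] += vecs[i + 1].len();
    partial[2] += vecs[i + 2].len();
    partial[3] += vecs[i + 3].len();
}
// Combining at the end groups the additions differently than the serial loop,
// so the rounded result can differ.
float sum = (partial[0] + partial[1]) + (partial[2] + partial[3]);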
If you use -ffast-math, you'll get 2x the speed and a different accumulation. If you replace the float sum with a double sum, you'll get the same answer as with -ffast-math and vectorization on.
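That change is a single line in the timed loop (a sketch):

double sum = 0; // each addition now rounds to double, not float
for (int i = 0; i < COUNT; ++i) {
    sum += vecs[i].len();
}
// narrow back to float only once, at the end
cout << "result: " << (float)sum << endl;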
Basically, the clang vectorizer doesn't see an easy way to vectorize the accumulation of the sums without breaking the exact float-precision floating-point requirements.