I wrote a small benchmark in which the program creates 10^8 two-dimensional vector structures of {float, float} (stored in a std::vector), and then sums up the squares of their lengths.
Here is the C++ code:
#include <iostream>
#include <chrono>
#include <vector>
#include <array>
#include <cmath>
using namespace std;
using namespace std::chrono;
const int COUNT = pow(10, 8);
class Vec {
public:
    float x, y;

    Vec() {}
    Vec(float x, float y) : x(x), y(y) {}

    float len() {
        return x * x + y * y;
    }
};

int main() {
    vector<Vec> vecs;
    for (int i = 0; i < COUNT; ++i) {
        vecs.emplace_back(i / 3, i / 5);
    }

    auto start = high_resolution_clock::now();

    // This loop is timed
    float sum = 0;
    for (int i = 0; i < COUNT; ++i) {
        sum += vecs[i].len();
    }

    auto stop = high_resolution_clock::now();

    cout << "finished in " << duration_cast<milliseconds>(stop - start).count()
         << " milliseconds" << endl;
    cout << "result: " << sum << endl;
    return 0;
}
I compiled it with this makefile (g++ version 7.5.0):
build:
	g++ -std=c++17 -O3 main.cpp -o program #-ffast-math

run: build
	clear
	./program
Here is my Java code:
public class MainClass {
    static final int COUNT = (int) Math.pow(10, 8);

    static class Vec {
        float x, y;

        Vec(float x, float y) {
            this.x = x;
            this.y = y;
        }

        float len() {
            return x * x + y * y;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Vec[] vecs = new Vec[COUNT];
        for (int i = 0; i < COUNT; ++i) {
            vecs[i] = new Vec(i / 3, i / 5);
        }

        long start = System.nanoTime();

        // This loop is timed
        float sum = 0;
        for (int i = 0; i < COUNT; ++i) {
            sum += vecs[i].len();
        }

        long duration = System.nanoTime() - start;
        System.out.println("finished in " + duration / 1000000 + " milliseconds");
        System.out.println("result: " + sum);
    }
}
This was compiled and run using Java 11.0.4.
And here are the results (the average of a few runs, run on Ubuntu 18.04, 64-bit):
C++: 262 ms
Java: 230 ms
Here are a few things I have tried in order to make the C++ code faster:
- using std::array instead of std::vector
- iterating over the std::vector with a range-based for loop

However, none of the above resulted in any improvement.
I have noticed a few interesting things:
- When timing the whole main() function (allocation + computation), C++ is much better. However, this might be due to the warm-up time of the JVM.
- -ffast-math makes the C++ program a few times faster than Java, however the result of the computation is slightly different. Furthermore, I read in a few posts that using this flag is unsafe.

Can I somehow modify my C++ code and make it as fast as or faster than Java in this case?
Try this:
// requires #include <numeric> and #include <execution>
float sum = std::transform_reduce(
    std::execution::par_unseq,
    begin(vecs), end(vecs),
    0.f,
    std::plus<>{},
    [](auto&& x) {
        return x.len();
    }
);
This explicitly tells the C++ compiler what you are doing: that you are OK with using extra threads, that each loop iteration doesn't depend on the others, and that you want to do the work in floats.
It does mean that the addition may occur out of order compared to what you asked for, so the output value may not be exactly the same.
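As a toy illustration of why reordering changes the output (my example, not from the original answer): float addition isn't associative, because every intermediate sum is rounded.

#include <iostream>

int main() {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    // a + b cancels exactly, so adding c afterwards keeps the 1.
    std::cout << ((a + b) + c) << "\n"; // prints 1
    // b + c rounds back to -1e8f (1 is below float's precision near 1e8),
    // so the 1 is lost entirely.
    std::cout << (a + (b + c)) << "\n"; // prints 0
}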
Live example with the raw loop on one side, and permission to out-of-order add on the other.
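One practical note (not from the original answer): libstdc++ only ships these parallel algorithms from GCC 9 onward, and they are implemented on top of Intel TBB, so the g++ 7.5.0 from the question won't build this as-is. A sketch of the build rule under those assumptions, extending the makefile above:

build:
	g++ -std=c++17 -O3 main.cpp -ltbb -o program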
Further investigation:
So I spun up a godbolt.
In it, I compared this with and without forced vectorization and -ffast-math. Forcing vectorization and -ffast-math resulted in the same assembly code.
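The original post doesn't show how vectorization was forced; one common way to do it (an assumption on my part, not necessarily what was used) is OpenMP's simd pragma, compiled with -fopenmp-simd:

// The reduction clause explicitly permits reassociating the partial sums,
// which is exactly the permission the vectorizer needs.
#pragma omp simd reduction(+:sum)
for (int i = 0; i < COUNT; ++i) {
    sum += vecs[i].len();
}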
The issue is the accumulator. Adding things one at a time into the sum, with all of the IEEE rounding that entails, gives you a different value than accumulating them N at a time in higher-precision floating-point values and then storing the result back into the float in bulk.
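A scalar sketch of that reassociation (my illustration of the idea, not the actual generated code; it assumes COUNT is divisible by 4, which holds for 10^8):

// Four independent accumulators stand in for four SIMD lanes.
float partial[4] = {0.f, 0.f, 0.f, 0.f};
for (int i = 0; i < COUNT; i += 4) {
    partial[0] += vecs[i + 0].len();
    partial[1] += vecs[i + 1].len();
    partial[2] += vecs[i + 2].len();
    partial[3] += vecs[i + 3].len();
}
// Combining at the end groups the additions differently than the serial loop,
// so the rounded result can differ.
float sum = (partial[0] + partial[1]) + (partial[2] + partial[3]);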
If you use -ffast-math, you'll get 2x the speed and a different accumulation. If you replace the float sum with a double sum, you'll get the same answer as with -ffast-math and vectorization on.
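That change is a single line in the timed loop (a sketch):

double sum = 0; // each addition now rounds to double, not float
for (int i = 0; i < COUNT; ++i) {
    sum += vecs[i].len();
}
// narrow back to float only once, at the end
cout << "result: " << (float)sum << endl;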
Basically, the clang vectorizer doesn't see an easy way to vectorize the accumulation of the sums without breaking the exact float-precision floating-point requirements.