I have the following code:
#include <iostream>
#include <vector>
#include <ctime>
using namespace std;
void foo(int n, double* a, double* b, double* c, double* d, double* e, double* f, double* g)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}

int main()
{
    int m = 1001001;
    vector<double> a(m), b(m), c(m), d(m), f(m);
    clock_t start = std::clock();
    for (int i = 0; i < 1000; ++i)
        foo(1000000, &a[0], &b[0], &c[0], &d[0], &d[1], &f[0], &f[1000]);
    double duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    cout << "Finished in " << duration << " seconds [CPU Clock] " << endl;
}
Can you give me a workable example that optimizes it for better performance? Any compiler is fine, such as the Intel C++ compiler or the Visual C++ compiler. Please also suggest a CPU that performs well for this kind of job.
On Apple clang, I tried:
__restrict__
on the arguments to convince the compiler that there was no aliasing.
result: no change
foo()
result: computation time increased from ~3 seconds to ~18 seconds!
#pragma omp parallel for
result: compiler ignored me and stayed with the original solution. ~3 seconds.
-march=native
to allow the CPU's full awesomeness to shine.
result: different assembler output (vectorisation applied), but run time still unchanged at ~3 seconds
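For reference, the __restrict__ and OpenMP attempts might look roughly like the sketch below. The names foo_restrict and foo_omp are made up for illustration. __restrict__ is a GCC/Clang extension (MSVC spells it __restrict), not standard C++. The OpenMP pragma only takes effect when OpenMP is enabled at compile time (-fopenmp on GCC and upstream Clang; Apple clang, as far as I know, needs -Xpreprocessor -fopenmp plus a separately installed libomp). Without that, the pragma is silently skipped and the loop runs serially, which may be one reason the compiler appeared to ignore it.

```cpp
#include <cassert>

// Variant 1: same loop as foo(), with every pointer marked __restrict__.
// Only a is written. d/e and f/g overlap in the benchmark, but they are
// only read, so (as I understand restrict semantics) that overlap is fine.
void foo_restrict(int n,
                  double* __restrict__ a,
                  const double* __restrict__ b,
                  const double* __restrict__ c,
                  const double* __restrict__ d,
                  const double* __restrict__ e,
                  const double* __restrict__ f,
                  const double* __restrict__ g)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}

// Variant 2: same loop, split across threads by OpenMP.
// Each iteration touches only a[i], so iterations are independent.
// If OpenMP is not enabled, the pragma is ignored and this runs serially.
void foo_omp(int n, double* a, const double* b, const double* c,
             const double* d, const double* e,
             const double* f, const double* g)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}
```

Either function can be dropped into main() in place of foo() with the same arguments. Both compute the same values as the original, so any difference in run time comes from code generation or threading alone.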
Initial conclusions:
This problem is bound by memory access and not by the CPU.
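A rough back-of-the-envelope check of that conclusion, as a sketch: each call to foo() reads five distinct arrays (a, b, c, d, f; e and g are just offsets into d and f) and writes one (a). The helpers below (hypothetical names traffic_bytes and bandwidth_gb_per_s) estimate the resulting memory traffic, assuming nothing stays in cache between calls. That assumption seems reasonable because the five arrays total about 40 MB, which is larger than the caches of many desktop CPUs.

```cpp
#include <cassert>
#include <cstddef>

// Estimated bytes moved to and from memory for `calls` calls of foo(n, ...).
// Assumption: 5 arrays streamed in plus 1 array streamed out per call,
// with no reuse from cache between calls.
double traffic_bytes(std::size_t n, std::size_t calls)
{
    const std::size_t streams = 6; // 5 reads + 1 write
    return double(streams) * sizeof(double) * n * calls;
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a measured run time.
double bandwidth_gb_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

With n = 1000000, 1000 calls, and 8-byte doubles, the estimate is 4.8e10 bytes. Divided by the ~3 seconds measured above, that is about 16 GB/s, which is plausibly in the range of what main memory can sustain on a desktop machine. That would explain why vectorisation changed the assembler output but not the run time.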