c++ loops optimization vectorization multicore

How to optimize the following common loop?


I have the following code:

#include <iostream>
#include <vector>
#include <ctime>
using namespace std;

void foo(int n, double* a, double* b, double *c, double*d, double* e, double* f, double* g)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}

int main()
{
    int m = 1001001;
    vector<double> a(m), b(m), c(m), d(m), f(m);

    clock_t start = std::clock();

    for (int i = 0; i < 1000; ++i)
        foo(1000000, &a[0], &b[0], &c[0], &d[0], &d[1], &f[0], &f[1000] );

    double duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
    cout << "Finished in " << duration << " seconds [CPU Clock] " << endl;
}

Can you give me a working example that optimizes it for better performance? Any compiler is fine, such as the Intel C++ compiler or the Visual C++ compiler. Please also suggest a CPU that performs well for this kind of job.


Solution

  • On Apple clang, I tried:

    • using __restrict__ on the arguments to convince the compiler that there was no aliasing.

    result: no change

    • distributing the computation over 8 threads in foo()

    result: computation time increased from ~3 seconds to ~18 seconds!

    • using #pragma omp parallel for

    result: the compiler ignored the pragma (Apple clang does not enable OpenMP by default) and stayed with the original solution. ~3 seconds.

    • setting the command line option -march=native to allow the CPU's full awesomeness to shine

    result: different assembler output (vectorisation applied), but run time still unchanged at ~3s
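For reference, a minimal sketch of what the __restrict__ attempt might look like. The name foo_restrict is hypothetical and this is not necessarily the exact code that was timed. It assumes a GCC/Clang-style compiler (MSVC spells the keyword __restrict). Only a is written, so marking a alone is enough to promise that the stores do not alias the inputs. The inputs are left unqualified because the question's call passes overlapping read-only ranges (&d[0] with &d[1], and &f[0] with &f[1000]).

```cpp
#include <cstddef>

// Hypothetical variant of foo() from the question.
// `a` is the only array that is written, so it is the only pointer that
// needs the no-alias promise. The inputs are read-only and may overlap.
void foo_restrict(std::size_t n, double* __restrict__ a,
                  const double* b, const double* c, const double* d,
                  const double* e, const double* f, const double* g)
{
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] * a[i] + c[i] * (d[i] + e[i] + f[i] + g[i]);
}
```

It can be called exactly like the original, for example foo_restrict(1000000, &a[0], &b[0], &c[0], &d[0], &d[1], &f[0], &f[1000]).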

    Initial conclusions:

    This problem is bound by memory access, not by CPU throughput: vectorisation changed the generated code but not the run time.
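A rough back-of-the-envelope check of that conclusion, written as a small sketch. The function names are hypothetical, and the model is an assumption: d/e and f/g are the same arrays shifted by 1 and 1000 elements, so each pair is counted as a single stream from main memory, and caches are assumed not to help otherwise (the five vectors occupy roughly 40 MB, which is larger than the last-level cache on many CPUs).

```cpp
// Estimated main-memory traffic for the benchmark in the question.
// Assumed streams per element: reads of a, b, c, d (shared with e) and
// f (shared with g), plus one write of a.
constexpr double bytes_per_element()
{
    constexpr int read_streams  = 5;
    constexpr int write_streams = 1;
    return (read_streams + write_streams) * sizeof(double);
}

// Total traffic in GB for `n` elements processed `repetitions` times.
constexpr double total_gigabytes(double n, double repetitions)
{
    return bytes_per_element() * n * repetitions / 1e9;
}

// Bandwidth in GB/s implied by a measured run time in seconds.
constexpr double implied_bandwidth_gb_per_s(double n, double repetitions,
                                            double seconds)
{
    return total_gigabytes(n, repetitions) / seconds;
}
```

With n = 1000000, 1000 repetitions and the ~3 seconds reported above, this model gives about 48 GB moved, or roughly 16 GB/s. That is the order of magnitude a desktop or laptop memory system sustains, which would be consistent with the loop waiting on memory rather than on arithmetic.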