c++ performance optimization cpu icc

Intel compiler produces code 68% slower than MSVC (full example provided)


I have C++ code that processes three consecutive values at a time from a single 1800-element array. The code compiled by ICC 14.0 is approximately 68% slower than the code produced by MSVC (about 2700 vs. 1600 CPU cycles). I cannot understand why. Could somebody please help? Setting the Intel compiler's -O3 switch does not change the timing. The CPU is Ivy Bridge.

#include <intrin.h>   // declares __rdtsc on MSVC and on ICC for Windows
#include <iostream>

int main(){
        int data[1200];

        //Dummy-populate data
        for(int y=0; y<1200; y++){
            data[y] = y/2 + 7;
        }

        int counter = 0;

        //Just to repeat the test
        while(counter < 10000){

            int Accum = 0;
            long long start = 0;
            long long end = 0;
            int p = 0;

            start = __rdtsc();

            while(p < 1200){
                unsigned int level1 = data[p];  
                unsigned int factor = data[p + 1];
                Accum += (level1 * factor);
                p = p + 2;
            }

            end = __rdtsc();
            std::cout << (end - start) << "  " << Accum << std::endl;
            counter++;
        }
}

Solution

  • ICC sucks here because it works out the address of each data[n] access at run time, à la mov edi,dword ptr [rsp+rax*4+44h]; all that run-time multiplication is expensive. You should be able to avoid it by recoding so that the indices are constants. (You could also use *p_data++ three times, but that introduces a sequencing issue that may adversely affect performance.)

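    // Assumes data is an int array of 1800 elements, read three values at a time.
    unsigned Accum1 = 0, Accum2 = 0;
    for (const int* p_data = data, *p_end = data + 1800; p_data < p_end; p_data += 3)
    {
        // Constant offsets from one moving pointer: no per-access index arithmetic
        unsigned level1 = p_data[0];
        unsigned level2 = p_data[1];
        unsigned factor = p_data[2];

        Accum1 += level1 * factor;
        Accum2 += level2 * factor;
    }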
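
For anyone who wants to try the suggestion against the loop exactly as posted in the question (pairs of values, 1200 elements), here is a minimal sketch that puts the indexed form and the constant-offset pointer form side by side. The function names `make_data`, `sum_indexed` and `sum_pointer` are made up for this illustration and do not come from the question or the answer. The sketch only shows that both forms compute the same sum; whether the pointer form is actually faster under ICC is something you would have to measure yourself.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: fills n elements with the question's dummy data (y/2 + 7).
std::vector<int> make_data(std::size_t n) {
    std::vector<int> data(n);
    for (std::size_t y = 0; y < n; ++y) {
        data[y] = static_cast<int>(y / 2 + 7);
    }
    return data;
}

// Indexed form, as in the question: every access is data[p + k].
unsigned sum_indexed(const std::vector<int>& data) {
    unsigned accum = 0;
    for (std::size_t p = 0; p + 1 < data.size(); p += 2) {
        unsigned level1 = data[p];
        unsigned factor = data[p + 1];
        accum += level1 * factor;
    }
    return accum;
}

// Pointer form, in the spirit of the answer: constant offsets from one moving pointer.
unsigned sum_pointer(const std::vector<int>& data) {
    unsigned accum = 0;
    const int* p = data.data();
    const int* end = p + (data.size() / 2) * 2;  // stop after the last whole pair
    for (; p < end; p += 2) {
        unsigned level1 = p[0];
        unsigned factor = p[1];
        accum += level1 * factor;
    }
    return accum;
}
```

To compare timings, wrap each call in the same __rdtsc() start/end pair used in the question and print the result, so that the compiler cannot discard the loop.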