I have C++ code that processes two consecutive values at a time from a single 1200-element array. The code compiled by ICC 14.0 is approximately 68% slower (2700 vs. 1600 CPU cycles) than the code produced by MSVC, and I cannot understand why. Could somebody please help? Setting the Intel compiler's -O3 switch doesn't change the timing. The CPU is Ivy Bridge.
#include <iostream>
#ifdef _MSC_VER
#include <intrin.h>     // __rdtsc with MSVC (and ICC on Windows)
#else
#include <x86intrin.h>  // __rdtsc with ICC/GCC/Clang elsewhere
#endif

int main(){
    int data[1200];
    //Dummy-populate data
    for(int y=0; y<1200; y++){
        data[y] = y/2 + 7;
    }
    int counter = 0;
    //Just to repeat the test
    while(counter < 10000){
        int Accum = 0;
        long long start = 0;
        long long end = 0;
        int p = 0;
        start = __rdtsc();
        //Timed loop: multiply each pair of neighbours and accumulate
        while(p < 1200){
            unsigned int level1 = data[p];
            unsigned int factor = data[p + 1];
            Accum += (level1 * factor);
            p = p + 2;
        }
        end = __rdtsc();
        std::cout << (end - start) << " " << Accum << std::endl;
        counter++;
    }
}
ICC does badly here because it works out the address for each data[n] access at run time, à la mov edi,dword ptr [rsp+rax*4+44h]; all that run-time index scaling may well be what makes it expensive. You should be able to avoid it by recoding so that the indices are constants. (You could also use *p_data++ for each access, but that introduces a sequencing issue that may adversely affect performance.)
for (const int* p_data = &data[0], *p_end = data + 1200; p_data < p_end; p_data += 2)
{
    // Constant offsets from a moving pointer instead of data[p], data[p + 1]
    unsigned level1 = p_data[0];
    unsigned factor = p_data[1];
    Accum += level1 * factor;
}
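Before timing anything, it is worth checking that the recoded loop still computes the same sum as the original. Below is a minimal sketch of such a check, assuming the data is filled the same way as in the question. The helper names fill_data, sum_indexed and sum_pointer are made up for illustration and do not come from the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: populate the array as the question does (y/2 + 7).
void fill_data(int* data, std::size_t n) {
    for (std::size_t y = 0; y < n; ++y) {
        data[y] = static_cast<int>(y / 2) + 7;
    }
}

// Hypothetical helper: the question's form, indexing data[p] and data[p + 1].
unsigned sum_indexed(const int* data, std::size_t n) {
    assert(n % 2 == 0);  // values are consumed in pairs
    unsigned accum = 0;
    for (std::size_t p = 0; p < n; p += 2) {
        unsigned level1 = data[p];
        unsigned factor = data[p + 1];
        accum += level1 * factor;
    }
    return accum;
}

// Hypothetical helper: the recoded form, a moving pointer with constant offsets.
unsigned sum_pointer(const int* data, std::size_t n) {
    assert(n % 2 == 0);  // values are consumed in pairs
    unsigned accum = 0;
    for (const int* p = data, *end = data + n; p < end; p += 2) {
        unsigned level1 = p[0];
        unsigned factor = p[1];
        accum += level1 * factor;
    }
    return accum;
}
```

Calling both functions on the same filled array should give identical results. Whether the pointer form is actually faster under ICC is something only a measurement and a look at the generated assembly can confirm; an optimizer is free to compile both forms to the same instructions.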