Why is a large loop ignored by the Intel compiler?

All:

I have a very simple C test program, built with the Intel compiler, that times a large loop of floating-point operations. The code (test.c) is as follows:

#include <sys/time.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(int argc, char **argv) {
      const long N = 1000000000;
      double t0, t1, t2, t3;
      double sum=0.0;
      clock_t start, end;
      struct timeval r_start, r_end;
      long i;
      gettimeofday(&r_start, NULL);
      start = clock();
      for (i=0;i<N;i++)
          sum += i*2.0+i/2.0; // doing some floating point operations
      end = clock();
      gettimeofday(&r_end, NULL);
      double cputime_elapsed_in_seconds = (end - start)/(double)CLOCKS_PER_SEC;
      double realtime_elapsed_in_seconds = ((r_end.tv_sec * 1000000 + r_end.tv_usec)
                - (r_start.tv_sec * 1000000 + r_start.tv_usec))/1000000.0;
      printf("cputime_elapsed_in_sec: %e\n", cputime_elapsed_in_seconds);
      printf("realtime_elapsed_in_sec: %e\n", realtime_elapsed_in_seconds);
      //printf("sum= %4.3e\n", sum);
      return 0;
}

However, when I compile and run it with the Intel 13.0 compiler, the large loop seems to be ignored and the measured time is essentially zero:

$ icc test.c
$ ./a.out
cputime_elapsed_in_sec: 0.000000e+00
realtime_elapsed_in_sec: 9.000000e-06

Only if I print the sum (by uncommenting line 26 of test.c) is the loop actually executed:

$ icc test.c
$ ./a.out
cputime_elapsed_in_sec: 2.730000e+00
realtime_elapsed_in_sec: 2.736198e+00
sum= 1.250e+18

The question is: why does the loop seem not to be executed if I do not print the value of sum?

The same issue does not occur with the gcc 4.4.7 compiler. I guess the Intel compiler performs some optimization that drops the loop when its result is never referenced?

The system information is as follows:

$ uname -a
Linux node001 2.6.32-642.11.1.el6.x86_64 #1 SMP Wed Oct 26 10:25:23 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
$ icc -v
icc version 13.0.0 (gcc version 4.4.7 compatibility)
$ gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)

Thanks for any suggestions!

Roy

Solution

Given your observation that printing the final value slows it down^(a), there's a fairly good chance the optimiser has worked out that you never use sum after calculating it, so it's optimising the entire calculation loop out of existence (dead code elimination). That would also fit the difference you see with gcc: when no -O option is given, icc optimises at -O2 by default, whereas gcc defaults to -O0.
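As a minimal sketch of one way to keep the loop from being discarded while timing it, the result can be written to a volatile object, which counts as an observable side effect. The names sum_loop, timed_sum and sink are hypothetical and not from the question. This only guarantees the value gets produced; it does not stop an optimiser from producing it more cheaply or from moving the arithmetic relative to the clock() calls, so the assembly is still worth checking.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: same arithmetic as the loop in test.c. */
static double sum_loop(long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += i * 2.0 + i / 2.0;
    return sum;
}

/* A store to a volatile object is an observable side effect, so the
 * value assigned to it cannot simply be thrown away as unused. */
static volatile double sink;

/* Hypothetical helper: runs the loop for n iterations, reports the CPU
 * time taken via *cpu_seconds, and returns the computed sum. */
double timed_sum(long n, double *cpu_seconds) {
    clock_t start = clock();
    sink = sum_loop(n);              /* result is "used" here */
    clock_t end = clock();
    *cpu_seconds = (end - start) / (double)CLOCKS_PER_SEC;
    return sink;
}
```

Calling timed_sum(1000000000L, &t) from main and printing t would then time the loop without needing to print the sum itself.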

I actually saw something similar quite a while ago when we were testing the performance of the latest VAX 11/780 machine our university had received (showing my age there). It appeared to be several thousand percent faster for exactly the same reason: the new optimising compiler had decided that the loop wasn't actually needed.

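To be certain, you'd have to examine the assembly output. I believe this can be done with icc by using the -Fa <asmFileName> option and then examining the file whose name you used in place of <asmFileName>. The gcc-style -S option (icc -S test.c, which writes test.s) is another way to get it. If the loop has been removed, there will be no loop between the two timing calls.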

^(a) The other possibility I thought of seems to be discounted here.

That was the possibility that, since the range of i is fixed by the constant N and the calculation otherwise involves only constants, the compiler had calculated the final value itself at compile time, reducing the whole loop to a simple constant load.

I've seen gcc do this sort of thing at its -O3 "insane" optimisation level.

I discount that possibility since printing the value would most likely not affect whether that optimisation is applied, yet in your case printing it makes the loop take time again.
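To illustrate why the final value is in principle computable without running the loop, here is a small sketch (the function names are hypothetical). Each term is i*2.0 + i/2.0 = 2.5*i, so the sum over i = 0..N-1 is 2.5 * N(N-1)/2. For N = 1000000000 that is about 1.25e18, which agrees with the sum= 1.250e+18 printed in the question. This shows the arithmetic only, not that icc or gcc actually performs this transformation.

```c
#include <stdio.h>

/* Loop form: same arithmetic as the loop in test.c. */
double sum_by_loop(long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += i * 2.0 + i / 2.0;
    return sum;
}

/* Closed form: every term is 2.5*i and 0 + 1 + ... + (n-1) = n(n-1)/2.
 * The operands are converted to double before multiplying to avoid
 * integer overflow for large n. */
double sum_closed_form(long n) {
    return 2.5 * ((double)n * (double)(n - 1) / 2.0);
}
```

For small n the two functions agree exactly; for very large n the loop version accumulates floating-point rounding error, so the results can differ in the low-order digits.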