c++g++openmp compiler-optimization c++-chrono

Why std::chrono does not work when measuring the execution time of OpenMP parallelized for loops and compiler optimizations?

I have a simple code:

#include <iostream>
#include <chrono>

int main(int argc, char ** argv)
{
  int I=0;
  double time=0.0;
  for(int i=0; i<10; ++i)
  {
    auto begin1=std::chrono::steady_clock::now();
  #pragma omp parallel for simd
    for(int j=0; j<1000000; ++j) I=j;
    auto end1=std::chrono::steady_clock::now();
    auto timei=std::chrono::duration_cast<std::chrono::milliseconds>(end1-begin1).count();
    std::cout<<"time 1:"<<time<<std::endl;
    time+=timei;
    std::cout<<"time 2:"<<time<<std::endl;
  }
  return 0;
}

Use g++ 5.3.1 and compile line:

cmake . -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS="-O2 -fopenmp"

But the output is:

time 1:0
time 2:11
time 1:11
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16

You see, i can't measure the execution time properly using std::chrono!

Why? What is going on? How to measure the execution time?

This is for "-O2" and "-O1" compiler optimization flags. With "-O0" everything works properly. Why?

The same situation is when i use Intel compiler icpc 19.0.1.144 and the compile line:

cmake .-DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DCMAKE_CXX_FLAGS="-march=native -mtune=native -O2 -ipo16 -mcmodel=large"

If to use "-O2" and "-O1" compiler flags, the time is not measured properly, but if to replace them with"-O0", std::chrono works properly.

Speaking frankly, I am shocked.

But the question is the same: Why execution time measurement using std::chrono does not work properly here with "-O1" and "-O2" and works with "O0"? And how to properly measure the execution time in this piece of code?

Let me, please, update the code sample:

#include <iostream>
#include <chrono>
#include <ctime>
#include <omp.h>

int array[10000000]{0};

int main(int argc, char ** argv)
{
  clock_t t;
  double time=0.0;
  for(int i=0; i<10; ++i)
  {
    auto begin1=std::chrono::steady_clock::now();
    t=clock();
  #pragma omp parallel for simd
    for(int j=0; j<1000000; ++j) array[j]=j;
    auto end1=std::chrono::steady_clock::now();
    auto timei=std::chrono::duration_cast<std::chrono::milliseconds>(end1-begin1).count();
    std::cout<<"time 1:"<<time<<std::endl;
    time+=timei;
    std::cout<<"time 2:"<<time<<std::endl;
    t=clock()-t;
    printf(\nt%i=%f\n", i, (double)t/CLOCKS_PER_SEC);
  }
  return 0;
}

Now the std:chrono timer updates properly. But sometimes the results of std::clock and std::chrono differ significantly. Suppose, std::chrono is more accurate and its timing should be used.

So, as @Hamza, answered below, the compiler simply threw away the blocks of code which did nothing. But both the Intel and g++ compilers did not warn me about anything.

So, for the future, do not write for loops which do nothing. The compiler may simply throw away the piece of code which has no effect.

In my full code i tried to compare the relative performance of 2 functions, returning the same result: 1 - returned the value interpolating a table and 2 - calculating it from a formula (formula is an approximation of the table points). My error was that i wrote the results in the inner loop in temporary stack variables, simply did nothing. The compiler threw it away. I should write the values in the inner loop into an array, or accumulate them in any other way, simply, do something useful, what the compiler will not throw away.

That is how i understood it.

Solution

My guess is that the compiler simply optimizes away what you're doing in your loop since it decides it is useless. With the following code you get some actual ms:

#include <iostream>
#include <chrono>
#include <omp.h>

int main(int argc, char ** argv)
{
    int I=0;
    double time=0.0;
    for(int i=0; i<10; ++i)
    {
        auto begin1=std::chrono::steady_clock::now();
        #pragma omp parallel for simd
        for(int j=0; j<100000000; ++j) I+=j;
        auto end1=std::chrono::steady_clock::now();
        auto timei=std::chrono::duration_cast<std::chrono::milliseconds>(end1-   begin1).count();

        std::cout << I << std::endl;
        std::cout<<"time 1:"<<time<<std::endl;    
        time+=timei;
        std::cout<<"time 2:"<<time<<std::endl;
    }
    return 0;
}

I get the following output:

887459712
time 1:0
time 2:71
1774919424
time 1:71
time 2:142
-1632588160
time 1:142
time 2:213
-745128448
time 1:213
time 2:283
142331264
time 1:283
time 2:351
1029790976
time 1:351
time 2:419
1917250688
time 1:419
time 2:487
-1490256896
time 1:487
time 2:555
-602797184
time 1:555
time 2:623
284662528
time 1:623
time 2:692