Search code examples
c++g++openmpcompiler-optimizationc++-chrono

Why std::chrono does not work when measuring the execution time of OpenMP parallelized for loops and compiler optimizations?


I have a simple code:

#include <iostream>
#include <chrono>

int main(int argc, char ** argv)
{
  int I=0;
  double time=0.0;
  for(int i=0; i<10; ++i)
  {
    auto begin1=std::chrono::steady_clock::now();
  #pragma omp parallel for simd
    for(int j=0; j<1000000; ++j) I=j;
    auto end1=std::chrono::steady_clock::now();
    auto timei=std::chrono::duration_cast<std::chrono::milliseconds>(end1-begin1).count();
    std::cout<<"time 1:"<<time<<std::endl;
    time+=timei;
    std::cout<<"time 2:"<<time<<std::endl;
  }
  return 0;
}

Use g++ 5.3.1 and compile line:

cmake . -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS="-O2 -fopenmp"

But the output is:

time 1:0
time 2:11
time 1:11
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16

You see, i can't measure the execution time properly using std::chrono!

Why? What is going on? How to measure the execution time?

This is for "-O2" and "-O1" compiler optimization flags. With "-O0" everything works properly. Why?

The same situation is when i use Intel compiler icpc 19.0.1.144 and the compile line:

cmake .-DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DCMAKE_CXX_FLAGS="-march=native -mtune=native -O2 -ipo16 -mcmodel=large"

If to use "-O2" and "-O1" compiler flags, the time is not measured properly, but if to replace them with"-O0", std::chrono works properly.

Speaking frankly, I am shocked.

But the question is the same: Why execution time measurement using std::chrono does not work properly here with "-O1" and "-O2" and works with "O0"? And how to properly measure the execution time in this piece of code?

Let me, please, update the code sample:

#include <iostream>
#include <chrono>
#include <ctime>
#include <omp.h>

int array[10000000]{0};

int main(int argc, char ** argv)
{
  clock_t t;
  double time=0.0;
  for(int i=0; i<10; ++i)
  {
    auto begin1=std::chrono::steady_clock::now();
    t=clock();
  #pragma omp parallel for simd
    for(int j=0; j<1000000; ++j) array[j]=j;
    auto end1=std::chrono::steady_clock::now();
    auto timei=std::chrono::duration_cast<std::chrono::milliseconds>(end1-begin1).count();
    std::cout<<"time 1:"<<time<<std::endl;
    time+=timei;
    std::cout<<"time 2:"<<time<<std::endl;
    t=clock()-t;
    printf(\nt%i=%f\n", i, (double)t/CLOCKS_PER_SEC);
  }
  return 0;
}

Now the std:chrono timer updates properly. But sometimes the results of std::clock and std::chrono differ significantly. Suppose, std::chrono is more accurate and its timing should be used.

So, as @Hamza, answered below, the compiler simply threw away the blocks of code which did nothing. But both the Intel and g++ compilers did not warn me about anything.

So, for the future, do not write for loops which do nothing. The compiler may simply throw away the piece of code which has no effect.

In my full code i tried to compare the relative performance of 2 functions, returning the same result: 1 - returned the value interpolating a table and 2 - calculating it from a formula (formula is an approximation of the table points). My error was that i wrote the results in the inner loop in temporary stack variables, simply did nothing. The compiler threw it away. I should write the values in the inner loop into an array, or accumulate them in any other way, simply, do something useful, what the compiler will not throw away.

That is how i understood it.


Solution

  • My guess is that the compiler simply optimizes away what you're doing in your loop since it decides it is useless. With the following code you get some actual ms:

    #include <iostream>
    #include <chrono>
    #include <omp.h>
    
    int main(int argc, char ** argv)
    {
        int I=0;
        double time=0.0;
        for(int i=0; i<10; ++i)
        {
            auto begin1=std::chrono::steady_clock::now();
            #pragma omp parallel for simd
            for(int j=0; j<100000000; ++j) I+=j;
            auto end1=std::chrono::steady_clock::now();
            auto timei=std::chrono::duration_cast<std::chrono::milliseconds>(end1-   begin1).count();
    
            std::cout << I << std::endl;
            std::cout<<"time 1:"<<time<<std::endl;    
            time+=timei;
            std::cout<<"time 2:"<<time<<std::endl;
        }
        return 0;
    }
    

    I get the following output:

    887459712
    time 1:0
    time 2:71
    1774919424
    time 1:71
    time 2:142
    -1632588160
    time 1:142
    time 2:213
    -745128448
    time 1:213
    time 2:283
    142331264
    time 1:283
    time 2:351
    1029790976
    time 1:351
    time 2:419
    1917250688
    time 1:419
    time 2:487
    -1490256896
    time 1:487
    time 2:555
    -602797184
    time 1:555
    time 2:623
    284662528
    time 1:623
    time 2:692