atomic operations on a variable which is in OpenMP reduction clause

I have a following piece of code. Its idea is very simple. There are billions of events in my full program and I need to count some of them not using long int type. So, I have to use 2 int numbers HIT and COUNT istead of 1 int number because there will be an overflow of 1 int variable (very big loop count).

#include <fstream>
#include <cstring>
#include <cmath>
#include <random>
#include <limits>
#include <chrono>

using namespace std;

int N=1000000000;
long int K=20*N;
int HIT=0;
int COUNT=0;
long int MAX=std::numeric_limits<int>::max();

int main(int argc, char **argv)
{
  auto begin=std::chrono::steady_clock::now();
  for(long int i=0; i<K; ++i)
  {
    ++HIT;
    if(HIT == MAX)
    {
      ++COUNT;
      HIT=0;
      cout<<"COUNT="<<COUNT<<endl;
    }
  }
  auto end=std::chrono::steady_clock::now();
  cout<<"HIT="<<HIT<<endl;
  cout<<"COUNT="<<COUNT<<endl;

  const long int Total = HIT+COUNT*MAX;
  cout<<"Total="<<Total<<" MAX="<<MAX<<endl;
  if(Total==K) cout<<"Total == K"<<endl;
  else         cout<<"Total != K"<<endl;
  auto elapsed_ms=std::chrono::duration_cast<std::chrono::milliseconds>(end-begin);
  std::cout<<"time="<<elapsed_ms.count()<<" ms"<<std::endl;
  return 0;
}

The code works properly in 1 thread and gives the following output:

COUNT=1
COUNT=2
COUNT=3
COUNT=4
COUNT=5
COUNT=6
COUNT=7
COUNT=8
COUNT=9
HIT=672647177
COUNT=9
Total=20000000000 MAX=2147483647
Total == K
time=30971 ms

I need to have it working in parallel using OpenMP not using mutexes or some functions connected with compiler implementation, if possible. But when I modify it as:

#pragma omp parallel for simd reduction(+:HIT,COUNT)
  for(long int i=0; i<K; ++i)

the output is the following:

HIT=20000000000
COUNT=0
Total=20000000000 MAX=2147483647
Total == K
time=2771 ms

Finally, when I modify the code as:

#pragma omp parallel for simd reduction(+:HIT,COUNT)
for(long int i=0; i<K; ++i)
{
  ++HIT;
  if(HIT == MAX)
  {
    ++COUNT;
  #pragma omp atomic write
    HIT=0;
    cout<<"COUNT="<<COUNT<<endl;
  }
}

the output is:

COUNT=1
COUNT=1
COUNT=1
COUNT=1
COUNT=1
COUNT=1
COUNT=1
COUNT=1
HIT=2820130824
COUNT=8
Total=20000000000 MAX=2147483647
Total == K
time=4232 ms

Could anyone be so kind to explain me what is going on and why the output is so much different?

I need to get the code properly working in parallel using OpenMP, so how to properly do it?

#pragma omp atomic write

correct or should I write

#pragma omp atomic update?

Is it possible to write atomic operations on the values which already are in OpenMP reduction clause?

Use Intel C++ 2019 compiler.

g++ does not allow to use simd in

#pragma omp parallel for simd reduction(+:HIT,COUNT)

and if to drop simd, the code works incorrectly using g++.

Solution

A simple + reduction won't work for two integers that aren't exactly summed independently, but since OpenMP 4.0 you can declare your own reductions. All you need to do is abstract the two parts of the counter in a class (or struct) and define a function that sums such objects. In the example below, an overloaded compound assignment operator (+=) is used:

#include <limits>
#include <iostream>
#include <omp.h>

using namespace std;

const long int MAX = std::numeric_limits<int>::max();
const long int K = MAX + 20L;

class large_count {
   int count, hit;
public:
   large_count() : count(0), hit(0) {}

   // Prefix increment operator
   large_count& operator++() {
      hit++;
      if (hit == MAX) {
         hit = 0;
         count++;
      }
      return *this;
   }

   // Compound assignment operator
   large_count& operator+=(const large_count& other) {
      count += other.count;
      long int sum_hit = (long)hit + other.hit;
      if (sum_hit >= MAX) {
         count++;
         hit = sum_hit - MAX;
      }
      else
         hit = sum_hit;
      return *this;
   }

   long total() const { return hit + count * MAX; }
};

#pragma omp declare reduction (large_sum : large_count : omp_out += omp_in)

int main() {
   large_count cnt;
   double t = -omp_get_wtime();
   #pragma omp parallel for reduction(large_sum : cnt)
   for (long int i = 0; i < K; i++)
      ++cnt;
   t += omp_get_wtime();
   cout << (cnt.total() == K ? "YES" : "NO") << endl;
   cout << t << " s" << endl;
}

The custom reduction is declared using:

#pragma omp declare reduction (large_sum : large_count : omp_out += omp_in)

There are three parts of the declaration:

large_sum - this is the name given to the custom reduction operation
large_count - this is type that the reduction operates on
omp_out += omp_in - this is the combiner expression. omp_out and omp_in are special pseudo-variables provided by the OpenMP runtime. They are both of type large_count. The combiner expression has to combine the two values and update the value of omp_out.

Sample output:

$ g++ --version
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
...
$ g++ -std=c++11 -fopenmp -o cnt cnt.cc
$ OMP_NUM_THREADS=1 ./cnt
YES
9.39628 s
$ OMP_NUM_THREADS=3 ./cnt
YES
3.79765 s