c++ · multithreading · openmp

OpenMP multithreaded code giving a different answer when using multiple threads


I have the following Monte Carlo code with OpenMP multithreading:

#include <cmath>
#include <random>

int main()
{
  std::uniform_int_distribution<int> dir(0, 1);
  std::uniform_int_distribution<int> xyz(0, 2);
  std::uniform_real_distribution<double> angle(0, 360);
  std::mt19937 gen{0};

  auto M = 20;
  long double sum = 0;
  auto num_trials = 10000;
  // omp_set_num_threads(12);

  #pragma omp parallel
  {
    double loc_sum = 0.0;
    #pragma omp for
    for(int i = 0; i < num_trials; ++i)
    {
      double x = 0;
      double y = 0;
      double z = 0;
      double r = 0; 
      auto N = 0;
      
      while(r < M)
      {
        auto d = dir(gen);
        auto p = xyz(gen);
        if(p == 0)
        {
          x += (d == 1) ? -1 : 1;
        }
        else if(p == 1)
        {
          y += (d == 1) ? -1 : 1;
        }
        else
        {
          z += (d == 1) ? -1 : 1;
        }

        r = std::sqrt(x * x + y * y + z * z);
        ++N;
      }

      loc_sum += N;
    }

    #pragma omp critical
      sum += loc_sum;
  }
}

The value of sum is quite different when executing in serial versus with multiple threads. I expect a slight difference due to the order of the calls to the random distributions, but the difference I am observing is too large to be explained by randomness alone, so I suspect there's a bug in my multithreaded code.

Are there any race conditions or data races in this code that are affecting sum?


Solution

  • The issue is that you call the distributions (dir and xyz), and through them the underlying PRNG (gen), from multiple threads without any locking.

    None of these calls is atomic; making them atomic by default would make single-threaded code slower than it needs to be.

    Wrapping the lines that generate d and p in #pragma omp critical fixes the race.

    If you don't want a critical section, you'll need separate dir, xyz, and gen objects in each thread. The distributions (dir and xyz) can simply be copied. The PRNG (gen) must be seeded differently per thread, otherwise every thread carries a PRNG with exactly the same state and produces the exact same sequence. E.g.:

    std::random_device rd; /* Outside the parallel section. */
    
    // Code below once per thread.
    /* Initialization of the PRNG calls std::random_device::operator() which 
     * needs a lock around it when called in parallel. */
    std::mt19937 gen;    
    #pragma omp critical
    {
      gen.seed(rd());
    }
    
    // for-loop starts here.
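
    Putting the pieces together, here is a minimal sketch of the per-thread approach, refactored into a helper function (average_exit_steps is a name introduced here for illustration, not from the original code). Each thread owns its own distributions and generator, so no locking is needed inside the hot loop; the only critical sections are the one-time seeding and the final reduction:

    ```cpp
    #include <cmath>
    #include <random>

    // Average number of steps for a 3-D lattice random walk to first
    // reach radius M, estimated over num_trials independent walks.
    long double average_exit_steps(int M, int num_trials)
    {
      long double sum = 0;
      std::random_device rd; // shared; only accessed inside the critical section

      #pragma omp parallel
      {
        // Thread-private distributions and PRNG: no shared mutable state.
        std::uniform_int_distribution<int> dir(0, 1);
        std::uniform_int_distribution<int> xyz(0, 2);
        std::mt19937 gen;
        #pragma omp critical
        gen.seed(rd()); // std::random_device is not guaranteed thread-safe

        long double loc_sum = 0.0;
        #pragma omp for
        for (int i = 0; i < num_trials; ++i)
        {
          double x = 0, y = 0, z = 0, r = 0;
          int N = 0;
          while (r < M)
          {
            int d = dir(gen);
            int p = xyz(gen);
            if (p == 0)      x += (d == 1) ? -1 : 1;
            else if (p == 1) y += (d == 1) ? -1 : 1;
            else             z += (d == 1) ? -1 : 1;
            r = std::sqrt(x * x + y * y + z * z);
            ++N;
          }
          loc_sum += N;
        }

        #pragma omp critical
        sum += loc_sum;
      }

      return sum / num_trials;
    }
    ```

    Note that the result will still vary slightly between runs and thread counts, because each thread now draws its own random sequence; what disappears is the large, systematic discrepancy caused by the data race on gen.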