Tags: c++, parallel-processing, openmp, eigen, race-condition

Data race with OpenMP and Eigen but data is read-only?


I am trying to find the data race in my code, but I just can't seem to grasp why it happens. The data used in the threads is read-only, and the only variable that is written to is protected by a critical region.

I tried using the Intel Inspector, but I am compiling with g++ 9.3.0 and apparently even the 2021 version can't deal with GCC's OpenMP implementation. The release notes no longer list it as an explicit exception, as they did for older versions, but there is a warning about false positives because it is not supported. It also always reports a data race for the pragma statements themselves, which isn't helpful at all.

My current suspects are either Eigen or the fact that I use a reference to a std::vector. I compile Eigen itself with EIGEN_DONT_PARALLELIZE so that it does not interfere with nested parallelism, although I don't think I use anything that would trigger it anyway.

Edit: I am not sure whether it is really a "data race" (or a wrong memory access?), but the example produces non-deterministic output in the sense that the result differs for the same input. When this happens, the loop in main breaks. With more than one thread this happens early (usually after 5-12 iterations). If I run it with only one thread, or compile without OpenMP, I have to end the example program manually.

Minimal (not) working example below.

#include <Eigen/Dense>
#include <vector>
#include <iostream>
#include <limits>
#include <cmath>

#ifdef _OPENMP
#include <omp.h>
#else
#define omp_set_num_threads(number)
#endif

typedef Eigen::Matrix<double, 9, 1> Vector9d;
typedef std::vector<Vector9d, Eigen::aligned_allocator<Vector9d>> Vector9dList;

Vector9d derivPath(const Vector9dList& pathPositions, int index){
    
    int n = pathPositions.size()-1;
    
    if(index >= 0 && index < n+1){
        
        // path is one point, no derivative possible
        if(n == 0){
            return Vector9d::Zero();
        }
        // last point, no forward neighbour to difference against
        else if(index == n){
            return Vector9d::Zero();
        }
        // forward difference towards the next point, scaled by the number of segments
        else {
            return n * (pathPositions[index+1] - pathPositions[index]);
        }
    }
    else{
        return Vector9d::Zero();
    }
}

// ********************************
// data race occurs here somewhere
double errorFunc(const Vector9dList& pathPositions){
    int n = pathPositions.size()-1;
    double err = 0.0;
    
    #pragma omp parallel default(none) shared(pathPositions, err, n)
    {
        double err_private = 0;
        
        #pragma omp for schedule(static)
        for(int i = 0; i < n+1; ++i){
            
            Vector9d derivX_i = derivPath(pathPositions, i);
            
            // when I replace this with pathPositions[i][0] the loop in the main doesn't break
            // (or at least I always had to manually end the program)
            // but it does break if I use derivX_i[0];
            double err_i = derivX_i.norm();
            err_private = err_private + err_i;
        }
        
        #pragma omp critical
        {
            err += err_private;
        }
    }
    
    err = err / static_cast<double>(n);
    return err;
}
// ***************************************

int main(int argc, char **argv){
    
    // setup data
    int n = 100;
    Vector9dList pathPositions;
    pathPositions.reserve(n+1);
    
    double a = 5.0;
    double b = 1.0;
    double c = 1.0;
    
    Eigen::Vector3d f, u;
    
    f << 0, 0, -1;//-p;
    u << 0, 1, 0;
    
    for(int i = 0; i<n+1; ++i){
        double t = static_cast<double>(i)/static_cast<double>(n);
        Eigen::Vector3d p;
        
        double x = 2*t*a - a;
        double z = -b/(a*a) * x*x + b + c;
        
        p << x, 0, z;
        
        Vector9d cam; 
        cam << p, f, u;
        
        pathPositions.push_back(cam);
    }
    
    omp_set_num_threads(8);
    
    //reference value
    double pe = errorFunc(pathPositions);
    
    int i = 0;
    do{
        double pe_i = errorFunc(pathPositions);
        
        // there is a data race
        if(std::abs(pe-pe_i) > std::numeric_limits<double>::epsilon()){
            std::cout << "Difference detected at iteration " << i << " diff:" << std::abs(pe-pe_i) << "\n";
            break;
        }
        i++;
    }
    while(true);
}

Output for running the example multiple times

Difference detected at iteration 13 diff:1.77636e-15
Difference detected at iteration 1 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 7 diff:1.77636e-15
Difference detected at iteration 8 diff:1.77636e-15
Difference detected at iteration 6 diff:1.77636e-15

As you can see, the difference is minor but present, and it doesn't always happen in the same iteration, which makes it non-deterministic. There is no such output if I run it single-threaded; I usually end the program manually after letting it run for a couple of minutes. Therefore, it has to have something to do with the parallelization.


I know I could use a reduction in this case but in the original code in my project I have to compute other things in the parallel region as well and I wanted to keep the minimal example as close to the original structure as possible.
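
For reference, a reduction variant would look roughly like the sketch below (only a sketch; the real parallel region computes more than this one sum):

double errorFuncReduction(const Vector9dList& pathPositions){
    int n = pathPositions.size()-1;
    double err = 0.0;
    
    // each thread accumulates into a private copy of err and
    // OpenMP combines the private copies at the end of the loop
    #pragma omp parallel for schedule(static) reduction(+:err)
    for(int i = 0; i < n+1; ++i){
        err += derivPath(pathPositions, i).norm();
    }
    
    return err / static_cast<double>(n);
}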

I use OpenMP in other parts of my program too, where I am not sure whether I have a data race either, but the structure is similar (except that I use #pragma omp parallel for and the collapse clause). I have some variable or vector that I write to, but it is always either inside a critical region or each thread writes only to its own subset of the vector. Data that is used by multiple threads is always read-only. The read-only data is always a std::vector, a reference to a std::vector, or a numerical type like int or double. The vectors always contain an Eigen type or double.


Solution

  • There are no race conditions here. You are observing a natural consequence of the non-associativity of finite-precision floating-point arithmetic: (A + B) + C is not always the same as A + (B + C) when A, B, and C are floating-point numbers, because each intermediate result is rounded. 1.77636E-15 x 100 (the absolute error when err = err / static_cast<double>(n); is commented out) in binary is:

    0 | 01010101 | 00000000000000000001100
    S   exponent           mantissa
    

    As you can see, the error is in the least significant bits of the mantissa, hinting that it is the result of accumulated rounding errors.
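
    A minimal standalone demonstration of the non-associativity (a sketch with made-up numbers, nothing OpenMP- or Eigen-specific):

    #include <cstdio>
    
    int main(){
        // rounding after each addition makes the result depend on the grouping
        double a = 0.1, b = 0.2, c = 0.3;
        std::printf("(a + b) + c = %.17g\n", (a + b) + c);  // 0.60000000000000009
        std::printf("a + (b + c) = %.17g\n", a + (b + c));  // 0.59999999999999998
    
        // the effect is more pronounced when the magnitudes differ
        double big = 1e16, one = 1.0;
        std::printf("(big + one) + one = %.17g\n", (big + one) + one);  // 10000000000000000
        std::printf("big + (one + one) = %.17g\n", big + (one + one));  // 10000000000000002
    }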

    The problem occurs here:

        #pragma omp parallel default(none) shared(pathPositions, err, n)
        {
            ...        
            
            #pragma omp critical
            {
                err += err_private;
            }
        }
    

    The final value of err depends on the order in which the different threads arrive at the critical section and have their contributions added, which is why you sometimes see a discrepancy right away and sometimes only after a couple of iterations.
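
    To see how much the arrival order alone can matter, one can add the same set of hypothetical per-thread partial sums in every possible order (the values below are made up purely for illustration):

    #include <algorithm>
    #include <cstdio>
    #include <set>
    #include <vector>
    
    int main(){
        // hypothetical partial sums produced by 8 threads (made-up values)
        std::vector<double> partials = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8};
        std::sort(partials.begin(), partials.end());
    
        std::set<double> totals;
        do {
            double sum = 0.0;
            for(double p : partials) sum += p;  // same additions, different order
            totals.insert(sum);
        } while(std::next_permutation(partials.begin(), partials.end()));
    
        // typically prints a handful of distinct totals, all within a few ULPs of each other
        std::printf("distinct totals: %zu\n", totals.size());
        for(double t : totals) std::printf("%.17g\n", t);
    }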

    To demonstrate that it is not an OpenMP problem per se, simply modify the function to read:

    double errorFunc(const Vector9dList& pathPositions){
        int n = pathPositions.size()-1;
        double err = 0.0;
        std::vector<double> errs(n+1);
        
        #pragma omp parallel default(none) shared(pathPositions, errs, n)
        {
            #pragma omp for schedule(static)
            for(int i = 0; i < n+1; ++i){
                
                Vector9d derivX_i = derivPath(pathPositions, i);
                
                errs[i] = derivX_i.norm();
            }
        }
    
        for (int i = 0; i < n+1; ++i)
            err += errs[i];
        
        err = err / static_cast<double>(n);
        return err;
    }
    

    This removes the dependency on how the partial sums are computed and added together, so the return value is always the same, no matter the number of OpenMP threads.

    Another version only fixes the order in which the err_private values are reduced into err:

    double errorFunc(const Vector9dList& pathPositions){
        int n = pathPositions.size()-1;
        double err = 0.0;
        std::vector<double> errs(omp_get_max_threads());
        int nthreads;
        
        #pragma omp parallel default(none) shared(pathPositions, errs, n, nthreads)
        {
            #pragma omp master
            nthreads = omp_get_num_threads();
    
            double err_private = 0;
    
            #pragma omp for schedule(static)
            for(int i = 0; i < n+1; ++i){
                
                Vector9d derivX_i = derivPath(pathPositions, i);
                
                double err_i = derivX_i.norm();
                err_private = err_private + err_i;
            }
    
            errs[omp_get_thread_num()] = err_private;
        }
    
        for (int i = 0; i < nthreads; i++)
            err += errs[i];
        
        err = err / static_cast<double>(n);
        return err;
    }
    

    Again, this code produces the same result every time as long as the number of threads is kept constant. The value may differ slightly (in the LSBs) with a different number of threads.

    You can't easily get around such discrepancies; you can only learn to live with them and take precautions to minimise their influence on the rest of the computation. In fact, you are really lucky to stumble upon this in 2021, in the post-x87 era, when virtually all commodity FPUs use 64-bit IEEE 754 operands, and not in the 1990s, when x87 FPUs used 80-bit operands and the result of a repeated accumulation depended on whether you kept the value in an FPU register the whole time or periodically stored it to and loaded it back from memory, which rounds the 80-bit representation to a 64-bit one.
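
    One such precaution (not part of the original code, just a sketch of a common technique) is compensated, a.k.a. Kahan, summation, which keeps the accumulated rounding error of a long sum down to a few ULPs. It does not make the result order-independent, but it shrinks the discrepancies considerably:

    #include <vector>
    
    // Kahan (compensated) summation -- note that aggressive options such as
    // -ffast-math may optimise the compensation away, so compile with the
    // default floating-point semantics.
    double kahanSum(const std::vector<double>& values){
        double sum  = 0.0;
        double comp = 0.0;              // running compensation for lost low-order bits
        for(double v : values){
            double y = v - comp;        // apply the compensation to the next term
            double t = sum + y;         // low-order bits of y are lost here ...
            comp = (t - sum) - y;       // ... recover them ...
            sum  = t;                   // ... and carry them over to the next step
        }
        return sum;
    }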

    In the meantime, Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic" is mandatory reading for anyone dealing with math on digital computers.

    P.S. Although it is 2021 and we've been living for 21 years in the post-x87 era (started when Pentium 4 introduced the SSE2 instruction set back in 2000), if your CPU is an x86 one, you can still partake in the x87 madness. Just compile your code with -mfpmath=387 :)