Tags: c++, vector, openmp

How to use OpenMP reduction with a two-dimensional vector


I want to parallelise this for loop in C++ with OpenMP:

    #pragma omp parallel for reduction(+:A)
    for (uint k = 0; k < nInd; ++k)
    {
        for (uint l = k; l < nInd; ++l)
        {
            A.at(k).at(l) = A.at(k).at(l) + frq.at(l);
            A.at(l).at(k) = A.at(k).at(l);
        }
    }

But OpenMP does not know how to apply a + reduction to A, which is a std::vector<std::vector<float>>.

How can I solve this issue?

The error is: user defined reduction not found for ‘A’


Solution

  • I can't tell you much about the error itself, but I can tell you that what you're doing here is not what you should be doing:

    Using the bounds-checking A.at(i).at(j) instead of the raw A[i][j] is a bad idea here: you know the lengths stay constant on entry to this loop, so check them once up front, and then use the faster []. It's especially bad because with at, OpenMP potentially has to take precautions that would otherwise be unnecessary (exceptions in multi-threaded code are always a bit ... special), and you get concurrent read accesses to the vector lengths all the time instead of independent accesses to the raw memory.
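
    A minimal sketch of that idea (the function name check_dims and the nInd parameter are just illustrative):

        #include <cstddef>
        #include <stdexcept>
        #include <vector>

        // Verify the dimensions once, up front, ...
        void check_dims(const std::vector<std::vector<float>> &A,
                        const std::vector<float> &frq, std::size_t nInd)
        {
            if (A.size() < nInd || frq.size() < nInd)
                throw std::out_of_range("A or frq smaller than nInd");
            for (const auto &row : A)
                if (row.size() < nInd)
                    throw std::out_of_range("row of A shorter than nInd");
        }
        // ... so the hot loop can then use the unchecked A[k][l]
        // instead of A.at(k).at(l).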

    Most importantly, however: the inner loop writes to the l-th row, i.e., to rows other than the row k that the outer iteration nominally owns. This means the order in which rows are processed matters, so you cannot simply parallelize the loop as written. If OpenMP were allowed to fully parallelize the operations on the rows, this would invariably lead to situations where one thread writes to memory that another thread is still reading.

    I don't see a good way to keep this loop in one piece. Maybe you want to make two loops out of it: one that just adds the frq values to the upper triangle, and one that mirrors the result into the lower triangle (the "transpose" step). Or write to a separate matrix instead of doing this in place in A (whether that works is a question of memory size).
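
    A sketch of that split (names as in the question; this assumes A is an nInd-by-nInd matrix and frq has nInd entries):

        #include <cstddef>
        #include <vector>

        void update(std::vector<std::vector<float>> &A,
                    const std::vector<float> &frq, std::size_t nInd)
        {
            // Loop 1: add the frq values to the upper triangle.
            // Iteration k only reads and writes row k, so the iterations
            // are independent. The rows get shorter as k grows, hence the
            // dynamic schedule for load balancing.
            #pragma omp parallel for schedule(dynamic)
            for (std::size_t k = 0; k < nInd; ++k)
                for (std::size_t l = k; l < nInd; ++l)
                    A[k][l] += frq[l];

            // Loop 2: mirror the upper triangle into the lower one.
            // Iteration k reads only row k (upper triangle) and writes
            // only column k (lower triangle), so again no two iterations
            // touch the same element.
            #pragma omp parallel for schedule(dynamic)
            for (std::size_t k = 0; k < nInd; ++k)
                for (std::size_t l = k + 1; l < nInd; ++l)
                    A[l][k] = A[k][l];
        }

    Note that no reduction clause is needed: within each of the two loops, no element of A is written by more than one iteration.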

    My general recommendation is that you never actually transpose matrices (unless you really need to enforce a particular memory access pattern); that's just a waste of CPU. Whatever uses the transposed matrix can simply swap its row and column indices at no extra computational cost, and you get the transposition "for free".
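
    For illustration (the helper is hypothetical, purely to show the index swap):

        #include <cstddef>
        #include <vector>

        // Element (i, j) of the transpose of A is just element (j, i) of A:
        float transposed_at(const std::vector<std::vector<float>> &A,
                            std::size_t i, std::size_t j)
        {
            return A[j][i];
        }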

    Another recommendation: there are good C++ libraries for linear algebra. Use them. You get well-parallelized (or at least parallelizable) functions, and often also things like views on data, where transposing does nothing but change the way addressing works (i.e., it is basically free).
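
    For example, with Eigen (one such library), a transpose is a lightweight expression that merely remaps indices; nothing is copied unless you assign it to a plain matrix. A small sketch to show the view idea, not a drop-in replacement for the loop above:

        #include <Eigen/Dense>

        void demo(Eigen::Index nInd)
        {
            Eigen::MatrixXf A = Eigen::MatrixXf::Zero(nInd, nInd);
            Eigen::VectorXf frq = Eigen::VectorXf::Ones(nInd);

            A.rowwise() += frq.transpose(); // add frq to every row, vectorized
            auto At = A.transpose();        // a view: no data is moved
            float x = At(1, 0);             // reads A(0, 1); assumes nInd >= 2
            (void)x;
        }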

    std::vector<std::vector<float>> implies you're doing two memory indirections per element access, which is comparatively expensive. Your CPU needs much, much less time to multiply two floating-point numbers than it needs to fetch an address from memory that's not already in cache; hence, you really want your complete matrix to sit in one contiguous piece of memory, not scattered around. If you have a use case for OpenMP, you're investing a lot of effort into making your code run faster. Instead of doing the hard thing (parallelizing) first, you should start by replacing std::vector<std::vector<float>> accessed with .at(i).at(j) by a better way of representing matrices.
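
    A minimal sketch of such a representation (a hand-rolled, row-major stand-in for what a linear algebra library would give you):

        #include <cstddef>
        #include <vector>

        struct MatrixF
        {
            std::size_t n;           // square n-by-n, for simplicity
            std::vector<float> data; // one contiguous allocation

            explicit MatrixF(std::size_t n) : n(n), data(n * n, 0.0f) {}

            // Row-major addressing: one multiply-add, no pointer chasing.
            float &operator()(std::size_t i, std::size_t j)
            {
                return data[i * n + j];
            }
            float operator()(std::size_t i, std::size_t j) const
            {
                return data[i * n + j];
            }
        };

    The loop body then becomes A(k, l) += frq[l];, and the whole matrix lives in one cache-friendly block.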