c++, openmp, hpc

How to avoid reinitializing a vector every time through an OpenMP for loop?


I have a for loop that looks like this:

for (int i = 0;i<N;i++) {
    vector<double> vec;
    //then do work on vec, such as resize or push_back
}

This is inefficient because vec is constructed anew and resized on every iteration, which may force a dynamic memory allocation each time through the loop. So an easy optimization is:

vector<double> vec;
for (int i = 0;i<N;i++) {
    vec.clear();
    //then do work on vec, such as resize or push_back
}

This is faster because clear() does not deallocate the memory owned by vec, so we don't have to deallocate and reallocate that memory on every iteration.
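
For example, on mainstream implementations you can see that clear() keeps the capacity (this snippet is just an illustration, not part of the loop above):

#include <cstdio>
#include <vector>

int main() {
    std::vector<double> vec;
    for (int j = 0; j < 1000; j++)
        vec.push_back(j);                  // forces a few reallocations while growing

    std::printf("before clear: size=%zu capacity=%zu\n", vec.size(), vec.capacity());
    vec.clear();                           // destroys the elements but keeps the buffer
    std::printf("after  clear: size=%zu capacity=%zu\n", vec.size(), vec.capacity());
    return 0;
}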

But what if I want to parallelize the for loop with OpenMP? I can't have all the threads share that one vector 'vec'. So it seems like I need to go back to the first option and reinitialize the vector every time through the loop, like this:

#pragma omp parallel for
for (int i = 0;i<N;i++) {
    vector<double> vec;
    //then do work on vec, such as resize or push_back
}

Is there a way to avoid this inefficiency and avoid reallocating the vector every time? Would it be safe to do something like this?

vector<vector<double>> outervec;
outervec.resize(omp_get_max_threads());

#pragma omp parallel for shared(outervec)
for (int i = 0;i<N;i++) {
    int tid = omp_get_thread_num();
    vector<double> &vec = outervec[tid];
    vec.clear();
    //then do work on vec, such as resize or push_back
}

When vec is resized, it might become quite large, and the number of trips through the for loop, N, might be large as well. Allocating memory in vectors is slow when it is done many times for large chunks of memory. The idea is to avoid having to deallocate and then reallocate the dynamically allocated memory stored in vec on every iteration. The concern is not the stack-allocated footprint of the vector object itself (which is small and quickly allocated), but the heap-allocated memory that the vector owns.
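
To make that last distinction concrete, here is a small illustration (the sizes are arbitrary) contrasting the fixed footprint of the vector object itself with the heap buffer it owns:

#include <cstdio>
#include <vector>

int main() {
    std::vector<double> vec;
    vec.resize(1000000);

    // The vector object itself (pointer, size, capacity) is tiny...
    std::printf("vector object footprint: %zu bytes\n", sizeof(vec));
    // ...while the buffer it owns is a large heap allocation - this is what is
    // expensive to throw away and reallocate on every iteration.
    std::printf("heap buffer it owns:     %zu bytes\n", vec.capacity() * sizeof(double));
    return 0;
}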


Solution

  • It's actually very simple: #pragma omp parallel for is a combined construct - you can split it into its two parts:

    #pragma omp parallel
    {
        std::vector<double> vec;
        #pragma omp for
        for (int i = 0;i<N;i++) {
            vec.clear();
            //then do work on vec, such as resize or push_back
        }
    }
    

    This will work just fine as you expect.
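
    For completeness, here is a minimal compilable sketch of that pattern (the per-iteration "work" - filling the vector and storing its size - is just a placeholder assumption):

    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 1000;
        std::vector<double> results(N);

        #pragma omp parallel
        {
            std::vector<double> vec;          // one vector per thread, constructed once
            #pragma omp for
            for (int i = 0; i < N; i++) {
                vec.clear();                  // keeps whatever capacity earlier iterations built up
                for (int j = 0; j <= i % 100; j++)
                    vec.push_back(j);         // placeholder work
                results[i] = vec.size();      // each i is handled by exactly one thread
            }
        }
        std::printf("results[N-1] = %f\n", results[N - 1]);
        return 0;
    }

    Compile with your compiler's OpenMP flag (e.g. -fopenmp for GCC/Clang).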

    This is a clear case where the "clean" solution of initializing the vector on every iteration carries a performance penalty that will often be relevant.

    Sometimes you might want to use the #pragma omp threadprivate directive instead, if this "cache" vector is a global or static variable, as sketched below.
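
    A hedged sketch of that variant (the cache_vec name and the surrounding function are made up here; note also that compiler support for threadprivate on objects with non-trivial constructors, like std::vector, varies, so check your toolchain):

    #include <vector>

    static std::vector<double> cache_vec;      // file-scope "cache" vector
    #pragma omp threadprivate(cache_vec)       // each thread gets its own copy

    void process(int N) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            cache_vec.clear();                 // per-thread buffer, capacity is reused across iterations
            // then do work on cache_vec, such as resize or push_back
        }
    }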

    Your suggested solution:

    std::vector<std::vector<double>> outervec;
    outervec.resize(omp_get_max_threads());
    
    #pragma omp parallel for shared(outervec)
    for (int i = 0;i<N;i++) {
        int tid = omp_get_thread_num();
        auto& vec = outervec[tid];
        vec.clear();
        //then do work on vec, such as resize or push_back
    }
    

    will work but has another huge performance issue. It will very likely introduce false sharing - the pointer, size, and capacity fields of several threads' std::vector objects end up in the same cache line. If these are modified frequently, e.g. by push_back, performance will suffer. This could easily be worse than the "clean" solution. (One common mitigation is sketched below.)
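
    If you did want to keep per-thread vectors in an outer container anyway, a common mitigation is to pad each slot out to a cache line so that different threads' vector headers don't share one. A rough sketch, assuming a 64-byte cache line and C++17 aligned allocation (this is a workaround, not the solution recommended above):

    #include <vector>
    #include <omp.h>

    struct alignas(64) PaddedVec {             // one cache line per slot (64 bytes is a common size)
        std::vector<double> vec;
    };

    void process(int N) {
        std::vector<PaddedVec> outervec(omp_get_max_threads());

        #pragma omp parallel for shared(outervec)
        for (int i = 0; i < N; i++) {
            auto& vec = outervec[omp_get_thread_num()].vec;
            vec.clear();
            // then do work on vec, such as resize or push_back
        }
    }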

    If you must, for some reason, bring in the vector from the outside, make private copies with firstprivate, e.g.:

    std::vector<double> vec;
    #pragma omp parallel for firstprivate(vec)
    for (int i = 0;i<N;i++) {
        vec.clear();
        //then do work on vec, such as resize or push_back
    }
    

    Don't use private(vec): unlike firstprivate, it does not copy the outer vec - for a C++ class type each thread just gets a default-constructed (empty) vector, and for plain types the variable would be left uninitialized - so nothing from outside is actually brought in.