Tags: c++, windows, visual-c++, openmp

Is it safe to use omp_get_thread_num to index a global vector?


I have code like this:

#include <omp.h>
#include <vector>

thread_local CustomAllocator* ts_alloc = nullptr;

struct AllocatorSetup
{
    AllocatorSetup( int threadNum )
    {
        static std::vector<CustomAllocator> vec( (size_t)omp_get_max_threads() );
        ts_alloc = &vec.at( threadNum );
    }
    ~AllocatorSetup()
    {
        ts_alloc->resetArena();
        ts_alloc = nullptr;
    }
    AllocatorSetup() = delete;
    AllocatorSetup( const AllocatorSetup& ) = delete;
    void operator=( const AllocatorSetup& ) = delete;
};

template<class E>
inline E* allocateBuffer( size_t count )
{
    return (E*)ts_alloc->allocate( count * sizeof( E ), alignof( E ) );
}

void computeThings()
{
#pragma omp parallel for
    for( int64_t i = 0; i < 100000; i++ )
    {
        AllocatorSetup allocSetup{ omp_get_thread_num() };
        float* const buffer = allocateBuffer<float>( 1024 * 256 );
        // ..some computations here
    }
}

Is it going to break when multiple threads call computeThings() concurrently?

In other words, at any given time, is the relation between omp_get_thread_num() index and native threads one-to-one or one-to-many?


Solution

  • The documentation spells this out.

    The omp_get_thread_num routine returns the thread number, within the current team, of the calling thread.

    The binding thread set for an omp_get_thread_num region is the current team. The binding region for an omp_get_thread_num region is the innermost enclosing parallel region.

    The omp_get_thread_num routine returns the thread number of the calling thread, within the team that is executing the parallel region to which the routine region binds. The thread number is an integer between 0 and one less than the value returned by omp_get_num_threads, inclusive. The thread number of the master thread of the team is 0. The routine returns 0 if it is called from the sequential part of a program.

    Emphasis mine.

    This means that if you have multiple teams executing concurrently, the threads in each team are numbered 0 .. nbr_threads_in_team - 1. Multiple threads will therefore get the same thread number if they are in different concurrently running teams.

    While OpenMP certainly re-uses threads, if it ever actually runs two parallel regions concurrently, you will have two teams of threads running at the same time. Since a single thread can't do two things at once, the threads in the two teams must be distinct, yet the threads in each team are still numbered 0 .. nbr_threads_in_team - 1.

    This means that your computeThings() function is not thread-safe at the moment. You could instead build a thread-safe pool of CustomAllocator objects, from which each thread "lends" an object and returns it when the thread is done with it.

    Something like this:

    void computeThings() {
    #pragma omp parallel
        {
            //Since we're in a parallel block, each thread
            //will lend its own CustomAllocator
            CustomAllocator& theAllocator = allocatorPool.lend();
    
    #pragma omp for
            for (...) {
                /* Do stuff with the allocator */
            }
            
            allocatorPool.unLend(theAllocator);
        }
    }
    

    In production code, lend() and unLend() should of course be implemented using RAII to avoid leaks in case of an exception.

    Note that this version acquires a resource only once, before the entire parallel loop, rather than in each iteration, so you only pay for the cache misses once per call. (Your data is unlikely to stay in L1 between calls to computeThings unless you call that function itself in a loop.) If you do need to be nice to the L1 even across multiple invocations of the function, you could associate each resource with a thread ID and preferentially hand a resource back to the thread that used it previously, as long as that "preferred resource" is still available.

    Here's an example of such a cache that tries to give each requesting thread its preferred resource instance:

    //Warning, untested!
    #include <cassert>
    #include <memory>
    #include <mutex>
    #include <thread>
    #include <unordered_map>
    
    template<typename T>
    class ThreadAffineCache {
        std::unordered_map<std::thread::id, std::unique_ptr<T>> m_cache = {};
        std::mutex m_mtx{};
        
    public:
        std::unique_ptr<T> lend() {
            std::unique_lock lk{m_mtx};
            auto tid = std::this_thread::get_id();
            std::unique_ptr<T> result = std::move(m_cache[tid]);
            m_cache.erase(tid);
            if (!result) {
                if (m_cache.empty()) {
                    result = std::make_unique<T>();
                } else {
                    auto iter = m_cache.begin();
                    result = std::move(iter->second); // move the mapped pointer, not the pair
                    m_cache.erase(iter);
                }
            }
            assert(result);
            return result;
        }
        
        void unLend(std::unique_ptr<T> obj) {
            assert(obj);
            std::unique_lock lk{m_mtx};
            m_cache[std::this_thread::get_id()] = std::move(obj);
        }
    };
    

    Alternatively, you could key the map by the value returned by pthread_self or gettid. I'm not sure whether std::thread::id yields something sensible for OpenMP worker threads, and cleaning up when threads terminate might also be a problem.

    If you just need an easy way out, you could also protect computeThings with a mutex.