Tags: multithreading, openmp, stdatomic

User-defined atomic less-than


I've been reading, and it seems that std::atomic doesn't support a less-than/greater-than variant of compare-and-swap.

I'm using OpenMP and need to safely update a global minimum value. I expected this to be as easy as calling a built-in API, but alas, it isn't, so instead I'm trying to come up with my own implementation.

My primary concern is that I don't want to use an omp critical section to do the less-than comparison every single time, because it would incur significant synchronization overhead for very little gain in most cases.

But in the (rarer) cases where a new global minimum is potentially found, that synchronization overhead is acceptable. I'm thinking I can implement it using the following method, and I'm hoping someone can advise.

  1. Use an std::atomic_uint as the global minimum.
  2. Atomically read its value into a thread-local variable.
  3. Compare the thread's candidate against it, and if the candidate is less, attempt to enter a critical section.
  4. Once synchronized, verify that the candidate is still less than the atomic value and update accordingly (the body of the critical section should be cheap: just update a few values).

This is for a homework assignment, so I'm trying to keep the implementation my own. Please don't recommend various libraries to accomplish this. But please do comment on the synchronization overhead that this operation can incur or if it's bad, elaborate on why. Thanks.


Solution

  • What you're looking for would be called fetch_min() if it existed: fetch old value and update the value in memory to min(current, new), exactly like fetch_add but with min().

    This operation is not directly supported in hardware on x86, but machines with LL/SC could emit slightly more efficient asm for it than you get from emulating it with a CAS( old, min(old,new) ) retry loop.

    You can emulate any atomic operation with a CAS retry loop. In practice it usually doesn't have to retry, because the CPU that succeeded at doing a load usually also succeeds at CAS a few cycles later after computing whatever with the load result, so it's efficient.

    See Atomic double floating point or SSE/AVX vector load/store on x86_64 for an example of creating a fetch_add for atomic<double> with a CAS retry loop, in terms of compare_exchange_weak and plain + for double. Do that with min and you're all set.
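    That CAS-retry idea can be sketched as a small helper. `fetch_min` is a hypothetical name here (it is not part of `<atomic>`); this is just the fetch_add-for-double pattern from the linked answer, with min instead of +:

```cpp
#include <atomic>

// Hypothetical fetch_min(): emulate an atomic min with a
// compare_exchange_weak retry loop, like fetch_add for atomic<double>.
unsigned fetch_min(std::atomic<unsigned> &a, unsigned newval)
{
    unsigned old = a.load(std::memory_order_relaxed);
    // Keep trying while the value in memory is still greater than newval.
    // On failure, compare_exchange_weak reloads `old` with the current value,
    // so the loop also stops once another thread has stored something <= newval.
    while (newval < old && !a.compare_exchange_weak(old, newval)) {
    }
    return old;   // like fetch_add: return the previous value
}
```

    As with fetch_add, the return value tells the caller what was there before: newval actually became the new minimum if and only if the returned value is greater than newval.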


    Re: clarification in comments: I think you're saying you have a global minimum, but when you find a new one, you want to update some associated data, too. Your question is confusing because "compare and swap on less/greater than" doesn't help you with that.

    I'd recommend using atomic<unsigned> globmin to track the global minimum, so you can read it to decide whether or not to enter the critical section and update related state that goes with that minimum.

    Only ever modify globmin while holding the lock (i.e. inside the critical section); then you can update it and the associated data together. It still has to be atomic<> so that readers looking at just globmin outside the critical section don't have data-race UB. Readers that look at the associated extra data must take the lock, which protects it and ensures that updates of globmin + the extra data happen "atomically" from the perspective of readers that obey the lock.

    static std::atomic<unsigned> globmin;
    std::mutex globmin_lock;
    static struct Extradata globmin_extra;
    
    void new_min_candidate(unsigned newmin, const struct Extradata &newdata)
    {
        // light-weight early out check to avoid the critical section
        // No ordering requirement as long as globmin is monotonically decreasing with time
        if (newmin < globmin.load(std::memory_order_relaxed))
        {
           // enter a critical section.  Use OpenMP stuff if you want, this is plain ISO C++
           std::lock_guard<std::mutex> lock(globmin_lock);
    
           // Check globmin again, after we've excluded other threads from modifying it and globmin_extra
           if (newmin < globmin.load(std::memory_order_relaxed)) {
               globmin.store(newmin, std::memory_order_relaxed);
               globmin_extra = newdata;
           }
           // else leave the critical section with no update:
           // another thread raced with us *outside* the critical section
    
          // release the lock / leave critical section (lock goes out of scope here: RAII)
        }
        // else do nothing
    }
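
    For completeness, the same pattern can be exercised with a self-contained sketch like the one below, using a made-up `Extradata` payload and arbitrary candidate values (with OpenMP enabled the pragma parallelizes the loop; without it the pragma is ignored and the code runs correctly single-threaded):

```cpp
#include <atomic>
#include <limits>
#include <mutex>

struct Extradata { unsigned source_index; };   // placeholder payload

static std::atomic<unsigned> globmin{std::numeric_limits<unsigned>::max()};
static std::mutex globmin_lock;
static Extradata globmin_extra;

// Same double-checked pattern as above, repeated so this snippet compiles on its own.
void new_min_candidate(unsigned newmin, const Extradata &newdata)
{
    if (newmin < globmin.load(std::memory_order_relaxed)) {
        std::lock_guard<std::mutex> lock(globmin_lock);
        if (newmin < globmin.load(std::memory_order_relaxed)) {
            globmin.store(newmin, std::memory_order_relaxed);
            globmin_extra = newdata;
        }
    }
}

// Feed it candidates; (i*7919)%1000+1 hits its unique minimum of 1 at i==1000.
void run_candidates()
{
    #pragma omp parallel for
    for (int i = 1; i <= 1000; ++i) {
        Extradata d{static_cast<unsigned>(i)};
        new_min_candidate(i * 7919u % 1000u + 1, d);
    }
}
```

    After run_candidates(), globmin is 1 and globmin_extra.source_index is 1000 whether the loop ran in parallel or not, because the second check under the lock only lets a strictly smaller candidate through.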
    

    std::memory_order_relaxed is sufficient for globmin: there's no ordering required with anything else, just atomicity. We get atomicity / consistency for the associated data from the critical section/lock, not from memory-ordering semantics of loading / storing globmin.

    This way the only atomic read-modify-write operation is the locking itself. Everything on globmin is either load or store (much cheaper). The main cost with multiple threads will still be bouncing the cache line around, but once you own a cache line, each atomic RMW is maybe 20x more expensive than a simple store on modern x86 (http://agner.org/optimize/).

    With this design, if most candidates aren't lower than globmin, the cache line will stay in the Shared state most of the time, so the globmin.load(std::memory_order_relaxed) outside the critical section can hit in L1D cache. It's just an ordinary load instruction, so it's extremely cheap. (On x86, even seq_cst loads are just ordinary loads, and release stores are just ordinary stores; only seq_cst stores are more expensive. On other architectures where the default ordering is weaker, seq_cst / acquire loads need a barrier.)