Tags: c++, multithreading, visual-c++, atomic, stdatomic

Busy polling std::atomic - msvc optimizes loop away - why, and how to prevent?

I'm trying to implement a simple busy loop function.

It should poll a std::atomic variable up to a maximum number of times (spinCount), and return true if the status changed (to anything other than NOT_AVAILABLE) within the given number of tries, or false otherwise:

// noinline is just to be able to inspect the resulting ASM a bit easier - in final code, this function SHOULD be inlined!
__declspec(noinline) static bool trySpinWait(std::atomic<Status>* statusPtr, const int spinCount)
{
    int iSpinCount = 0;
    while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
    return iSpinCount == spinCount;
}

However, it seems that MSVC just optimizes the loop away in Release mode for Win64. I'm pretty bad with assembly, but it doesn't look to me like it's ever even trying to read the value of statusPtr at all:

int iSpinCount = 0;
000000013F7E2040  xor         eax,eax  
    while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
000000013F7E2042  inc         eax  
000000013F7E2044  cmp         eax,edx  
000000013F7E2046  jge         trySpinWait+12h (013F7E2052h)  
000000013F7E2048  mov         r8d,dword ptr [rcx]  
000000013F7E204B  test        r8d,r8d  
000000013F7E204E  je          trySpinWait+2h (013F7E2042h)  
    return iSpinCount == spinCount;
000000013F7E2050  cmp         eax,edx  
000000013F7E2052  sete        al

My impression was that std::atomic with std::memory_order_seq_cst creates a compiler barrier that should prevent something like this, but it seems that's not the case (or rather, my understanding was probably wrong).

What am I doing wrong here, or rather: how can I best implement that loop without having it optimized away, with the least impact on overall performance?

I know I could use #pragma optimize( "", off ), but (unlike in the example above) in my final code I'd very much like to have this call inlined into a larger function for performance reasons. It seems that this #pragma will generally prevent inlining, though.

Appreciate any thoughts!

Thanks

Solution

but doesn't look to me like it's ever even trying to read the value of statusPtr at all

It does reload it on every iteration of the loop:

000000013F7E2048  mov         r8d,dword ptr [rcx] ; rcx holds statusPtr (first argument in the Win64 calling convention)

My impression was that std::atomic with std::memory_order_seq_cst creates a compiler barrier that should prevent something like this,

You do not need anything stronger than std::memory_order_relaxed here, because there is only one variable shared between threads (what is more, this code never modifies the atomic variable). With a single shared variable there are no other memory accesses whose order relative to the load could matter, so there are no reordering concerns.
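To illustrate the point about std::memory_order_relaxed, here is a minimal sketch of the same loop with a relaxed load. The Status enum is not shown in the question, so the definition below is a hypothetical stand-in (NOT_AVAILABLE is assumed to be 0, which would match the test r8d,r8d in the disassembly), and trySpinWaitRelaxed is an illustrative name. Note that the posted function returns iSpinCount == spinCount, which is true when the tries ran out; the sketch returns true when the status changed, as the question describes the intent. It also polls exactly spinCount times, whereas the posted loop polls at most spinCount - 1 times. Relaxed is only enough under the assumption stated above: if the caller goes on to read other data that the writing thread published before setting the status, the load would need std::memory_order_acquire.

```cpp
#include <atomic>

// Hypothetical stand-in: the question does not show the real Status enum.
enum class Status : int { NOT_AVAILABLE = 0, AVAILABLE = 1 };

// Polls *statusPtr at most spinCount times.
// Returns true as soon as a poll sees anything other than NOT_AVAILABLE,
// and false if every poll saw NOT_AVAILABLE (or spinCount <= 0).
inline bool trySpinWaitRelaxed(const std::atomic<Status>* statusPtr, int spinCount)
{
    for (int i = 0; i < spinCount; ++i) {
        // Still an atomic load; only the ordering guarantees relative to
        // other memory accesses are dropped.
        if (statusPtr->load(std::memory_order_relaxed) != Status::NOT_AVAILABLE)
            return true;
    }
    return false;
}
```

On x86-64 this should not change the generated loop much, since a sequentially consistent load is already an ordinary mov there; the difference matters more on weakly ordered architectures.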

In other words, this function works as expected.

You may also want to put a PAUSE instruction in the loop body; see "Benefitting Power and Performance Sleep Loops".
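A rough sketch of what a PAUSE-based spin could look like, using the _mm_pause() intrinsic from <immintrin.h>, which MSVC, GCC and Clang provide on x86. The Status enum, the cpuRelax helper and the function name trySpinWaitPause are hypothetical names made up for this example, and the non-x86 fallback is simply a no-op:

```cpp
#include <atomic>

#if defined(_M_X64) || defined(_M_IX86) || defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>  // _mm_pause
#define SPIN_HAS_PAUSE 1
#else
#define SPIN_HAS_PAUSE 0
#endif

// Hypothetical stand-in: the question does not show the real Status enum.
enum class Status : int { NOT_AVAILABLE = 0, AVAILABLE = 1 };

// Hypothetical helper: hints to the CPU that we are in a spin-wait loop.
inline void cpuRelax()
{
#if SPIN_HAS_PAUSE
    _mm_pause();  // emits the PAUSE instruction
#endif
    // On other targets this is a no-op in this sketch.
}

// Polls *statusPtr at most spinCount times, pausing between polls.
// Returns true if a poll saw anything other than NOT_AVAILABLE.
inline bool trySpinWaitPause(const std::atomic<Status>* statusPtr, int spinCount)
{
    for (int i = 0; i < spinCount; ++i) {
        if (statusPtr->load(std::memory_order_relaxed) != Status::NOT_AVAILABLE)
            return true;
        cpuRelax();  // back off briefly before the next poll
    }
    return false;
}
```

Keep in mind that the duration of PAUSE differs between CPU generations, so a fixed spinCount corresponds to different wall-clock times on different machines.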