I'm trying to implement a simple busy loop function.
This should keep polling a std::atomic variable for a maximum number of times (spinCount), and return true if the status did change (to anything other than NOT_AVAILABLE) within the given tries, or false otherwise:
// noinline is just to be able to inspect the resulting ASM a bit easier - in final code, this function SHOULD be inlined!
__declspec(noinline) static bool trySpinWait(std::atomic<Status>* statusPtr, const int spinCount)
{
int iSpinCount = 0;
while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
return iSpinCount == spinCount;
}
However, it seems that MSVC just opitmizes the loop away on Release mode for Win64. I'm pretty bad with Assembly, but doesn't look to me like it's ever even trying to read the value of statusPtr at all:
int iSpinCount = 0;
000000013F7E2040 xor eax,eax
while (++iSpinCount < spinCount && statusPtr->load() == Status::NOT_AVAILABLE);
000000013F7E2042 inc eax
000000013F7E2044 cmp eax,edx
000000013F7E2046 jge trySpinWait+12h (013F7E2052h)
000000013F7E2048 mov r8d,dword ptr [rcx]
000000013F7E204B test r8d,r8d
000000013F7E204E je trySpinWait+2h (013F7E2042h)
return iSpinCount == spinCount;
000000013F7E2050 cmp eax,edx
000000013F7E2052 sete al
My impression was that std::atomic with std::memory_order_sequential_cst creates a compiler barrier that should prevent something like this, but seems that's not the case (or rather, my understanding was probably wrong).
What am I doing wrong here, or rather - how can I best implement that loop without having it optimized away, with least impact on overall performance?
I know I could use #pragma optimize( "", off ), but (other than in the example above), in my final code I'd very much like to have this call inlined into a larger function for performance reasons. seems that this #pragma will generally prevent inlining though.
Appreciate any thoughts!
Thanks
but doesn't look to me like it's ever even trying to read the value of
statusPtr
at all
It does reload it on every iteration of the loop:
000000013F7E2048 mov r8d,dword ptr [rcx] # rcx is statusPtr
My impression was that
std::atomic
withstd::memory_order_sequential_cst
creates a compiler barrier that should prevent something like this,
You do not need anything more than std::memory_order_relaxed
here because there is only one variable shared between threads (even more, this code doesn't change the value of the atomic variable). There are no reordering concerns.
In other words, this function works as expected.
You may like to use PAUSE
instruction, see Benefitting Power and Performance Sleep Loops.