I wrote a simple lock-free node stack (Delp[hi XE4, Win7-64, 32-bit app) where I can have multiple 'stacks' and pop/push nodes between them from various threads simultaneously. It works 99.999% of the time but eventually glitch under a stress test using all CPU cores.
Stripped-down, it comes down to this (not real/compiled code): Nodes are :
type POBNode = ^TOBNode;
[volatile]TOBNode = record
[volatile]Next : POBNode;
Data : Int64;
end;
Simplified stack :
type TOBStack = class
private
[volatile]Head:POBNode;
function Pop:POBNode;
procedure Push(NewNode:POBNode);
end;
procedure TOBStack.Push(NewNode:POBNode);
var zTmp : POBNode;
begin;
repeat
zTmp:=InterlockedCompareExchangePointer(Pointer(Head),nil,nil);(memory fenced-read*)
NewNode.Next:=zTmp;
if InterlockedCompareExchangePointer(Head,NewNode,zTmp)=zTmp
then break (*success*)
else continue;
until false;
end;
function TOBStack.Pop:POBNode;
begin;
repeat
Result:=InterlockedCompareExchangePointer(Pointer(Head),nil,nil);(memory fenced-read*)
if Result=nil
the exit;
NewHead:=Result.Next;
if InterlockedCompareExchangePointer(Pointer(Head),NewHead,Result)=Result
then break (*Success*)
else continue;(*Fail, try again*)
until False;
end;
I have tried many variations on this but cannot get it to be stable. If I create a few threads that each have a stack and they all push/pop to/from a global stack, it eventually glitch, but not quickly. Only when I stress it for minutes on end from multiple threads, in tight loops.
I cannot have hidden bugs in my code, so I need more advice than to ask a specific question to get this running 100% error-free, 24/7. Does the code above look fine for a lock-free thread-safe stack? What else can I look at? This is impossible to debug normally as the errors occur at various places, telling me there is a pointer or RAM corruption happening somewhere. I also get duplicate entries, meaning that a node that was popped of one stack, then later returned to that same stack, is still on top of the old stack... Impossible to happen according to my algorithm? This lead me to believe it's possible to violate Delphi/Windows InterlockedCompareExchange Methods or there is some hidden knowledge I have yet to be revealed. :) (I also tried TInterlocked)
I have made a complete test case that can be copied from ftp.true.co.za. In there I run 8 threads doing 400 000 push/pops each and it usually crashes (safely due to checks/raised exceptions) after a few cycles of these tests, sometimes many many test cycles complete before one suddenly crash.
Any advice would be appreciated.
Regards Anton E
At first I was skeptical of this being an ABA problem as indicated by gabr. It seemed to me that: if one thread looks at the current Head
, and another thread pushes then pops; you're happy to still operate on the same Head
in the same way.
However, consider this from your Pop
method:
NewHead:=Result.Next;
if InterlockedCompareExchangePointer(Pointer(Head),NewHead,Result)=Result
NewHead
is stored in a local variable.Next
before the first thread resumes.NewHead
value from the local variable.NewHead
is incorrect, thereby corrupting your stack.There's a subtle variation on this problem not even covered by your test app. This problem isn't checked in your test app because you aren't destroying any nodes until the end of your test.
Head
.Apart form the above...
Looking at your test app there's also some really dodgy code. E.g.
You generate a "random number": J:=GetTickCount and 7;(*Get a 'random' number 0..7*)
.
GetTickCount
will generate reams of duplicates in a tight loop?You're allocating memory of a hard-coded size: GetMem(zTmp,12);(*Allocate new node*)
.
SizeOf
?Right now, given these two examples, I wouldn't be entirely confident that there isn't also an error in your test code.