As a first hands-on adventure into lock-free code (I've only read about it up until now), I'd like to try and build a lock-free reference-counting wrapper for IDisposable classes.
Here is the actual lock-free nested class:
private sealed class Wrapper
{
    public T WrappedObject { get; private set; }

    private int refCount;

    public Wrapper(T objectToWrap)
    {
        WrappedObject = objectToWrap;
        refCount = 1;
    }

    public void RegisterShare()
    {
        Interlocked.Increment(ref refCount);
    }

    public bool TryRegisterShare()
    {
        int prevValue;
        do
        {
            prevValue = refCount;
            if (prevValue == 0)
                return false;
        } while (prevValue != Interlocked.CompareExchange(ref refCount, prevValue + 1, prevValue));
        return true;
    }

    public void UnregisterShare()
    {
        if (Interlocked.Decrement(ref refCount) <= 0)
            WrappedObject.Dispose();
    }
}
It's a private nested class so I can ensure the methods only get called for the following purposes:

- RegisterShare for duplicating strong references
- UnregisterShare for releasing strong references
- TryRegisterShare for promoting weak references into strong ones

I believe I got the basic idea, I'm just not sure if this is truly thread-safe. A question that crosses my mind: in TryRegisterShare, is the first assignment to prevValue guaranteed to be larger than zero unless all strong references were released? Do I need some fencing or volatiles?
I don't believe the outer class that handles sharing of the references is important for this, but you can find it here if anyone's interested: https://codepaste.net/zs7nbh
Here's the modified code taking into account what @PeterCordes had to say.
private sealed class Wrapper
{
    public T WrappedObject { get; private set; }

    private int refCount;

    public Wrapper(T objectToWrap)
    {
        WrappedObject = objectToWrap;
        refCount = 1;
    }

    public void RegisterShare()
    {
        Interlocked.Increment(ref refCount);
    }

    public bool TryRegisterShare()
    {
        // Unconditional increment: a live object always has refCount >= 1.
        // A dead object's count is poisoned to int.MinValue below, so the
        // result stays far below zero even after this increment.
        return Interlocked.Increment(ref refCount) > 0;
    }

    public void UnregisterShare()
    {
        // Only the thread that wins the CompareExchange (moving the count
        // from 0 to int.MinValue) may Dispose, so Dispose runs exactly once
        // even if a double-release bug corrupts the count.
        if (Interlocked.Decrement(ref refCount) == 0
            && Interlocked.CompareExchange(ref refCount, int.MinValue, 0) == 0)
        {
            WrappedObject.Dispose();
        }
    }
}
Caveat: I don't know C#, but I do know C++ with std::atomic and atomic stuff in x86 asm, and your code (and the function/method names) seem pretty readable/clear so I think I understand what's going on.
You're implementing something similar to C++11's std::shared_ptr, so you could look at implementations of it for inspiration. (This class is like the back-end control block that separate shared_ptr objects sharing the same pointer refer to.)
If two threads both run UnregisterShare and bring the ref count down below zero, they'll both try to .Dispose(). That's a bug, similar to a double-free or a double-unlock. You could check for it, or paper it over by changing the code to == instead of <= so only one thread runs .Dispose(). <= looks like the worst of both worlds: misbehaviour that could be hard to identify.
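A sketch of the checked variant (== so only one thread Disposes, plus a loud failure on double-release; the choice of exception is mine):

public void UnregisterShare()
{
    // With refCount == 1 and two (buggy) callers: the first decrement yields 0
    // and Disposes; the second yields -1. "<= 0" would Dispose twice; "== 0"
    // Disposes once, and the check below turns the bug into a loud failure.
    int newValue = Interlocked.Decrement(ref refCount);
    if (newValue == 0)
        WrappedObject.Dispose();
    else if (newValue < 0)
        throw new ObjectDisposedException(typeof(T).Name,
            "reference count went negative: double release");
}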
"in TryRegisterShare, is the first assignment to prevValue guaranteed to be larger than zero unless all strong references were released?"
Barring bugs like failure to call RegisterShare when taking a reference, or double-release bugs, yes, I think so. It would be prudent to use if (prevValue <= 0) return false there, to make sure you bail out on double-release situations where something has left the refcount negative.
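A minimal sketch of that hardened loop (same structure as your original, only the bail-out test widened):

public bool TryRegisterShare()
{
    int prevValue;
    do
    {
        prevValue = refCount;    // re-read before every CAS attempt
        if (prevValue <= 0)
            return false;        // dead (0) or corrupted by a double-release (< 0)
    } while (prevValue != Interlocked.CompareExchange(ref refCount, prevValue + 1, prevValue));
    return true;
}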
A cmpxchg loop doesn't seem ideal, but if you unconditionally increment and just check whether you must have started at zero, that could fool other threads (e.g. this sequence of events: the count is 0 and the object is about to be Disposed; thread A's unconditional increment in TryShare bumps it to 1; thread B's TryShare then increments 1 to 2, sees a non-zero old value, and wrongly concludes it got a valid strong reference).
I haven't looked at how (the Linux / gcc implementation of) C++11 weak_ptr.lock() implements promotion to a shared_ptr (and I'm curious now!).
I wonder if they use cmpxchg in a loop (in the asm) or if they do some kind of dance that avoids that. e.g. maybe UnShare, upon detecting refcount == 0, could use a cmpxchg loop to modify the refcount from zero to -2^31 before calling Dispose. (And if it detects refcount >= 1 in that loop, it stops trying to kill it.) Then I think TryShare would succeed as long as it saw Interlocked.Increment(ref refCount) >= 1, since that means that any running UnShare's cmpxchg hasn't already succeeded, and won't succeed.
For some use-cases, it might not be desirable for TryShare to succeed in that case. You could just have it call UnShare to decrement the count again if its increment found the old value was zero.
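A sketch of that back-off variant, assuming the poison-value UnShare described above (the 0 -> int.MinValue CompareExchange; naming and structure are mine):

public bool TryRegisterShare()
{
    int newValue = Interlocked.Increment(ref refCount);
    if (newValue > 1)
        return true;                 // old value was >= 1: object definitely alive

    if (newValue == 1)
    {
        // Old value was 0: a concurrent UnShare is mid-kill. Our increment makes
        // its CompareExchange fail, so we *could* claim the object; here we back
        // off instead and let the kill proceed.
        UnregisterShare();           // 1 -> 0, then the 0 -> int.MinValue path runs
        return false;
    }

    // Old value was negative: the count is already poisoned and Dispose has run
    // (or is about to). Undo our increment and report failure.
    Interlocked.Decrement(ref refCount);
    return false;
}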
So I think there needs to be a cmpxchg loop somewhere, but if my logic is correct, you can get away with putting it in the refcount-drops-to-zero path of UnShare instead of in the presumably-hotter TryRegisterShare.
RegisterShare looks safe if you can absolutely guarantee that the refcount wasn't zero beforehand. This should be the case in a thread which already has a ref. For bug-checking, you could check the return value of the increment to catch cases where you accidentally reanimate a dead object (i.e. one that another thread is about to Dispose, or may already have Disposed).
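For example (Debug.Assert is from System.Diagnostics and only fires in debug builds):

public void RegisterShare()
{
    int newValue = Interlocked.Increment(ref refCount);
    // The caller must already hold a strong reference, so the count was at
    // least 1 before this call and must be at least 2 now.
    Debug.Assert(newValue >= 2, "RegisterShare reanimated a dead object");
}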
In C++, if multiple threads were reading and writing the same shared_ptr, this assumption (that the count can't be zero while you hold a reference) could be violated. It would require a std::atomic<std::shared_ptr<int>> foo; for that to be safe, though, and that won't compile because shared_ptr is not trivially copyable.
So C++ protects itself from unlocked concurrent access to the reference wrapper objects (by declaring it Undefined Behaviour); you should, too. Otherwise another thread with a reference to the same shared_ptr object this thread is copying might have called .reset() on it, and so might decrement the refcount in the control block after this thread reads the pointer to the control block, but right before this thread increments the control-block refcount.
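Translated into this design, the race would look something like this (the field and method names on the outer class are hypothetical, since it's only available behind the link above):

// Thread A copies a shared reference while thread B resets the *same* instance:
//
//   Thread A: Wrapper w = sharedRef.wrapper;   // read the control-block reference
//   Thread B: sharedRef.Release();             // UnregisterShare(): 1 -> 0, Dispose()
//   Thread A: w.RegisterShare();               // increments a dead object's count
//
// As in C++, each individual shared-reference object must only be used by one
// thread at a time; only the Wrapper (control block) itself is lock-free.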
I just noticed that the title question is about needing extra fences or volatiles:
The decrement in UnShare needs to become globally visible before anything that .Dispose() does, and it needs to be a release operation to make sure loads/stores into a shared object become globally visible before the decrement of the refcount.
Your code already achieves that, because Interlocked.anything has sequential-consistency semantics in .NET on any architecture (not just on Windows), so nothing can reorder across it in either direction.
AlexRP's blog post about the .NET memory model explains lots of detail about this and related things (Interlocked, Thread.VolatileRead/Write, Volatile.Read/Write, and the fact that simple loads and stores on types up to the width of IntPtr.Size are automatically atomic, but not synchronized). Fun fact: in Microsoft's implementation for x86, Thread.VolatileRead and Write are surrounded on either side by MFENCE instructions, but the language standard only requires acquire or release semantics (which x86 gives you for free, with no barrier). Mono doesn't use barriers for these functions, but MS can't remove them until everyone's buggy code stops relying on that implementation detail.
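To illustrate the atomic-but-not-synchronized distinction, a small sketch (not part of the wrapper; the names are made up):

private int data;
private bool ready;

void Producer()
{
    data = 42;                         // plain store: atomic, but unordered
    Volatile.Write(ref ready, true);   // release: guarantees 'data' is visible first
}

void Consumer()
{
    if (Volatile.Read(ref ready))      // acquire: pairs with the release above
        Console.WriteLine(data);       // prints 42, never a stale value
}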
Other than the decrement, I think everything else just needs atomicity for your ref-counting, not synchronization of any other operations. The ref-counting doesn't synchronize access to the shared object from multiple threads, so a TryShare/UnShare pair doesn't form a critical section (which would need acquire/release semantics). From the constructor to the final UnShare that results in a .Dispose(), this class is hands-off as far as synchronization goes. In C++ I think you could use memory_order_relaxed for the RegisterShare increment and the TryShare cmpxchg; on some weakly-ordered architectures, like ARM, this would require fewer barriers. (That's not an option in C#: you only have the sequential-consistency Interlocked operations, which require full barriers on ARM. There's no extra cost on x86, though, since lock'ed read-modify-write instructions on x86 are already full memory barriers like MFENCE, the equivalent of Thread.MemoryBarrier.)