Search code examples
javamultithreadingmutexcritical-sectionlock-free

Lock-free guard for synchronized acquire/release


I have a shared tempfile resource that is divided into chunks of 4K (or some such value). Each 4K in the file is represented by an index starting from zero. For this shared resource, I track the 4K chunk indices in use and always return the lowest indexed 4K chunk not in use, or -1 if all are in use.

This ResourceSet class for the indices has a public acquire and release method, both of which use synchronized lock whose duration is about like that of generating 4 random numbers (expensive, cpu-wise).

Therefore as you can see from the code that follows, I use an AtomicInteger "counting semaphore" to prevent a large number of threads from entering the critical section at the same time on acquire(), returning -1 (not available right now) if there are too many threads.

Currently, I am using a constant of 100 for the tight CAS loop to try to increment the atomic integer in acquire, and a constant of 10 for the maximum number of threads to then allow into the critical section, which is long enough to create contention. My question is, what should these constants be for a moderate to highly loaded servlet engine that has several threads trying to get access to these 4K chunks?

public class ResourceSet {

    // ??? what should this be
    // maximum number of attempts to try to increment with CAS on acquire
    private static final int    CAS_MAX_ATTEMPTS = 50;

    // ??? what should this be
    // maximum number of threads contending for lock before returning -1 on acquire
    private static final int    CONTENTION_MAX = 10;

    private AtomicInteger        latch = new AtomicInteger(0);

    ... member variables to track free resources

    private boolean aquireLatchForAquire ()
    {
        for (int i = 0; i < CAS_MAX_ATTEMPTS; i++) {
            int val = latch.get();
            if (val == -1)
                throw new AssertionError("bug in ResourceSet");        // this means more threads than can exist on any system, so its a bug!
            if (!latch.compareAndSet(val, val+1))
                continue;
            if (val < 0 || val >= CONTENTION_MAX) {
                latch.decrementAndGet();
                // added to fix BUG that comment pointed out, thanks!
                return false;
            }
        }
        return false;
    }

    private void aquireLatchForRelease ()
    {
        do {
            int val = latch.get();
            if (val == -1)
                throw new AssertionError("bug in ResourceSet");    // this means more threads than can exist on any system, so its a bug!
            if (latch.compareAndSet(val, val+1))
                return;
        } while (true);
    }

    public ResourceSet (int totalResources)
    {
        ... initialize
    }

    public int acquire (ResourceTracker owned)
    {        
        if (!aquireLatchForAquire())
            return -1;

        try {
            synchronized (this) {
                ... algorithm to compute minimum free resoource or return -1 if all in use
                return resourceindex;
            }
        } finally {
            latch.decrementAndGet();
        }
    }

    public boolean release (ResourceIter iter)
    {
        aquireLatchForRelease();
        try {
            synchronized (this) {
                ... iterate and release all resources
            }
        } finally {
            latch.decrementAndGet();
        }
    }
}

Solution

  • Writting a good and performant spinlock is actually pretty complicated and requires a good understanding of memory barriers. Merely picking a constant is not going to cut it and will definitely not be portable. Google's gperftools has an example that you can look at but is probably way more complicated then what you'd need.

    If you really want to reduce contention on the lock, you might want to consider using a more fine-grained and optimistic scheme. A simple one could be to divide your chunks into n groups and associate a lock with each group (also called stripping). This will help reduce contention and increase throughput but it won't help reduce latency. You could also associate an AtomicBoolean to each chunk and CAS to acquire it (retry in case of failure). Do be careful when dealing with lock-free algorithms because they tend to be tricky to get right. If you do get it right, it could considerably reduce the latency of acquiring a chunk.

    Note that it's difficult to propose a more fine-grained approach without knowing what your chunk selection algorithm looks like. I also assume that you really do have a performance problem (it's been profiled and everything).

    While I'm at it, your spinlock implementation is flawed. You should never spin directly on a CAS because you're spamming memory barriers. This will be incredibly slow with any serious amount of contention (related to the thundering-herd problem). A minimum would be to first check the variable for availability before your CAS (simple if on a no barrier read will do). Even better would be to not have all your threads spinning on the same value. This should avoid the associated cache-line from ping-pong-ing between your cores.

    Note that I don't know what type of memory barriers are associated with atomic ops in Java so my above suggestions might not be optimal or correct.

    Finally, The Art Of Multiprocessor Programming is a fun book to read to get better acquainted with all the non-sense I've been spewing in this answer.