I'm using an open source off-heap cache implementation, ohc.
Recently, I found a pull request on GitHub that suggests replacing
spin-on-compare-and-set with spin-on-read.
Here is the code change. It adds only one line, while (lockFieldUpdater.get(this) != 0L), so the loop becomes:

while (true)
{
    if (lockFieldUpdater.compareAndSet(this, 0L, t))
        return true;
    while (lockFieldUpdater.get(this) != 0L) // the added line
        Thread.yield();
}
I compiled it and tested it with the benchmark tool; locally it showed only a modest (~6%) speedup.
Then I used it in production: the original average read time was about 35,000 nanoseconds, while the new version takes only about 10,000 nanoseconds.
What is the difference between these two implementations? Why is spin-on-read so much faster in this case?
In order to understand why the performance improves, it helps to know a little bit about cache coherency protocols. The original version relies solely on a heavy-handed read-modify-write operation (the CAS) and yields if it fails. It's heavy-handed because every CAS attempt generates quite a bit of cache-coherence traffic: the core must obtain exclusive ownership of the cache line, invalidate the copies held by the other cores, and so on. This naive approach leads to a whole lot of contention as the number of threads increases.
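To make that concrete, here is a minimal sketch of the naive spin-on-CAS pattern. This is not ohc's actual code; NaiveSpinLock is a hypothetical class used only to illustrate the point that every failed attempt still pays for a full CAS:

import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: a naive test-and-set (spin-on-CAS) lock.
// Every iteration executes a CAS, which must acquire exclusive
// ownership of the cache line and invalidate the other cores'
// copies, even when the lock is obviously still held.
final class NaiveSpinLock
{
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock()
    {
        // Spin directly on the CAS: coherence traffic on every iteration.
        while (!locked.compareAndSet(false, true))
            Thread.yield();
    }

    void unlock()
    {
        locked.set(false);
    }
}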
The modified version is an improvement over the naive approach because it coordinates the threads' actions a bit better. Each thread spins on its own locally cached copy of the lock field, so it is only once that local copy has been invalidated (because the field was modified on another core) that the thread attempts a CAS again.
This is very similar to why test-and-test-and-set (TATAS) locks are an improvement over naive test-and-set (TAS) locks.
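For comparison, here is a TATAS version of the same hypothetical lock (again just an illustrative sketch, not ohc's code). Failed acquirers retry the CAS only after a plain read has observed the lock to be free; while the lock is held, the inner loop spins on the thread's locally cached copy and generates no coherence traffic:

import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: a test-and-test-and-set (TATAS) lock,
// the same pattern as the change in the pull request.
final class TatasSpinLock
{
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock()
    {
        while (!locked.compareAndSet(false, true))
        {
            // Spin on a plain read: this hits the local cached copy,
            // which is only invalidated when the owner releases the lock.
            while (locked.get())
                Thread.yield();
        }
    }

    void unlock()
    {
        locked.set(false);
    }
}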
As for why your local benchmark shows only a ~6% speedup while your production server sees a ~3.5x speedup, there are a few likely explanations, the main one being contention: the benefit of spinning on a read grows with the number of threads and cores fighting over the lock, and your production machine almost certainly sees far more concurrent readers than the benchmark run.