posix: interprocess lock abandoned, is there a better way?

I'm coding on AIX, but looking for a general 'nix solution, posix compliant ideally. Can't use anything in C++11 or later.

I have shared memory with many threads from many processes involved. The data in shared memory has to stay self-consistent, so I need a lock, to get everyone to take turns.

Processes crashing with the lock is a thing, so I have to be able to detect an abandoned lock, fix (aka reset) the data, and move on. Twist: deciding the lock is abandoned by waiting for it for some fixed period is not a viable solution.

A global mutex (either living in shared memory, or named) appears not to be a solution. There's no detection mechanism for abandonment (except timing), and even then you can't destroy and re-create the mutex without risking undefined behaviour.

So I opted for lockf() and a busy flag - get the file lock, set the flag in shared memory, do stuff, unset the flag, drop the lock. On a crash with the lock owned, the lock is automatically dropped, and the next guy to get it can see the busy flag is still set, and knows he has to clean up a mess.

This doesn't work - because lockf() will keep threads from other processes out, but it has special semantics for other threads in your own process. It lets them through unchecked.

In the end I came up with a two step solution - a local (thread) mutex and a file lock. Get the local mutex first; now you're the only thread in this process doing the next step, which is lockf(). lockf() in turn guarantees you're the only process getting through, so now you can set the busy flag and do the work. To unlock, go in reverse order: clear the busy flag, drop the file lock, drop the mutex lock. In a crash, the local mutex vanishes when the process does, so it's harmless.

Works fine. I hate it. Nesting two locks like this strikes me as expensive, and it takes a page worth of comments in the code to explain. (My next code review will be interesting.) I feel like I missed a better solution. What is it?

Edit: @Matt I probably wasn't clear. The busy flag isn't part of the locking mechanism; it's there to indicate when some process successfully acquired the lock(s). If, after acquiring the locks, you see the busy flag is already set, it means some other process got the locks and then crashed, leaving the shared memory it was in the middle of writing to in an incomplete state. In that case the thread now in possession of the lock gets the job of re-initializing the shared memory to a usable state. I probably should have called it a "memoryBeingModified" flag.

No variation of "tryLock" is going to be permissible. Polling is absolutely out of the question in this application. Threads that need to modify shared memory may only block on the locks (which are never held long) and have to take their turn as soon as the lock is available to them. They have to experience the minimum possible delay.

Solution

The simple answer is, there's no good solution. On AIX, lockf turns out to be extremely slow, for no good reason. But mutexes in shared memory, while very fast on any platform, are fragile (anyone can crash while holding the lock and there's no recovery for that). It would be nice if POSIX defined a "this mutex is held by a thread/process that died" error, but it doesn't, and even if there were such an error code, there's no way to repair things and continue. Using shared memory with multiple readers and writers continues to be the wild west.