Search code examples
databasedistributed-system

Identical messages committed during a network partition


I'm working on a distributed database. I'm in a situation where, during the healing of a partition (nodes are beginning to recognize the ones that they were split from) two different clients try and commit a Compare-and-Set of 3 to 4, and both are successful. Logically, this should not be possible, but I'm curious if there is any functional problem with both returning successful. Both clients correctly believe what the final state is, and the command that they sent out was successful. I can't think of any serious problems. Are there any?


Solution

  • The "standard" definition of CAS (to the extent that there is such a thing?) guarantees that at most one writer will see a successful response for a particular transition. A couple examples that depend on this guarantee:

    // generating a unique id
    while (true) {
      unique_id = read(key)
      if (compare_and_set(key, unique_id, unique_id + 1)) {
        return unique_id
      }
    }
    

    If two clients both read 3 and successfully execute compare_and_set(key, 3, 4), they'll both think they've "claimed" 3 as their unique id and may end up colliding down the road.

    // distributed leases/leader election
    while (true) {
      locked_until = read(key)
      if (locked_until < now()) {
        if (compare_and_set(key, locked_until, now() + TEN_MINUTES)) {
          // I'm now the leader for ~10 minutes.
          return;
        }
      }
      sleep(TEN_MINUTES)
    }
    

    Similar problem here: if two clients see that the lock is available and both successfully CAS to acquire it, they'll both believe that they are the leader at the same time.