distributed-system system-design

How to coordinate calls in multi-primary, multi-regional distributed systems?


I have a multi-primary, multi-regional service, in which a user's resource is effectively spread across multiple regions. Users can mutate items inside the resource via data plane API calls, and they can mutate the resource itself (its configuration) via control plane API calls.

If a user makes a resource mutating call to the control plane in region A, an error needs to be returned to the user if they attempt to make another mutation call to the control plane in region B, while the long-running process from the first call is in progress. Creating a "lock" on their resource in region A and replicating the "lock" to B (eventual consistency) doesn't work, because the user might make a mutation call in region B while the "lock" replication is in flight.

What are some common approaches to handling this problem? My thought is there needs to be a single, "global" region that serves as a coordinator.

For example, say region C is designated the coordinator.

1a) User calls the control plane in region A.

1b) User calls the control plane in region B.

2) Internally, the service routes both calls to region C.

3a) 1a) wins the lock acquisition race. The lock is created and stored in region C, and a successful process-start signal is returned to the user.

3b) 1b) loses the lock acquisition race. An error is returned to the user.
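The coordinator flow above can be sketched with an in-memory lock table standing in for region C's store (class and method names here are hypothetical, and a real service would use a durable store with leases/TTLs):

```python
import threading

class CoordinatorLockService:
    """Stands in for region C: the single authority for per-resource locks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._held = {}  # resource_id -> owning request id

    def try_acquire(self, resource_id, request_id):
        """Atomically acquire the lock; only one concurrent caller can win."""
        with self._lock:
            if resource_id in self._held:
                return False  # caller surfaces an error to the user
            self._held[resource_id] = request_id
            return True

    def release(self, resource_id, request_id):
        """Release only if this request still owns the lock."""
        with self._lock:
            if self._held.get(resource_id) == request_id:
                del self._held[resource_id]

# Requests from regions A and B both race to region C:
coordinator = CoordinatorLockService()
won_a = coordinator.try_acquire("resource-1", "req-from-A")  # first caller wins
won_b = coordinator.try_acquire("resource-1", "req-from-B")  # loser gets an error
coordinator.release("resource-1", "req-from-A")              # long-running process done
```

Because every acquisition goes through one region, the race is decided atomically and the eventual-consistency window of replicated locks disappears.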


Solution

  • In systems with multiple leaders, there are two main strategies:

    1. Resolve the conflict when it is detected
    2. Don't allow conflicts to happen

    Resolving a conflict on detection has two main approaches: solve on write (on detect) and solve on read.

    Solve on write/detect happens when a conflict is detected during data replication. The service has to decide how to merge the conflicting records, usually based on some attributes of those records. For example, the Last Write Wins approach looks at timestamps and picks the latest. Overall, this approach carries a risk of lost updates.
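A minimal sketch of a Last-Write-Wins merge, assuming each record carries a comparable timestamp (the record shape is illustrative):

```python
def lww_merge(record_a, record_b):
    """Last Write Wins: keep the record with the later timestamp.

    Note the hazard: the losing write is silently discarded (a lost update).
    """
    return record_a if record_a["ts"] >= record_b["ts"] else record_b

a = {"value": "X=1", "ts": 100}  # written in region A
b = {"value": "X=2", "ts": 105}  # written concurrently in region B
merged = lww_merge(a, b)         # region B's later write wins; A's update is lost
```

In practice wall-clock timestamps across regions are only approximately ordered, which is another reason LWW can drop writes in surprising ways.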

    Solve on read is tricky. The system returns all conflicting records to the reader, and the reader decides what to do. The tricky part is that in a multi-region setup reads usually happen from one region, so the system might not even know that a conflict has occurred. For example: in region A we set X=1 and in region B we set X=2; if, before synchronization happens, a reader asks region B for X, it gets a definitive answer that X=2 and no conflict exists. After replication, the read would return both records (X=1 and X=2) for the reader to decide between. But once the reader resolves the conflict, it must send the chosen value back to both regions, which is hard.
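The X=1/X=2 example can be sketched with a toy per-key store that keeps concurrent writes as "siblings" until a reader resolves them (the store and its method names are hypothetical):

```python
class SiblingStore:
    """Toy per-key store that accumulates conflicting writes as siblings."""

    def __init__(self):
        self._siblings = {}  # key -> list of conflicting values

    def replicate_in(self, key, value):
        """A local write or an incoming replicated write adds a sibling."""
        self._siblings.setdefault(key, []).append(value)

    def read(self, key):
        """Return every sibling; the reader must resolve any conflict."""
        return list(self._siblings[key])

    def resolve(self, key, value):
        """The reader's decision replaces all siblings; in a real system it
        must also be pushed back to every region, which is the hard part."""
        self._siblings[key] = [value]

region_b = SiblingStore()
region_b.replicate_in("X", 2)   # local write in B: read sees only X=2, no conflict
region_b.replicate_in("X", 1)   # A's write arrives via replication
siblings = region_b.read("X")   # now both values surface for the reader
region_b.resolve("X", 2)        # reader decides; decision must reach all regions
```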

    If I had to pick, assuming the business rules are met, I would go with the first approach - solve conflicts on write/detect - due to the complexity of resolve-on-read.

    The second major approach is to "Don't allow conflicts to happen".

    The obvious solution here is to use locks. Clearly, locks across regions are slow and will suffer from region failures. If this is the way, then it makes sense to ask: is it OK to allow certain records to be updated in only one ("home") region? This is a much simpler approach, and I would go with it if it fits the business requirements.
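The home-region idea can be sketched as a routing check in front of every control-plane mutation; since only one region ever applies writes for a given resource, no cross-region conflict can arise (the mapping and function names are illustrative):

```python
# Hypothetical static assignment of each resource to its home region.
HOME_REGION = {"resource-1": "A", "resource-2": "B"}

def handle_update(local_region, resource_id, mutate):
    """Apply a control-plane mutation only in the resource's home region;
    any other region redirects (or proxies) the call instead of writing."""
    home = HOME_REGION[resource_id]
    if local_region != home:
        return ("redirect", home)
    mutate(resource_id)  # safe: writes for this resource are serialized here
    return ("applied", local_region)

applied = []
result_b = handle_update("B", "resource-1", applied.append)  # wrong region: redirect
result_a = handle_update("A", "resource-1", applied.append)  # home region: applied
```

The cost is that users far from the home region pay a cross-region round trip on writes, but reads can still be served locally everywhere.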

    We could use a dedicated lock manager in another region. Clearly, that would be a single point of failure. This could be partly mitigated by a consensus-based lock service deployed across all three regions, so the system stays operational when one of the three regions goes down. The problem here is latency: all of this requires many cross-region calls, and if you already accept cross-region calls, then it is easier to allow updates to go to just one region.
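The availability property of a consensus-based lock service can be illustrated with a simple majority check: a lock grant needs a quorum of regions, so one of three regions can be down, but every acquisition pays for cross-region round trips (this is a toy model, not a consensus implementation):

```python
def acquire_with_quorum(votes):
    """Grant the lock only if a majority of regions acknowledged it.

    Each element of `votes` models whether a region's replica responded;
    with 3 regions, any 2 acks suffice, so one region outage is tolerated.
    """
    acks = sum(1 for v in votes if v)
    return acks > len(votes) // 2

all_up = acquire_with_quorum([True, True, True])        # granted
one_region_down = acquire_with_quorum([True, True, False])  # still granted
two_regions_down = acquire_with_quorum([True, False, False])  # denied
```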

    And the last approach in the "don't allow conflicts" bucket is to use conflict-free data structures (CRDTs). Unfortunately, those structures have very limited use cases and don't fit many business systems.
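As one concrete example of such a structure, here is a grow-only counter (G-Counter), one of the classic CRDTs: each region increments only its own slot, and merge is an element-wise max, so replicas converge regardless of replication order with no coordination at all:

```python
class GCounter:
    """Grow-only counter CRDT: per-region slots, merge by element-wise max."""

    def __init__(self, regions):
        self.counts = {r: 0 for r in regions}

    def increment(self, region, n=1):
        """Each region only ever increments its own slot."""
        self.counts[region] += n

    def merge(self, other):
        """Commutative, associative, idempotent: safe in any order."""
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts[region], count)

    def value(self):
        return sum(self.counts.values())

a = GCounter(["A", "B"])
b = GCounter(["A", "B"])
a.increment("A", 3)  # concurrent updates in different regions
b.increment("B", 2)
a.merge(b)           # replication in either direction, any order
b.merge(a)
# both replicas converge to the same total
```

The limitation the answer points at is visible even here: a G-Counter can only grow, and most business state (configs, balances, inventories) needs semantics that CRDTs cannot express.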

    As a quick summary: I would rethink the requirement that updates to the same key may happen in different regions.