I recently started to learn more about service registries and their usage in distributed architecture.
All the applications providing service registries that I found (etcd, Consul, or ZooKeeper) are based on the same model: a master server or cluster with leader election.
Correct me if I'm wrong, but doesn't this make the architecture less reliable, in the sense that the master cluster introduces a point of failure? To work around this we could always build a bigger cluster, but that costs more and/or performs worse.
My questions here are:
All of those services build on ideas from one whitepaper: Google's Chubby (https://ai.google/research/pubs/pub27897). The idea is to provide fast, consistent configuration storage for distributed systems. To get there you need to eliminate any single point of failure. How can you do that? You run multiple machines and replicate the same data to each of them. But given an unreliable network between those machines, how do you make sure the data stays consistent across nodes? You choose one node of the cluster to be the leader (using a distributed leader election algorithm); if nodes hold inconsistent values, it's the leader's job to pick the "correct" one.
It looks like we've returned to a "single point of failure" situation, but in reality, if the leader fails, the rest of the cluster just votes and promotes a new leader, which works as long as a majority of the nodes can still reach each other. So the leader's role in those systems is NOT to be a single point of truth, but rather a single point of decision making.
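To make the failover idea concrete, here is a toy sketch in Python. It is not how etcd, Consul, or ZooKeeper actually implement elections (they use protocols such as Raft or ZAB, with terms, timeouts, and real network messages); the `Node` and `Cluster` classes and the "longest log wins, lowest id breaks ties" voting rule are assumptions made up for this illustration. It only shows the two points above: a new leader is promoted when the old one dies, and nothing is decided without a majority.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    log: list = field(default_factory=list)  # replicated writes, in order
    alive: bool = True


class Cluster:
    """Toy in-memory cluster (hypothetical, for illustration only)."""

    def __init__(self, size):
        self.nodes = [Node(i) for i in range(size)]
        self.leader = None

    def quorum(self):
        # Strict majority of the full membership, not just of live nodes.
        return len(self.nodes) // 2 + 1

    def elect(self):
        live = [n for n in self.nodes if n.alive]
        if len(live) < self.quorum():
            self.leader = None  # too few voters: nobody may lead
            return None
        # Simplified vote: every live node backs the live node with the
        # longest log; ties go to the lowest node id.
        self.leader = max(live, key=lambda n: (len(n.log), -n.node_id))
        return self.leader

    def write(self, value):
        # A dead or missing leader triggers a new election first.
        if self.leader is None or not self.leader.alive:
            self.elect()
        live = [n for n in self.nodes if n.alive]
        if self.leader is None or len(live) < self.quorum():
            raise RuntimeError("no quorum: write rejected")
        # Simplification: the leader replicates to every live node at once.
        for n in live:
            n.log.append(value)


if __name__ == "__main__":
    cluster = Cluster(3)
    print("leader:", cluster.elect().node_id)   # leader: 0
    cluster.write("a=1")

    cluster.nodes[0].alive = False              # the leader crashes
    cluster.write("a=2")                        # re-election happens here
    print("leader:", cluster.leader.node_id)    # leader: 1

    cluster.nodes[1].alive = False              # only 1 of 3 nodes left
    try:
        cluster.write("a=3")
    except RuntimeError as err:
        print(err)                              # no quorum: write rejected
```

Losing the leader costs one election, not the data. Losing the majority stops writes entirely, which is the trade-off these systems accept in order to stay consistent.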