Tags: hash, distributed-computing, scalability, consistent-hashing

How to reliably shard data across multiple servers


I am currently reading up on some distributed systems design patterns. One of the design patterns, when you have to deal with a lot of data (billions of entries or multiple petabytes), is to spread it out across multiple servers or storage units.

One of the solutions for this is to use consistent hashing. This should result in an even spread of data across all servers in the hash ring.

The concept is rather simple: we can just add new servers and only the servers in the affected range are impacted, and if you lose servers the remaining servers in the consistent hash take over. This is straightforward when all servers in the hash hold the same data (in memory, on disk or in a database).
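To make the idea concrete, here is a minimal consistent hash ring sketch in Python (not tied to any particular product; the class and method names are just for illustration). Each server is placed at many points ("virtual nodes") on the ring, and a key belongs to the first server point clockwise from the key's hash:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = {}      # ring position -> node name
        self._sorted = []      # sorted ring positions
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        # Map any string to a 64-bit position on the ring.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_node(self, node):
        # Each physical server gets many points, which evens out the key distribution.
        for i in range(self.vnodes):
            self._points[self._hash(f"{node}#{i}")] = node
        self._sorted = sorted(self._points)

    def remove_node(self, node):
        self._points = {p: n for p, n in self._points.items() if n != node}
        self._sorted = sorted(self._points)

    def owner(self, key):
        # The owner is the first server point clockwise from the key's hash.
        idx = bisect.bisect_right(self._sorted, self._hash(key)) % len(self._sorted)
        return self._points[self._sorted[idx]]

ring = ConsistentHashRing(["machine-0", "machine-1"])
print(ring.owner("user:42"))   # -> machine-0 or machine-1
```

Because each server owns many small slices of the ring, adding or removing a server only remaps the keys inside the slices it gains or loses.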

My question is: how do we handle adding and removing servers from a consistent hash when there is so much data that it can't all be stored on a single host? How do the servers figure out what data to store and what not to?

Example:


Let's say that we have 2 machines running, "0" and "1". They are starting to reach 60% of their maximum capacity, so we decide to add an additional machine "2". Now a large part of the data on machine 0 has to be migrated to machine 2. How would we automate this so that it happens without downtime and remains reliable?
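As a toy illustration (a simulation, not a production recipe): with consistent hashing, adding machine "2" should only remap the keys whose ring segment now belongs to "2"; everything else stays where it is. A quick way to see that:

```python
import hashlib

def h(s):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def build_ring(nodes, vnodes=100):
    """Each node gets `vnodes` positions on the ring."""
    return sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

def owner(key, ring):
    """Owner = first ring position clockwise from the key's hash."""
    kh = h(key)
    for pos, node in ring:
        if pos >= kh:
            return node
    return ring[0][1]            # wrapped around past the highest position

keys = [f"user:{i}" for i in range(10_000)]
old_ring = build_ring(["0", "1"])
new_ring = build_ring(["0", "1", "2"])

moved = [k for k in keys if owner(k, old_ring) != owner(k, new_ring)]
print(f"{len(moved) / len(keys):.0%} of keys moved")          # roughly a third
assert all(owner(k, new_ring) == "2" for k in moved)          # and all of them moved to "2"
```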

My own suggested approach would be that the service hosting the consistent hash and the machines would have to be aware of how to transfer data between each other. When a new machine is added, the consistent hash service would calculate the affected hash ranges. It would then inform the affected machines of those ranges and tell them to transfer the affected data to machine 2. Once the affected machines are done transferring their data, they would ACK back to the consistent hash service. Once all affected machines have finished transferring, the consistent hash service would start routing data for those ranges to machine 2 and inform the affected machines that they can now remove their transferred data. If we have petabytes on each server, this process can take a long time. We therefore need to keep track of which entries were changed during the transfer so we can sync them afterwards, or we can submit writes/updates to both machine 0 and machine 2 during the transfer.
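For what it's worth, here is a rough, in-memory sketch of that flow; all class and method names are hypothetical, and a real system would also need failure handling, retries, throttling of the bulk copy, and persistence of the migration state:

```python
import hashlib

def h(s):
    """Position of a string on the ring (0 .. 2**64-1)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def owner(key, machines):
    """One ring position per machine; owner = first position clockwise of the key."""
    points = sorted((h(m), m) for m in machines)
    kh = h(key)
    return next((m for p, m in points if p >= kh), points[0][1])

class Machine:
    """Stand-in for a storage node: just an in-memory dict."""
    def __init__(self, name):
        self.name, self.data = name, {}

class HashService:
    """Hypothetical coordinator implementing the hand-off described above."""
    def __init__(self, machines):
        self.machines = {m.name: m for m in machines}
        self.dual_write_to = None          # (new_machine, moving_keys) while a copy is in flight

    def write(self, key, value):
        self.machines[owner(key, self.machines)].data[key] = value
        if self.dual_write_to and key in self.dual_write_to[1]:
            # Writes during the migration also go to the joining machine,
            # so nothing is lost while the bulk copy runs.
            self.dual_write_to[0].data[key] = value

    def add_machine(self, new):
        names_after = list(self.machines) + [new.name]
        # 1. Work out which keys change owner once `new` joins (the affected ranges).
        moving = {k for m in self.machines.values() for k in m.data
                  if owner(k, names_after) == new.name}
        # 2. Dual-write those keys while the transfer is running.
        self.dual_write_to = (new, moving)
        # 3. Former owners stream the affected data to the new machine.
        for m in list(self.machines.values()):
            for k in list(m.data):
                if k in moving:
                    new.data[k] = m.data[k]
        # 4. Once the transfer is ACKed, flip the ring: `new` now serves those keys.
        self.machines[new.name] = new
        self.dual_write_to = None
        # 5. Old owners can now safely delete the keys they handed off.
        for m in self.machines.values():
            if m is not new:
                for k in moving:
                    m.data.pop(k, None)

# Toy run: two machines filling up, a third one joins.
svc = HashService([Machine("0"), Machine("1")])
for i in range(1000):
    svc.write(f"user:{i}", i)
svc.add_machine(Machine("2"))
print({name: len(m.data) for name, m in svc.machines.items()})
```

The dual-write step is what removes the need to re-sync changed entries afterwards: any write that lands in an affected range during the copy goes to both the old and the new owner.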

My approach would work, but I feel it is a little risky with all the back and forth, so I would like to hear if there is a better way.


Solution

  • How would we automate this so that it happens without downtime and remains reliable?

    It depends on the technology used to store your data, but for example in Cassandra there is no "central" entity that governs the process; it is done like almost everything else, by having nodes gossip with each other. There is no downtime when a new node joins the cluster (performance might be slightly impacted, though).

    The process is as follows:

    The new node joining the cluster is defined as an empty node without system tables or data.
    
    When a new node joins the cluster using the auto bootstrap feature, it will perform the following operations:
    
    - Contact the seed nodes to learn about gossip state.
    - Transition to Up and Joining state (to indicate it is joining the cluster; represented by UJ in the nodetool status).
    - Contact the seed nodes to ensure schema agreement.
    - Calculate the tokens that it will become responsible for.
    - Stream replica data associated with the tokens it is responsible for from the former owners.
    - Transition to Up and Normal state once streaming is complete (to indicate it is now part of the cluster; represented by UN in the nodetool status).
    

    Taken from https://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html

    So while the joining node is in the Joining state, it is receiving data from other nodes but is not ready to serve reads until the process is complete (Up and Normal status).
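    To illustrate steps 4 and 5 (calculating tokens and streaming from the former owners), here is a deliberately simplified sketch; it assumes a single token per node and replication factor 1, which is not how a real Cassandra cluster with vnodes and RF > 1 behaves, but it shows the idea:

    ```python
    import bisect

    def bootstrap_plan(existing, new_node, new_token):
        """Illustrative only (one token per node, replication factor 1):
        which token range does the joining node take over, and who streams it?

        `existing` maps node name -> token; tokens are positions on the ring."""
        tokens = sorted(existing.values())
        token_to_node = {t: n for n, t in existing.items()}

        # Range taken over: (predecessor's token, new_token], wrapping around the ring.
        i = bisect.bisect_left(tokens, new_token)
        predecessor = tokens[i - 1]                  # i - 1 == -1 wraps correctly in Python

        # Former owner of that range: the next node clockwise from the new token.
        successor = tokens[i % len(tokens)]
        former_owner = token_to_node[successor]

        return {"range": (predecessor, new_token), "stream_from": former_owner}

    # Nodes "0" and "1" with tokens 100 and 500; node "2" joins with token 300.
    print(bootstrap_plan({"0": 100, "1": 500}, "2", 300))
    # {'range': (100, 300), 'stream_from': '1'}
    ```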

    DataStax also has some material on this: https://academy.datastax.com/units/2017-ring-dse-foundations-apache-cassandra?path=developer&resource=ds201-datastax-enterprise-6-foundations-of-apache-cassandra