I am trying to understand the mechanism of distributed systems and how they store data. Most distributed systems "replicate data within the cluster to prevent data loss."
But what I am trying to understand is this: suppose I have a cluster of 3 nodes with 1 TB of memory each, and the master node replicates its data to the other 2 nodes. If the master node uses 500 GB of memory, then each of the other 2 nodes must use the same amount of memory to hold its replica.
If this holds true, how do I increase the memory capacity of my cluster? In this case it can hold at most the same amount of data (i.e. 1 TB), even though the cluster's total memory is far larger than the data it stores.
That is true. The way you scale the memory in your cluster is by partitioning the data. So, if you add a fourth node to your three-node cluster, you split the data set into four partitions (typically by key) and replicate each partition on three nodes. Each time you add a node, you add another partition.
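To make the idea concrete, here is a minimal sketch of key-based partitioning with replication. The node count, replication factor, and the `owners` helper are illustrative assumptions, not any particular system's API:

```python
import hashlib

NUM_NODES = 4            # assumed cluster size after adding a fourth node
REPLICATION_FACTOR = 3   # each partition is stored on 3 nodes

def owners(key: str) -> list[int]:
    """Return the nodes that store `key`: a primary plus replicas."""
    # Hash the key and map it to a primary partition/node.
    primary = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_NODES
    # Replicas go on the next nodes in sequence, wrapping around the cluster.
    return [(primary + i) % NUM_NODES for i in range(REPLICATION_FACTOR)]

print(owners("user:42"))   # e.g. [1, 2, 3]
```

Each node now holds only its own partitions (plus replicas of a few others), so total capacity grows as you add nodes instead of every node holding a full copy.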
This creates a well-known problem in distributed systems, though. When you add a node to a cluster whose data set is partitioned by key (e.g. with `hash(key) % N`), the entire data set has to be repartitioned and rebalanced, because almost every key maps to a different node once N changes. This can be an impractically expensive operation, particularly for large clusters that evolve frequently. The way to deal with this problem is consistent hashing, where each key is hashed onto a ring and assigned to a primary plus n backups. With consistent hashing, only neighboring nodes have to be rebalanced when a node is added or removed. For more information on this type of distributed system, read the Dynamo paper.
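Here is a rough sketch of a consistent-hash ring, just to illustrate the idea; the class and method names are made up for this answer, and real systems (including Dynamo) additionally place many virtual nodes per physical node to smooth out the load:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring: nodes and keys share one hash space."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Place each node at a point on the ring, sorted by its hash.
        self.ring = sorted((_hash(n), n) for n in nodes)

    def owners(self, key):
        """Walk clockwise from the key's hash: first node is the primary, the rest are backups."""
        hashes = [h for h, _ in self.ring]
        start = bisect.bisect(hashes, _hash(key)) % len(self.ring)
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.owners("user:42"))   # e.g. ['node-c', 'node-d', 'node-a']
```

Adding a node only inserts one new point on the ring, so only the keys in the segment between it and its predecessor move; everything else stays where it was.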