hadoop, hdfs, hadoop2

How does balancer work in HDFS?


Balancer iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization.
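To make the iterative behaviour concrete, here is a minimal toy sketch in Python. It is not the real Balancer code: it assumes all DataNodes have equal capacity (so utilization percentages transfer one-to-one), and the names `balance`, `threshold`, `step` and the `dn1`..`dn4` hosts are hypothetical. It repeatedly moves data from the most-utilized node to the least-utilized node until every node is within `threshold` percentage points of the cluster average.

```python
def balance(utilization, threshold=10.0, step=1.0, max_moves=100_000):
    """Toy model of iterative balancing (assumes equal-capacity nodes).

    utilization: dict of node name -> percent of disk used.
    Returns the balanced utilization and the number of moves performed.
    """
    util = dict(utilization)
    moves = 0
    while moves < max_moves:
        mean = sum(util.values()) / len(util)
        source = max(util, key=util.get)   # most-utilized node
        target = min(util, key=util.get)   # least-utilized node
        # Stop once every node is within `threshold` of the average.
        if util[source] - mean <= threshold and mean - util[target] <= threshold:
            break
        util[source] -= step
        util[target] += step
        moves += 1
    return util, moves


# Hypothetical cluster: three existing nodes plus a newly added empty one.
before = {"dn1": 80.0, "dn2": 70.0, "dn3": 75.0, "dn4": 0.0}
after, moves = balance(before)
```

In this model the new node `dn4` fills up while the older nodes drain, and the total amount of data stays the same.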

Will that affect rack awareness?

For example, I have three machines placed in two racks, and data is placed following the rack awareness rules.

What would happen if I add a new machine to the cluster and run the balancer command?


Solution

  • Rack awareness and data locality matter mostly for YARN task scheduling. The HDFS balancer's job is to level out DataNode usage.

    If you have 3 machines and the default replication factor of 3, every machine can hold one replica of each block, so with 2 racks you are practically guaranteed rack locality.

    Node locality is more performant than rack locality, anyway.

    If you have 10 GB intra-cluster speeds between nodes, data locality is a moot point. This is why AWS can still reasonably process data in S3, for example, where data-local processing is not available.
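The 3-machines, 3-replicas argument above can be checked with a small Python sketch. It assumes a hypothetical layout (`dn1` and `dn2` in `rack1`, `dn3` in `rack2`) and that replicas of a block always sit on distinct nodes. The helper names are made up for illustration and are not part of any Hadoop API.

```python
from itertools import combinations


def place_replicas(nodes, replication):
    """Enumerate every way to put `replication` replicas on distinct nodes."""
    return list(combinations(nodes, replication))


def racks_covered(placement, node_to_rack):
    """Return the set of racks that hold at least one replica."""
    return {node_to_rack[node] for node in placement}


# Hypothetical topology: three machines spread over two racks.
node_to_rack = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2"}

placements = place_replicas(sorted(node_to_rack), 3)
# Check that every possible placement touches both racks.
all_span_both_racks = all(
    len(racks_covered(p, node_to_rack)) == 2 for p in placements
)
```

With only three nodes there is a single possible placement, one replica per machine, so every block necessarily has a copy in both racks. Once a fourth node is added, more placements become possible and this guarantee no longer follows from counting alone.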