Tags: hadoop, hdfs, replication

How HDFS removes over-replicated blocks


For example, I wrote a file into HDFS with a replication factor of 2. The node I was writing to now holds all of the blocks of the file, and the other copies of those blocks are scattered across the remaining nodes in the cluster; that's the default HDFS placement policy. What exactly happens if I lower the replication factor of the file to 1? How does HDFS decide which blocks to delete from which nodes? I hope it tries to delete blocks from the nodes that hold the most blocks of the file.

Why I'm asking: if it does work that way, it would make sense, because it would ease later processing of the file. If only one copy of each block remained and all of those blocks sat on the same node, the file would be harder to process with MapReduce, since the data would have to be transferred to other nodes in the cluster.
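
For context, lowering the replication factor of an existing file can be done with `hdfs dfs -setrep 1 /path/to/file` or through the Java `FileSystem` API. A minimal sketch of the API route is below; the path `/data/myfile` is a placeholder, and the call only records the new target replication with the name node, which then schedules the excess replicas for deletion in the background.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplication {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path; replace with the file in question.
        Path file = new Path("/data/myfile");

        // Ask the name node to lower the target replication to 1.
        // Returns once the new target is recorded; the actual deletion
        // of excess replicas happens asynchronously on the cluster.
        boolean accepted = fs.setReplication(file, (short) 1);
        System.out.println("Replication change accepted: " + accepted);

        fs.close();
    }
}
```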


Solution

  • When a block becomes over-replicated, the name node chooses a replica to remove. It prefers not to reduce the number of racks that host replicas, and secondly prefers to remove the replica from the data node with the least available disk space. This can help rebalance load across the cluster. (A simplified sketch of this selection logic follows below the source note.)

    Source: The Architecture of Open Source Applications
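
    The selection heuristic can be illustrated with a small standalone sketch. This is not the real name node code (the actual logic lives in the name node's block placement policy); the `Replica` class and its fields here are invented for illustration, but the two preferences match the answer above: first keep rack diversity, then free space on the fullest node.

    ```java
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ChooseReplicaToDelete {

        // Hypothetical representation of one replica of an over-replicated block.
        static class Replica {
            final String node;    // data node hosting the replica
            final String rack;    // rack the data node sits in
            final long freeSpace; // remaining disk space on that data node, in bytes

            Replica(String node, String rack, long freeSpace) {
                this.node = node;
                this.rack = rack;
                this.freeSpace = freeSpace;
            }
        }

        static Replica chooseReplicaToDelete(List<Replica> replicas) {
            // Count replicas per rack so we know which racks hold more than one copy.
            Map<String, Integer> perRack = new HashMap<>();
            for (Replica r : replicas) {
                perRack.merge(r.rack, 1, Integer::sum);
            }

            // Prefer candidates whose removal does not reduce the number of racks,
            // i.e. replicas on racks that still hold at least one other copy.
            List<Replica> candidates = new ArrayList<>();
            for (Replica r : replicas) {
                if (perRack.get(r.rack) > 1) {
                    candidates.add(r);
                }
            }
            if (candidates.isEmpty()) {
                // Every rack has exactly one copy; any removal loses a rack.
                candidates = replicas;
            }

            // Among the remaining candidates, delete from the node with the least free space.
            return candidates.stream()
                    .min(Comparator.comparingLong(r -> r.freeSpace))
                    .orElseThrow();
        }

        public static void main(String[] args) {
            List<Replica> replicas = List.of(
                    new Replica("dn1", "rackA", 50L * 1024 * 1024 * 1024),
                    new Replica("dn2", "rackA", 10L * 1024 * 1024 * 1024),
                    new Replica("dn3", "rackB", 80L * 1024 * 1024 * 1024));

            // dn2 is chosen: rackA still keeps a copy on dn1, and dn2 has the least free space.
            System.out.println("Delete replica on " + chooseReplicaToDelete(replicas).node);
        }
    }
    ```

    Note that in your scenario (dropping from 2 replicas to 1), this heuristic does not specifically favor deleting from the node that holds the most blocks of the file; each block is considered on its own, so the surviving replicas may or may not end up spread out.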