hadoop, hdfs, replication, cloudera, cloudera-cdh

hdfs moveFromLocal does not distribute replica blocks across data nodes


I recently upgraded my Cloudera environment from 5.8.x (Hadoop 2.6.0, hdfs-1) to 6.3.x (Hadoop 3.0.0, hdfs-1). After some days of loading data with moveFromLocal, I realized that the DFS Used% of the datanode server on which I run moveFromLocal is 3x higher than that of the others.

I then ran fsck with the -blocks, -locations and -replicaDetails flags over the HDFS path to which I load the data, and observed that the replicated blocks (RF=2) are all on that same server and are not distributed to other nodes unless I manually run the hdfs balancer.

There is a related question asked a month ago, "hdfs put/moveFromLocal not distributing data across data nodes?", which does not really answer any of these questions. The files I keep loading are Parquet files.

There was no such problem in Cloudera 5.8.x. Is there some new configuration I should set in Cloudera 6.3.x related to replication, rack awareness, or something similar?

Any help would be highly appreciated.


Solution

  • According to the HDFS Architecture doc, "For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode..."

    Per the same doc, "Because the NameNode does not allow DataNodes to have multiple replicas of the same block, maximum number of replicas created is the total number of DataNodes at that time."
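
To make the two quoted rules concrete, here is a rough sketch in Python of how they would skew disk usage toward the writer's node. It is a toy model, not HDFS code: the names `place_replicas` and `simulate`, the node names, and the block count are all made up for illustration, and rack awareness is ignored entirely. It only assumes what the quotes state: the first replica goes to the local machine when the writer is a datanode, and no datanode holds more than one replica of the same block.

```python
import random
from collections import Counter


def place_replicas(writer, datanodes, replication, rng):
    """Toy placement: first replica on the writer's node if it is a DataNode,
    remaining replicas on distinct other nodes picked at random.
    (Hypothetical helper; real HDFS also takes racks and load into account.)"""
    # A DataNode never holds two replicas of one block, so cap at node count.
    replication = min(replication, len(datanodes))
    first = writer if writer in datanodes else rng.choice(datanodes)
    others = [node for node in datanodes if node != first]
    return [first] + rng.sample(others, replication - 1)


def simulate(num_blocks, writer, datanodes, replication, seed=0):
    """Count how many block replicas each node ends up holding."""
    rng = random.Random(seed)
    usage = Counter()
    for _ in range(num_blocks):
        usage.update(place_replicas(writer, datanodes, replication, rng))
    return usage


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # assumed cluster, writer runs on dn1
    usage = simulate(3000, "dn1", nodes, replication=2)
    for node in nodes:
        print(node, usage[node])
```

Under these assumptions the writer node holds one replica of every block, while the second replicas are spread over the remaining nodes. In a four-node setup with RF=2 that works out to roughly three times the usage on the writer node, even though no block has both replicas on the same server.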