DistCp, whether between or within clusters, runs as a MapReduce job. My assumption was that it copies files at the input-split level, which would help copy performance because a single file would be copied by multiple mappers working on multiple "pieces" in parallel. However, when I went through the Hadoop DistCp documentation, it seemed that DistCp only works at the file level. Please refer to: hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
According to the DistCp documentation, DistCp splits only the list of files, not the files themselves, and hands partitions of that list to the mappers.
Can anyone explain exactly how this works?
For a single file of ~50 GB, only one map task will be triggered to copy the data, since a file is the finest level of granularity in DistCp.
Quoting from the documentation:
Why does DistCp not run faster when more maps are specified?
At present, the smallest unit of work for DistCp is a file. i.e., a file is processed by only one map. Increasing the number of maps to a value exceeding the number of files would yield no performance benefit. The number of maps launched would equal the number of files.
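To make the quoted behaviour concrete, here is a minimal toy model in Python. It is not DistCp's actual code: `plan_copy` and `MapTask` are hypothetical names, and the greedy "largest file to the least-loaded task" balancing is an assumption chosen for illustration. The only property it is meant to show is that whole files are assigned to maps, so the number of useful maps never exceeds the number of files.

```python
from dataclasses import dataclass, field


@dataclass
class MapTask:
    """Hypothetical stand-in for one map task's share of the copy list."""
    files: list = field(default_factory=list)
    total_bytes: int = 0


def plan_copy(file_sizes, requested_maps):
    """Assign whole files to map tasks; a file is never divided.

    file_sizes: dict mapping a file path to its size in bytes.
    requested_maps: the number of maps asked for by the user.
    """
    if requested_maps < 1:
        raise ValueError("requested_maps must be at least 1")

    # Extra maps beyond the number of files would have nothing to copy.
    num_maps = min(requested_maps, len(file_sizes))
    tasks = [MapTask() for _ in range(num_maps)]

    # Assumed balancing rule: largest files first, each one going to
    # the task that currently has the fewest bytes to copy.
    by_size = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for path, size in by_size:
        target = min(tasks, key=lambda t: t.total_bytes)
        target.files.append(path)
        target.total_bytes += size

    return tasks


if __name__ == "__main__":
    GB = 1024 ** 3
    # One ~50 GB file, 20 maps requested: only one task gets any work.
    plan = plan_copy({"/data/big.bin": 50 * GB}, requested_maps=20)
    print(len(plan), plan[0].files)
```

In this model, asking for 20 maps to copy one 50 GB file still yields a single task holding that file, which matches the FAQ answer quoted above.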
UPDATE
The block locations of the file are obtained from the NameNode during the MapReduce job. In DistCp, each mapper is started, if possible, on the node where the first block of the file resides. Where the file is composed of multiple splits, the remaining ones are fetched from neighbouring nodes if they are not available on the same node.