
Does Hadoop Distcp copy at block level?


DistCp runs as a MapReduce job, whether it copies between clusters or within one. My assumption was that it copies files at the input-split level, which would help copy performance because a single file would be copied by multiple mappers working on different "pieces" in parallel. However, when I went through the Hadoop DistCp documentation, it seemed that DistCp only works at the file level. Please refer to: hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html

According to the DistCp documentation, DistCp splits only the list of files, not the files themselves, and gives the partitions of that list to the mappers.

Can anyone explain exactly how this works?

  • Additional question: if a file is assigned to only one mapper, how does that mapper, running on a single node, get at all of the file's input splits?

Solution

  • For a single file of about 50 GB, one map task is launched to copy the data, because a file is the finest level of granularity in DistCp.

    Quoting from the documentation:

    Why does DistCp not run faster when more maps are specified?

    At present, the smallest unit of work for DistCp is a file. i.e., a file is processed by only one map. Increasing the number of maps to a value exceeding the number of files would yield no performance benefit. The number of maps launched would equal the number of files.

    UPDATE
    The block locations of the file are obtained from the NameNode during the MapReduce job. In DistCp, each mapper is started, if possible, on the node where the first block of the file resides. When the file consists of multiple splits, any that are not available on that node are fetched from nearby nodes.
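
To make the quoted FAQ entry concrete, here is a minimal Python sketch of file-level work assignment. It is a simplified model, not DistCp's actual code: the function name `assign_files_to_maps`, the example paths, and the largest-file-first balancing are assumptions made purely for illustration. The one property it preserves is the one the documentation states, namely that a file is never divided between maps.

```python
from typing import List, Tuple


def assign_files_to_maps(
    files: List[Tuple[str, int]], requested_maps: int
) -> List[List[str]]:
    """Assign whole files to map tasks (hypothetical model of file-level copy).

    `files` is a list of (path, size_in_bytes) pairs. A file is the smallest
    unit of work, so it always lands in exactly one map's list.
    """
    if requested_maps < 1:
        raise ValueError("requested_maps must be >= 1")

    # More maps than files would leave some maps with nothing to do.
    num_maps = min(requested_maps, len(files))
    buckets: List[List[str]] = [[] for _ in range(num_maps)]
    loads = [0] * num_maps

    # Assumption: balance bytes greedily, largest file first, always giving
    # the next file to the least-loaded map (ties go to the map with fewer files).
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        target = min(range(num_maps), key=lambda k: (loads[k], len(buckets[k])))
        buckets[target].append(path)
        loads[target] += size

    return buckets


if __name__ == "__main__":
    one_big_file = [("/data/big.bin", 50 * 1024**3)]
    # Asking for 20 maps still yields a single map for a single file.
    print(len(assign_files_to_maps(one_big_file, 20)))  # prints 1

    several = [("a", 10), ("b", 7), ("c", 3), ("d", 2)]
    print(assign_files_to_maps(several, 2))  # prints [['a', 'd'], ['b', 'c']]
```

In this model the 50 GB file ends up in one map's list no matter how many maps are requested, which is why raising the map count does not speed up the copy of a single large file.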