Search code examples
performancehadoophdfshadoop-yarndistcp

Asisstance in reducing execution time of distcp operation


We have many distcp jobs copying data from our primary cluster to our backup cluster. these jobs run all day and copy almost all tables of critical databases. We use webhdfs here.

Some of these jobs run for hours ( for tables (ORC format ones )that are huge .Is there any way we can optimise the distcp operation between two clusters. Any suggestions are welcome.

We tried using bandwidth to speed up. below is the excerpt from our script.

PROP="-Dmapreduce.task.timeout=300000 -Dmapred.job.queue.name=$YARN_QUEUE -Dmapred.job.name="cpy-${jobName}" -bandwidth 800 "

hadoop distcp ${PROP} $1 WEBHDFS://$DESTNAMENODE$2 >> $3 2>&1


Solution

  • Three things I usually look at when tuning distcp performance;

    • the number of mappers used for the distcp operation

    The '-m' option will allow you to specify the number of map tasks used, the maximum number of simultaneous copies so to speak. Try running the copy a couple of times and gradually increase this number to see what works best for your scenario.

    • strategy dynamic

    You can run the DistCp job with a '-strategy dynamic' flag that will “dynamically” size maps enabling the faster or more responsive nodes to copy more data than slower or busy nodes. You can read more about this in the DistCp manual.

    • bandwidth

    Looks like you already experimented with the '-bandwidth' option, but I wanted to mention it here as it is definitely an important factor. Try increasing this even further if your network allows for it.