hadoop jobs distcp

How can I list active DISTCP jobs?



I'm running a distcp job between two clusters:

    $ hadoop distcp hdfs://x/y /x/y

I want to run this continually, but first I need to make sure any existing distcp jobs have completed.

I've tried the following on both the source and destination clusters, but I cannot see the copy operation:

    $ mapred job -list all


Solution

  • This is essentially a variation on the question "Yarn api get applications by elapsedTime". You can use the ResourceManager (RM) Cluster Applications API to list all applications (unfortunately it doesn't filter on name), then keep only the applications whose name equals distcp. The following shows how to filter using jq:

    $ curl -s 'RMURL/ws/v1/cluster/apps' | jq '.apps.app[] | select(.name == "distcp")'
    

    Since you're only interested in active jobs, add the states query parameter to the API call:

    $ curl -s 'RMURL/ws/v1/cluster/apps?states=NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING' |\
        jq '.apps.app[] | select(.name == "distcp")'
    

    http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
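
To cover the "run this continually" part of the question, the states query above could be wrapped in a small polling loop that blocks until no active distcp application remains. This is only a sketch under a few assumptions: `RM_URL`, `POLL_SECONDS`, `count_active_distcp` and `wait_for_distcp` are hypothetical names (with `RM_URL` standing in for the RMURL placeholder), the job name is exactly distcp as in the filter above, and the RM is assumed to return a null `apps` field when nothing matches, which `.apps.app[]?` tolerates.

```shell
#!/usr/bin/env bash
# Sketch: block until no active application named "distcp" is left on the RM.
# RM_URL and POLL_SECONDS are hypothetical settings; adjust them for your cluster.
RM_URL="${RM_URL:-http://localhost:8088}"
POLL_SECONDS="${POLL_SECONDS:-30}"
ACTIVE_STATES="NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING"

# Print the number of active apps whose name is exactly "distcp".
# Returns non-zero if the RM cannot be queried, so that an outage is
# never mistaken for "no jobs running".
count_active_distcp() {
  local json
  json="$(curl -sf "${RM_URL}/ws/v1/cluster/apps?states=${ACTIVE_STATES}")" || return 1
  # `[]?` yields nothing (instead of an error) when .apps is null.
  printf '%s' "${json}" | jq '[.apps.app[]? | select(.name == "distcp")] | length'
}

# Poll until the count reaches zero; give up if the RM query fails.
wait_for_distcp() {
  local n
  while true; do
    n="$(count_active_distcp)" || { echo "RM query failed" >&2; return 1; }
    if [ "${n}" -eq 0 ]; then
      return 0
    fi
    echo "${n} distcp job(s) still active; sleeping ${POLL_SECONDS}s" >&2
    sleep "${POLL_SECONDS}"
  done
}
```

After sourcing these functions, the next copy could be gated with `wait_for_distcp && hadoop distcp hdfs://x/y /x/y`. If the job was submitted with a custom name, the exact-match filter on distcp would need to be adjusted accordingly.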