google-app-engine, google-cloud-platform, google-cloud-dataproc

How to run Hadoop utils on Dataproc cluster programmatically?


I have:

  • App Engine application (Java/Python)
  • Dataproc cluster

I want to run one of the Hadoop utilities (hadoop distcp) on the master node programmatically. What is the best way to do that? So far the only approach I have found is to SSH into the master node and run the utility from there. Is there another way to accomplish the same goal?


Solution

  • To run DistCp you can submit a regular Hadoop MR job through the Dataproc API or gcloud and specify org.apache.hadoop.tools.DistCp as the main class:

    gcloud dataproc jobs submit hadoop --cluster=<CLUSTER> \
        --class=org.apache.hadoop.tools.DistCp -- <SRC> <DST>
    

    From Python you can use either the REST API directly or the Python client library to submit the DistCp job, as shown in the sketch below.
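
    A minimal sketch using the google-cloud-dataproc Python client library; the project, region, cluster name, and GCS paths are placeholders, and the job definition mirrors the gcloud command above:

        from google.cloud import dataproc_v1

        project_id = "my-project"   # placeholder
        region = "us-central1"      # placeholder
        cluster = "my-cluster"      # placeholder

        # The job controller endpoint is regional.
        client = dataproc_v1.JobControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )

        # A Hadoop job whose main class is DistCp, equivalent to the
        # gcloud command above.
        job = {
            "placement": {"cluster_name": cluster},
            "hadoop_job": {
                "main_class": "org.apache.hadoop.tools.DistCp",
                "args": ["gs://my-bucket/src", "gs://my-bucket/dst"],  # placeholder paths
            },
        }

        # submit_job_as_operation returns an operation you can wait on.
        operation = client.submit_job_as_operation(
            project_id=project_id, region=region, job=job
        )
        result = operation.result()  # blocks until the DistCp job finishes
        print(f"Job finished with state: {result.status.state.name}")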