I have:
I want to run one of the Hadoop utils on master node (hadoop distcp
) programatically. What is the best way to do that? So far I have the next clue: ssh to master node and run util from there. Is there any other option to accomplish the same goal?
To run DistCp you can submit regular Hadoop MR job through Dataproc API or gcloud and specify org.apache.hadoop.tools.DistCp
as a main class:
gcloud dataproc jobs submit hadoop --cluster=<CLUSTER> \
--class=org.apache.hadoop.tools.DistCp -- <SRC> <DST>
From Python you can use either REST API directly or Python Client library to submit DistCp job.