Tags: python, hadoop, google-cloud-platform, google-cloud-dataproc

Send a Hadoop Job via gcloud


This is my current Hadoop job:

java -cp `hadoop classpath`:/usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=/usr/local/src/jobs/MyJob/mnist_venv.zip \
--src_dir=/usr/local/src/jobs/MyJob/ \
--executes=/usr/local/src/jobs/MyJob/src/mnist_distributed.py \
--conf_file=/usr/local/src/jobs/MyJob/tony.xml \
--python_binary_path=venv/bin/python3.5

How do I convert this into a gcloud dataproc jobs submit hadoop job?

I tried:

gcloud dataproc jobs submit hadoop --cluster tony-dev \
  --jar /usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar --class com.linkedin.tony.cli.ClusterSubmitter -- \
  --python_venv=/usr/local/src/jobs/MyJob/mnist_venv.zip \
  --src_dir=/usr/local/src/jobs/MyJob/ \
  --executes=/usr/local/src/jobs/MyJob/src/mnist_distributed.py \
  --conf_file=/usr/local/src/jobs/MyJob/tony.xml \
  --python_binary_path=venv/bin/python3.5

I keep getting:

ERROR: (gcloud.dataproc.jobs.submit.hadoop) argument --class: Exactly one of (--class | --jar) must be specified.
Usage: gcloud dataproc jobs submit hadoop --cluster=CLUSTER (--class=MAIN_CLASS | --jar=MAIN_JAR) [optional flags] [-- JOB_ARGS ...]
  optional flags may be  --archives | --async | --bucket | --class |
                         --driver-log-levels | --files | --help | --jar |
                         --jars | --labels | --max-failures-per-hour |
                         --properties | --region
For detailed information on this command and its flags, run:
  gcloud dataproc jobs submit hadoop --help

If I instead drop --class and pass the class name as a plain argument:

gcloud dataproc jobs submit hadoop --cluster tony-dev \
  --jar /usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter -- \
  --python_venv=/usr/local/src/jobs/MyJob/mnist_venv.zip \
  --src_dir=/usr/local/src/jobs/MyJob/ \
  --executes=/usr/local/src/jobs/MyJob/src/mnist_distributed.py \
  --conf_file=/usr/local/src/jobs/MyJob/tony.xml \
  --python_binary_path=venv/bin/python3.5

I get:

ERROR: (gcloud.dataproc.jobs.submit.hadoop) unrecognized arguments: com.linkedin.tony.cli.ClusterSubmitter

Reference here.


Solution

  • It was a simple change: changing --jar to --jars (while keeping --class) made it work. The error message says exactly one of --class or --jar may be specified, so when the main class is given via --class, the jar has to be supplied through --jars instead.
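
    For reference, a sketch of the corrected submission, simply mirroring the original command with --jar swapped for --jars (paths kept exactly as in the question; in practice Dataproc job resources are often staged in GCS or HDFS rather than local paths):

    gcloud dataproc jobs submit hadoop --cluster tony-dev \
      --jars /usr/local/src/jobs/MyJob/tony-cli-0.1.5-all.jar \
      --class com.linkedin.tony.cli.ClusterSubmitter -- \
      --python_venv=/usr/local/src/jobs/MyJob/mnist_venv.zip \
      --src_dir=/usr/local/src/jobs/MyJob/ \
      --executes=/usr/local/src/jobs/MyJob/src/mnist_distributed.py \
      --conf_file=/usr/local/src/jobs/MyJob/tony.xml \
      --python_binary_path=venv/bin/python3.5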