Tags: python, hadoop, google-cloud-platform, google-cloud-dataproc, google-cloud-storage

Can you trigger Python Scripts from Dataproc?


I am experimenting with GCP. I have a local Hadoop environment consisting of files stored on HDFS and a bunch of Python scripts that make API calls and trigger Pig jobs. These Python jobs are scheduled via cron.

I want to understand the best way to do something similar in GCP. I understand that GCS can serve as an HDFS replacement, and that Dataproc can spin up Hadoop clusters and run Pig jobs.

Is it possible to store these Python scripts in GCS, spin up Hadoop clusters on a cron-like schedule, and point them at the Python scripts in GCS to run?
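For reference, the lifecycle described above (spin up a cluster, run a GCS-hosted Python script, tear the cluster down) can be driven from a single cron entry that shells out to `gcloud`. A minimal sketch, where the cluster name, bucket, and script path are placeholders, not values from this question:

```python
import shlex
import subprocess

# Placeholder names -- substitute your own cluster, region, and GCS path.
CLUSTER = "nightly-etl"
REGION = "us-central1"
SCRIPT = "gs://my-bucket/jobs/etl.py"


def gcloud_steps(cluster=CLUSTER, region=REGION, script=SCRIPT):
    """Return the gcloud invocations the cron job would run, in order:
    create an ephemeral cluster, submit the GCS-hosted Python script as a
    PySpark job, then delete the cluster."""
    return [
        ["gcloud", "dataproc", "clusters", "create", cluster,
         f"--region={region}", "--single-node"],
        ["gcloud", "dataproc", "jobs", "submit", "pyspark", script,
         f"--cluster={cluster}", f"--region={region}"],
        ["gcloud", "dataproc", "clusters", "delete", cluster,
         f"--region={region}", "--quiet"],
    ]


def run_pipeline():
    # check=True aborts the pipeline on the first failing step, so a
    # failed job submission still leaves the delete step visible in logs.
    for cmd in gcloud_steps():
        print("+", shlex.join(cmd))
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Printing the commands is side-effect free; run_pipeline() actually
    # invokes gcloud and needs an authenticated environment.
    for cmd in gcloud_steps():
        print(shlex.join(cmd))
```

Note this uses `gcloud dataproc jobs submit pyspark`, which runs a Python file directly, rather than the Pig workaround in the accepted solution; both point at scripts stored in GCS.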


Solution

  • I discovered that you can run Python scripts on Dataproc through a 'submit pig' job. Pig's grunt shell can run arbitrary Bash commands via its `sh` command, so a Bash script copied down from GCS can in turn call your Python scripts:

    gcloud dataproc jobs submit pig \
        --cluster=test-python-exec \
        --region=us-central1 \
        -e='fs -cp -f gs://testing_dataproc/main/execution/execute_python.sh file:///tmp/execute_python.sh; sh chmod 750 /tmp/execute_python.sh; sh /tmp/execute_python.sh'
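The three grunt commands inside `-e=` copy the script out of GCS onto the cluster node, make it executable, and run it. The same copy/chmod/run pattern can be tried locally; here the heredoc stands in for the GCS copy, and the Python one-liner is a placeholder for whatever `execute_python.sh` really calls:

```shell
#!/bin/bash
set -e

# Stand-in for `fs -cp -f gs://.../execute_python.sh file:///tmp/...`:
# locally we just write the wrapper script instead of copying it from GCS.
cat > /tmp/execute_python.sh <<'EOF'
#!/bin/bash
# Placeholder for the real job -- in practice this would invoke your
# Python scripts that make API calls and trigger Pig jobs.
python3 -c 'print("hello from python")'
EOF

# Mirrors `sh chmod 750 /tmp/execute_python.sh` from the pig job.
chmod 750 /tmp/execute_python.sh

# Mirrors `sh /tmp/execute_python.sh`.
/tmp/execute_python.sh
```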