Tags: airflow, google-cloud-dataproc, orchestration, google-cloud-dataproc-serverless

Use Google Cloud Workflows to trigger Dataproc Batch job


My scenario demands orchestration, since the jobs in a flow (say, a DAG) are interdependent. Cloud Composer is too expensive since we only have a few jobs to run; it isn't worth it.

I've been looking around, and it seems Google Cloud Workflows could help me orchestrate my workflows/DAGs.

But I haven't been able to find any documentation or example of triggering a Dataproc Batch job from the Workflows YAML file.

Triggering a Cloud Function that starts a Dataproc Batch job via the SDK is not an option, since (as I said) I need to know when one task finishes in order to start the next one. With Functions I wouldn't have that kind of control.

Do you have any idea how (and whether it's possible) to create a Dataproc Batch job from a Google Cloud Workflow?


Solution

  • Yes, it's possible! Since Workflows steps can make http.post requests, you can call the Dataproc batches.create REST API directly (POST https://dataproc.googleapis.com/v1/projects/{project}/locations/{location}/batches).

    Then use http.get against batches.get to await the execution of the Dataproc job, polling its "state" field until it reaches a terminal value ("SUCCEEDED", "FAILED", or "CANCELLED"); see the sketch below.
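
    A minimal sketch of such a workflow, assuming a PySpark batch and placeholder values for the project, region, batch ID, and GCS script path (swap these in for your own job type and names):

    ```yaml
    main:
      steps:
        - init:
            assign:
              # Placeholder values -- replace with your own.
              - project: "my-project"
              - region: "us-central1"
              - batch_id: "my-batch-001"   # must be unique per batch
              - base_url: ${"https://dataproc.googleapis.com/v1/projects/" + project + "/locations/" + region + "/batches"}
        - create_batch:
            # batches.create: submit the Dataproc Serverless batch job.
            call: http.post
            args:
              url: ${base_url + "?batchId=" + batch_id}
              auth:
                type: OAuth2
              body:
                pysparkBatch:
                  mainPythonFileUri: "gs://my-bucket/my-job.py"   # placeholder
            result: create_response
        - poll_batch:
            # batches.get: fetch the current state of the batch.
            call: http.get
            args:
              url: ${base_url + "/" + batch_id}
              auth:
                type: OAuth2
            result: batch
        - check_state:
            switch:
              - condition: ${batch.body.state == "SUCCEEDED"}
                next: done
              - condition: ${batch.body.state == "FAILED" or batch.body.state == "CANCELLED"}
                next: failed
            next: wait
        - wait:
            # Not terminal yet (PENDING/RUNNING): sleep, then poll again.
            call: sys.sleep
            args:
              seconds: 30
            next: poll_batch
        - failed:
            raise: ${"Batch ended in state " + batch.body.state}
        - done:
            return: ${batch.body}
    ```

    Because the workflow blocks on the poll loop until a terminal state, you can chain further steps after `done` and get exactly the task-dependency control the question asks for. Note that the service account the workflow runs as needs permission to create and read Dataproc batches.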