Hello fellow developers,
I have recently started learning GCP and I'm working on a POC that requires a pipeline able to schedule Dataproc jobs written in PySpark. So far I have created a Jupyter notebook on my Dataproc cluster that reads data from GCS and writes it to BigQuery. It works fine in Jupyter, but I want to run that notebook's logic as part of a pipeline.
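For context, the notebook logic is roughly of this shape (a minimal sketch; the bucket, dataset, and table names below are placeholders, not my actual ones):

```python
from pyspark.sql import SparkSession

# Placeholder paths and table names -- my real notebook uses project-specific values.
GCS_INPUT = "gs://my-bucket/input/*.csv"
BQ_TABLE = "my_project.my_dataset.my_table"
TEMP_BUCKET = "my-temp-bucket"

spark = SparkSession.builder.appName("gcs_to_bigquery").getOrCreate()

# Read source files from Cloud Storage.
df = spark.read.option("header", True).csv(GCS_INPUT)

# Write to BigQuery via the spark-bigquery connector (available on Dataproc
# images or by adding the connector jar), staging through a temporary bucket.
df.write.format("bigquery") \
    .option("table", BQ_TABLE) \
    .option("temporaryGcsBucket", TEMP_BUCKET) \
    .mode("overwrite") \
    .save()
```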
On Azure we can schedule pipeline runs using Azure Data Factory; which GCP tool would help me achieve similar results?
My goal is to schedule the run of multiple Dataproc jobs.
Yes, you can do that by creating a Dataproc workflow template and scheduling it with Cloud Composer; see this doc for more details.
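For reference, here is a minimal sketch of what the Composer (Airflow) DAG could look like, assuming you have already created a workflow template that contains your PySpark job. The project ID, region, template ID, and schedule below are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocInstantiateWorkflowTemplateOperator,
)

# Placeholder values -- replace with your own project, region, and template ID.
PROJECT_ID = "my-project"
REGION = "us-central1"
TEMPLATE_ID = "my-dataproc-workflow"

with DAG(
    dag_id="schedule_dataproc_workflow",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # example schedule; any cron expression works
    catchup=False,
) as dag:
    # Instantiates the Dataproc workflow template, which in turn runs
    # the PySpark job(s) defined in that template.
    run_workflow = DataprocInstantiateWorkflowTemplateOperator(
        task_id="run_dataproc_workflow",
        template_id=TEMPLATE_ID,
        project_id=PROJECT_ID,
        region=REGION,
    )
```

You can add more templates or jobs as additional tasks in the same DAG if you need to chain multiple Dataproc jobs.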
Data Fusion, on the other hand, won't let you schedule Dataproc jobs written in PySpark, since it is a code-free tool for building and deploying ETL/ELT pipelines. For your specific requirement, though, you could skip the PySpark job entirely and instead create and schedule a pipeline in Data Fusion that pulls data from GCS and loads it into BigQuery.