python, google-cloud-platform, google-cloud-run, dataflow

Running large pipelines on GCP


I want to scale a one-off pipeline I currently run locally out to the cloud.

  1. The script reads from a large (30 TB), static S3 bucket made up of PDFs
  2. I pass these PDFs through a ThreadPool to a Docker container, which produces an output for each
  3. I save the output to a file.

I can only test it locally on a small fraction of this dataset. The whole pipeline would take a couple of days to run on a MacBook Pro.
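For context, here is a minimal sketch of what the local version of this pipeline looks like. The bucket name, container image, worker count, and one-output-file-per-PDF convention are placeholders, not the actual setup:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

BUCKET = "my-pdf-bucket"           # placeholder bucket name
IMAGE = "my-pdf-processor:latest"  # placeholder container image
OUT_DIR = Path("output")
OUT_DIR.mkdir(exist_ok=True)

s3 = boto3.client("s3")

def process_pdf(key: str) -> None:
    """Download one PDF, run it through the container, save the result."""
    local_pdf = Path("/tmp") / Path(key).name
    s3.download_file(BUCKET, key, str(local_pdf))
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{local_pdf.parent}:/data",
         IMAGE, f"/data/{local_pdf.name}"],
        capture_output=True, text=True, check=True,
    )
    (OUT_DIR / f"{local_pdf.stem}.txt").write_text(result.stdout)

# List every PDF key in the bucket, then process them concurrently.
paginator = s3.get_paginator("list_objects_v2")
keys = [obj["Key"]
        for page in paginator.paginate(Bucket=BUCKET)
        for obj in page.get("Contents", [])
        if obj["Key"].endswith(".pdf")]

with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(process_pdf, keys)
```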

I've been trying to replicate this on GCP, which I am still getting to know.

  • Cloud Functions doesn't work well because of its maximum timeout
  • A full Cloud Composer architecture seems like overkill for a very straightforward pipeline that doesn't require Airflow.
  • I'd like to avoid coding this in Apache Beam format for Dataflow.

What is the best way to run such a Python data-processing pipeline with a container on GCP?


Solution

  • Thanks to the useful comments on the original post, I explored other alternatives on GCP.

    Using a VM on Compute Engine worked perfectly. The overhead was much lower than I expected, and the setup went smoothly.
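    For anyone following the same route, below is a rough sketch of creating such a VM with the google-cloud-compute Python client. The project ID, zone, machine type, image family, and disk size are placeholders, and you could just as well create the instance from the Cloud Console or with gcloud:

```python
from google.cloud import compute_v1

PROJECT = "my-gcp-project"   # placeholder project ID
ZONE = "us-central1-a"       # placeholder zone
NAME = "pdf-pipeline-vm"

# Boot disk from a public Debian image, sized for intermediate files.
disk = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-12",
        disk_size_gb=200,
    ),
)

# Default VPC network with an ephemeral external IP for internet access.
nic = compute_v1.NetworkInterface(
    network="global/networks/default",
    access_configs=[compute_v1.AccessConfig(name="External NAT", type_="ONE_TO_ONE_NAT")],
)

instance = compute_v1.Instance(
    name=NAME,
    machine_type=f"zones/{ZONE}/machineTypes/e2-standard-16",
    disks=[disk],
    network_interfaces=[nic],
)

client = compute_v1.InstancesClient()
operation = client.insert(project=PROJECT, zone=ZONE, instance_resource=instance)
operation.result()  # block until the VM is created
print(f"Created {NAME}")
```

    Once the VM is up, SSH in, install Docker, copy the script over, and let it run: it is the same pipeline as on the laptop, just on a bigger machine that can stay on for days.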