data-processing dataproc google-cloud-dataproc-serverless

Dataproc Workflow (ephemeral cluster) or Dataproc Serverless for batch processing?


GCP Dataproc offers both a serverless option (Dataproc Serverless) and ephemeral clusters (via Dataproc workflow templates) for Spark batch processing.

If Dataproc Serverless hides the infrastructure complexity, what is the business use case for running Spark batch jobs on an ephemeral cluster through a Dataproc workflow template?


Solution

  • Serverless is the better choice in most cases because it removes the friction of maintaining complex clusters over time (and trust me, when cluster settings change this can take far more work than developing the application itself). If you are migrating from another platform with cluster settings and libraries already defined, however, a Dataproc cluster may be the better choice. The cluster approach is also better if the team needs the cluster for other purposes, for example analytics work in computational notebooks.
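
To make the trade-off concrete, here is a rough sketch of what each option asks you to specify, written against the google-cloud-dataproc Python client. The bucket path, cluster name, machine types, and worker count are hypothetical placeholders, not recommendations, and the `submit` helper assumes the library is installed and GCP credentials are configured.

```python
# Hypothetical script location; replace with your own bucket and file.
MAIN_FILE = "gs://example-bucket/jobs/etl.py"


def serverless_batch(main_file: str) -> dict:
    """Dataproc Serverless: describe only the job; there is no cluster to size."""
    return {"pyspark_batch": {"main_python_file_uri": main_file}}


def ephemeral_cluster_template(main_file: str, workers: int = 2) -> dict:
    """Workflow template: a managed cluster is created, runs the job, then is deleted."""
    machine = "n1-standard-4"  # assumed machine type, for illustration only
    return {
        "placement": {
            "managed_cluster": {
                "cluster_name": "ephemeral-etl",  # hypothetical name
                "config": {
                    "master_config": {"num_instances": 1, "machine_type_uri": machine},
                    "worker_config": {"num_instances": workers, "machine_type_uri": machine},
                },
            }
        },
        "jobs": [{"step_id": "etl", "pyspark_job": {"main_python_file_uri": main_file}}],
    }


def submit(project: str, region: str, use_serverless: bool) -> None:
    """Send the chosen payload and wait for it to finish."""
    # Imported lazily so the payload builders above run without the library.
    from google.cloud import dataproc_v1

    options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    if use_serverless:
        client = dataproc_v1.BatchControllerClient(client_options=options)
        operation = client.create_batch(
            request={
                "parent": f"projects/{project}/locations/{region}",
                "batch": serverless_batch(MAIN_FILE),
            }
        )
    else:
        client = dataproc_v1.WorkflowTemplateServiceClient(client_options=options)
        operation = client.instantiate_inline_workflow_template(
            request={
                "parent": f"projects/{project}/regions/{region}",
                "template": ephemeral_cluster_template(MAIN_FILE),
            }
        )
    operation.result()  # blocks until the batch or workflow completes
```

The serverless payload carries only the job, while the workflow template also carries the cluster definition. That extra block is where existing cluster settings from a migration would go, and it is also the part you have to keep maintaining.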