Search code examples
apache-sparkgoogle-cloud-dataprocdataproc

best way to run 300+ concurrent spark jobs in Dataproc?


I have a Dataproc cluster with 2 worker nodes (n1s2). There is an external server which submits around 360 spark jobs within an hour (with a couple of minutes spacing between each submission). The first job completes successfully but the subsequent ones get stuck and do not proceed at all.

Each job crunches some timeseries numbers and writes to Cassandra. And the time taken is usually 3-6 minutes when the cluster is completely free.

I feel this can be solved by just scaling up the cluster, but would become very costly for me. What would be the other options to best solve this use case?


Solution

  • Running 300+ concurrent jobs on a 2 worker nodes cluster doesn't sound like feasible. You want to first estimate how much resource (CPU, memory, disk) each job needs then make a plan for the cluster size. YARN metrics like available CPU, available memory, especially pending memory would be helpful for identifying the situation where it is lack of resources.