Tags: hadoop, apache-spark, google-cloud-dataproc

What is the best way to minimize the initialization time for Apache Spark jobs on Google Dataproc?


I am trying to use a REST service to trigger Spark jobs via the Dataproc API client. However, each job inside the Dataproc cluster takes 10-15 seconds to initialize the Spark driver and submit the application. I am wondering if there is an effective way to eliminate this initialization time for Spark Java jobs triggered from a JAR file in a GCS bucket. Some solutions I am thinking of are:

  1. Pooling a single JavaSparkContext instance that can be reused across Spark jobs
  2. Starting a single long-running job and performing all Spark-based processing inside it (see the sketch after this list)
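
As a rough illustration of the second idea, here is a minimal sketch of a long-running driver that creates one JavaSparkContext and reuses it for every incoming request. The socket-based protocol, port 9999, and the line-count workload are all assumptions made for illustration; the actual interface between the REST service and the driver would be whatever you define.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical long-running driver: the JavaSparkContext is created once and
// reused for every request, so per-request latency excludes driver startup.
public class LongRunningDriver {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("long-running-driver");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Listen for work requests on an arbitrary port (9999 is an assumption).
    try (ServerSocket server = new ServerSocket(9999)) {
      while (true) {
        try (Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {

          // Each request line is treated as a GCS path to process; the real
          // protocol between the REST service and this driver is up to you.
          String inputPath = in.readLine();
          if (inputPath == null || inputPath.isEmpty()) {
            continue;
          }

          // Run the Spark work with the already-initialized context.
          long lineCount = sc.textFile(inputPath).count();
          out.println("lines=" + lineCount);
        } catch (Exception e) {
          // Keep the driver alive even if a single request fails.
          e.printStackTrace();
        }
      }
    }
  }
}
```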

Is there a more effective way? How would I implement the above approaches on Google Dataproc?


Solution

  • Instead of writing this logic yourself, you may want to investigate the Spark Job Server: https://github.com/spark-jobserver/spark-jobserver, as it should allow you to reuse Spark contexts.

    You can also write a driver program for Dataproc which accepts RPCs from your REST server and reuses the SparkContext yourself, and then submit that driver via the Jobs API, but I personally would look at the job server first.
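
    If you do go the driver route, the driver only needs to be submitted once; after that your REST service talks to it directly. Below is a minimal sketch of that one-time submission, assuming the google-cloud-dataproc Java client (com.google.cloud.dataproc.v1); the project, region, cluster name, and JAR URI are placeholders.

    ```java
    import com.google.cloud.dataproc.v1.Job;
    import com.google.cloud.dataproc.v1.JobControllerClient;
    import com.google.cloud.dataproc.v1.JobControllerSettings;
    import com.google.cloud.dataproc.v1.JobPlacement;
    import com.google.cloud.dataproc.v1.SparkJob;

    // Hypothetical one-time submission of the long-running driver via the Jobs API.
    public class SubmitDriverJob {
      public static void main(String[] args) throws Exception {
        String projectId = "my-project";                              // placeholder
        String region = "us-central1";                                // placeholder
        String clusterName = "my-cluster";                            // placeholder
        String jarUri = "gs://my-bucket/long-running-driver.jar";     // placeholder

        // Point the client at the regional Dataproc endpoint.
        JobControllerSettings settings = JobControllerSettings.newBuilder()
            .setEndpoint(region + "-dataproc.googleapis.com:443")
            .build();

        try (JobControllerClient client = JobControllerClient.create(settings)) {
          SparkJob sparkJob = SparkJob.newBuilder()
              .setMainJarFileUri(jarUri)
              .build();

          Job job = Job.newBuilder()
              .setPlacement(JobPlacement.newBuilder().setClusterName(clusterName))
              .setSparkJob(sparkJob)
              .build();

          // Submit once; the driver then stays up and serves requests from the REST service.
          Job submitted = client.submitJob(projectId, region, job);
          System.out.println("Submitted job: " + submitted.getReference().getJobId());
        }
      }
    }
    ```

    Note that the client is pointed at a regional endpoint; if your cluster lives in a specific region, submissions generally need to go through that region's endpoint rather than the global one.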