I am trying to use a REST service to trigger Spark jobs via the Dataproc API client. However, each job inside the Dataproc cluster takes 10-15 s to initialize the Spark driver and submit the application. Is there an effective way to eliminate this initialization time for Spark Java jobs triggered from a JAR file in a GCS bucket? Some solutions I am thinking of are:
Is there a more effective way? How would I implement the approaches above in Google Dataproc?
Instead of writing this logic yourself, you may want to investigate the Spark Job Server (https://github.com/spark-jobserver/spark-jobserver), as it should allow you to reuse Spark contexts across jobs.
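To illustrate the call pattern, here is a hedged sketch of a client hitting the job server's REST API (the endpoint paths follow the spark-jobserver README; `shared-ctx`, `myapp`, and `com.example.MyJob` are example names). Since a real job server isn't available here, a local `HttpServer` stubs the responses so the snippet is self-contained:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.*;
import java.net.*;

public class JobServerClientSketch {
    public static void main(String[] args) throws Exception {
        // Stub standing in for a real spark-jobserver (normally on port 8090).
        HttpServer stub = HttpServer.create(new InetSocketAddress(0), 0);
        stub.createContext("/", ex -> {
            byte[] body = "{\"status\": \"OK\"}".getBytes();
            ex.sendResponseHeaders(200, body.length);
            ex.getResponseBody().write(body);
            ex.close();
        });
        stub.start();
        int port = stub.getAddress().getPort();

        // Against a real job server you would issue, per its README:
        //   POST /contexts/shared-ctx   -- create one long-lived context up front
        //   POST /jars/myapp            -- upload the application jar once
        //   POST /jobs?...&context=shared-ctx  -- run jobs with no driver start-up cost
        System.out.println(post(port, "/contexts/shared-ctx"));
        System.out.println(post(port,
            "/jobs?appName=myapp&classPath=com.example.MyJob&context=shared-ctx&sync=true"));
        stub.stop(0);
    }

    static String post(int port, String path) throws IOException {
        HttpURLConnection c = (HttpURLConnection)
            new URL("http://localhost:" + port + path).openConnection();
        c.setRequestMethod("POST");
        c.setDoOutput(true);
        c.getOutputStream().close();  // empty POST body
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(c.getInputStream()))) {
            return r.readLine();
        }
    }
}
```

Because the context is created once and named, every subsequent `/jobs` request that references it skips driver initialization entirely, which is exactly the 10-15 s you are trying to avoid.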
Alternatively, you can write your own driver program for Dataproc that accepts RPCs from your REST server and reuses the SparkContext itself, then submit that driver via the Jobs API; personally, though, I would look at the job server first.
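The do-it-yourself variant can be sketched as a long-running driver that listens for job requests and dispatches each one against a single shared context. This is only a structural sketch: the `SharedContext` class below is a stand-in for a `JavaSparkContext` (Spark isn't on the classpath here), and the line-oriented socket protocol is an assumed, minimal RPC channel:

```java
import java.io.*;
import java.net.*;

public class LongRunningDriver {
    // Stand-in for the expensive-to-create JavaSparkContext: in a real
    // Dataproc driver you would construct the context once at startup.
    static final class SharedContext {
        final long createdAt = System.nanoTime();
        String run(String job) { return "ran " + job + " on ctx@" + createdAt; }
    }

    public static void main(String[] args) throws Exception {
        SharedContext ctx = new SharedContext();           // created exactly once
        try (ServerSocket server = new ServerSocket(0)) {  // ephemeral port for the demo
            int port = server.getLocalPort();
            Thread listener = new Thread(() -> {
                try {
                    while (true) {
                        try (Socket s = server.accept();
                             BufferedReader in = new BufferedReader(
                                 new InputStreamReader(s.getInputStream()));
                             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                            String job = in.readLine();
                            if (job != null) {
                                out.println(ctx.run(job)); // reuse the same context
                            }
                        }
                    }
                } catch (IOException ignored) { }          // socket closed: shut down
            });
            listener.setDaemon(true);
            listener.start();

            // Simulated REST-server side: two requests, no driver restart between them.
            System.out.println(submit(port, "jobA"));
            System.out.println(submit(port, "jobB"));
        }
    }

    static String submit(int port, String job) throws IOException {
        try (Socket s = new Socket("localhost", port);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(s.getInputStream()))) {
            out.println(job);
            return in.readLine();
        }
    }
}
```

Both responses report the same context identifier, demonstrating that the second job paid no initialization cost. In the real setup you would submit this driver once through the Dataproc Jobs API and point your REST service at it.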