Tags: java, scala, apache-spark, hbase, distributed-computing

Spark - change parallelism during execution


I have a job divided in two parts:

  • The first part retrieves data from HBase using Spark
  • The second part runs heavy, CPU-intensive ML algorithms

The issue is that with a high number of executors/cores, the HBase cluster is queried too aggressively, which may cause instability in production. With too few executors/cores, the ML computations take a long time to run.
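
For illustration, the job has roughly this shape (the table name is a placeholder, and a trivial per-row computation stands in for the real ML phase):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object TwoPhaseJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-ml-job"))

    // Part 1: scan the HBase table. Read parallelism follows the region
    // count and the executors/cores fixed at submission time.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name

    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Part 2: the CPU-intensive ML phase would run on `rows`; here a simple
    // per-row computation stands in for feature extraction and training.
    val cellCount = rows.map { case (_, result) => result.size().toLong }.sum()
    println(s"Total cells scanned: $cellCount")

    sc.stop()
  }
}
```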

As the number of executors and cores is set at startup, I would like to know if there is a way to decrease the number of executors for the first part of the job.

I would obviously like to avoid running two separate jobs, as Hadoop would, with mandatory disk serialization between the two steps.

Thanks for your help


Solution

  • I guess dynamic allocation is what you are looking for; it is something you can use with Spark Streaming as well (a configuration sketch follows below).

    I think you may also have to play a little with your RDD size to balance data ingestion and data processing, but depending on what your real use case is, that can be quite challenging.
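
As a reference point, here is a minimal sketch of enabling dynamic allocation on the SparkConf; the property names are the standard Spark ones, but the min/max/timeout values are illustrative and would need tuning for your cluster:

```scala
import org.apache.spark.SparkConf

// Minimal sketch: standard dynamic-allocation properties; the concrete
// bounds below are illustrative, not recommendations.
val conf = new SparkConf()
  .setAppName("hbase-ml-job")
  .set("spark.dynamicAllocation.enabled", "true")
  // Dynamic allocation relies on the external shuffle service so that
  // executors can be released without losing shuffle files.
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")   // limits pressure on HBase when few tasks are pending
  .set("spark.dynamicAllocation.maxExecutors", "50")  // headroom for the CPU-bound ML phase
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
```

Spark then grows and shrinks the executor pool based on the backlog of pending tasks in each stage, so the read and compute phases can end up with different numbers of executors.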