I have a Spark job divided into two parts: the first part reads data from HBase, and the second part runs ML computations on that data.
The issue is that with a high number of executors/cores, the HBase cluster is queried too aggressively, which can cause production instability. With too few executors/cores, the ML computations take a long time.
Since the number of executors and cores is set at startup, I would like to know if there is a way to decrease the number of executors for the first part of the job.
I would obviously like to avoid running two separate jobs, as Hadoop would do, with mandatory disk serialization between the two steps.
Thanks for your help
I guess dynamic allocation is what you are looking for. It is something you can use with Spark Streaming as well.
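As a minimal sketch of what enabling it looks like at submit time (the property names come from Spark's dynamic allocation configuration; the numbers and `your-app.jar` are placeholders you would tune for your cluster):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  your-app.jar
```

Note that the external shuffle service (`spark.shuffle.service.enabled=true`) is required for dynamic allocation to release executors safely, and `maxExecutors` gives you a hard cap on how many executors can ever run concurrently.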
I think you may also have to play a little with your RDD size to balance data ingestion and data processing, but depending on your real use case that can be quite challenging.
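For example, one way to limit how hard the ingest stage hits HBase is to cap the number of partitions (and hence concurrent tasks) for that stage, then spread the data back out for the ML stage. A sketch, assuming `hbaseRDD` is the RDD you build from your HBase scan and `runMl` stands for your ML step (both are placeholders, not real APIs):

```scala
// coalesce without a shuffle stays in the same stage as the HBase read,
// so only 8 tasks run concurrently against the region servers.
val throttled = hbaseRDD.coalesce(8)

// Once the data is in Spark, repartition to get full parallelism
// for the CPU-bound ML computations.
val wide = throttled.repartition(200)
val result = runMl(wide)
```

The trade-off is that each of the 8 ingest tasks reads more data, so the first stage is slower but gentler on HBase, while the ML stage still uses the whole cluster.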