apache-spark, concurrency, streaming, jobs

Spark Direct Stream Concurrent Job Limit


I am running a Spark direct stream from Kafka, where I need to run many concurrent jobs in order to process all the data in time. In Spark you can set spark.streaming.concurrentJobs to the number of concurrent jobs you want to run.
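
For context, here is a minimal sketch of where that property gets set, assuming Spark 2.x with the spark-streaming-kafka-0-10 connector; the broker address, topic, group id, and the value of 4 are placeholders, not recommendations:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf()
  .setAppName("DirectStreamExample")
  // Undocumented knob: number of streaming jobs allowed to run at once.
  // The default is 1; raising it lets batch jobs overlap. Must be set
  // before the StreamingContext is created. The value 4 is a placeholder.
  .set("spark.streaming.concurrentJobs", "4")

val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092", // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",        // placeholder group id
  "auto.offset.reset" -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("example-topic"), kafkaParams)
)

stream.foreachRDD { rdd =>
  // Per-batch processing goes here.
  println(s"batch size: ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()
```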

What I want to know is a logical way to determine how many concurrent jobs I can run in my environment. For privacy reasons I cannot share my company's exact specs, but I would like to know which specs are relevant in determining the limit, and why.

Of course, the alternative is to keep increasing the value, testing, and adjusting based on the results, but I would prefer a more logical approach: I want to actually understand what determines that limit and why.


Solution

  • Testing different numbers of concurrent jobs and comparing the overall execution time is the most reliable method. That said, I would expect the best number to be roughly the value of Runtime.getRuntime().availableProcessors().

    So my advice is to start with that number of available processors, then increase and decrease it by 1, 2, and 3. Then make a chart of execution time against the number of jobs, and you'll see the optimal job count. A sketch of this starting point follows below.
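
A minimal sketch of that starting point, with the caveat that Runtime.getRuntime().availableProcessors() reads the core count of the JVM it runs in (the driver), which is only a rough proxy for the parallelism available across a cluster's executors:

```scala
import org.apache.spark.SparkConf

// Seed the concurrency level with the local core count.
val cores = Runtime.getRuntime.availableProcessors()

val conf = new SparkConf()
  .setAppName("ConcurrentJobsTuning")
  .set("spark.streaming.concurrentJobs", cores.toString)

// For the tuning sweep, rerun the application with values such as
// cores - 3 through cores + 3, record each run's total execution time,
// and chart time against job count to locate the optimum.
```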