java multithreading threadpool executorservice

ExecutorService with huge number of tasks


I have a list of files and a list of analyzers that analyze those files. The number of files can be large (200,000), and so can the number of analyzers (1,000), so the total number of operations can be very large (200,000,000). I need multithreading to speed things up, and I took this approach:

ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
for (File file : listOfFiles) {
  for (Analyzer analyzer : listOfAnalyzers){
    executor.execute(() -> {
      boolean exists = file.exists();
      if(exists){
        analyzer.analyze(file);
      }
    });
  }
}
executor.shutdown();
executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);

The problem with this approach is that it uses too much memory, and I suspect there is a better way to do it. I'm still a beginner at Java and multithreading.


Solution

  • Where are 200M tasks going to reside? Not in memory, I hope, unless you plan to implement your solution in a distributed fashion. Executors.newFixedThreadPool uses an unbounded queue, so every task you submit sits in memory until a thread picks it up. In the meantime, you need to instantiate an ExecutorService that does not accumulate a massive queue. Create a ThreadPoolExecutor with a bounded queue and the "caller runs" policy (ThreadPoolExecutor.CallerRunsPolicy). If you try to put another task in the queue when it's already full, you'll end up executing it yourself on the submitting thread, which is probably what you want.

    OTOH, now that I look at your question more carefully, why not process one file at a time and run all of its analyzers concurrently? Then the queue is never larger than the number of analyzers. That's what I'd do, frankly, since I'd like a readable log that has a message for each file as I load it, in the correct order.

    I apologize for not being more helpful:

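    List<Future<?>> futures = analysts.stream().map(analyst -> executor.submit(() -> analyst.analyze(file))).collect(Collectors.toList());
    for (Future<?> future : futures) future.get(); // throws InterruptedException, ExecutionException

    Basically, create a bunch of futures for a single file, then wait for all of them before you move on. Collect the futures into a list first: streams are lazy, so without a terminal operation nothing is submitted, and mapping straight to Future::get would wait for each task before submitting the next one.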
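
For the first suggestion above (a bounded queue with the caller-runs policy), here is a minimal sketch of how it could look. The class name BoundedAnalysis, the method names, and the queueCapacity parameter are made up for illustration, and a generic BiConsumer stands in for analyzer.analyze(file) because the question does not show the Analyzer type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiConsumer;

// Hypothetical helper class; none of these names come from the question.
class BoundedAnalysis {

    // Fixed-size pool whose queue holds at most queueCapacity pending tasks.
    // When the queue is full, CallerRunsPolicy makes the submitting thread
    // run the task itself, which throttles submission.
    static ThreadPoolExecutor newBoundedExecutor(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Applies work to every (file, analyzer) pair without ever queueing
    // more than queueCapacity tasks at once.
    static <F, A> void analyzeAll(List<F> files, List<A> analyzers,
                                  BiConsumer<F, A> work,
                                  int threads, int queueCapacity) {
        ThreadPoolExecutor executor = newBoundedExecutor(threads, queueCapacity);
        for (F file : files) {
            for (A analyzer : analyzers) {
                executor.execute(() -> work.accept(file, analyzer));
            }
        }
        executor.shutdown();
        try {
            executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
        }
    }

    public static void main(String[] args) {
        AtomicLong done = new AtomicLong();
        List<String> files = Arrays.asList("a.txt", "b.txt", "c.txt");
        List<Integer> analyzers = Arrays.asList(1, 2, 3, 4);
        // Stand-in for analyzer.analyze(file): just count the calls.
        analyzeAll(files, analyzers, (file, analyzer) -> done.incrementAndGet(), 2, 2);
        System.out.println(done.get()); // prints 12 (3 files x 4 analyzers)
    }
}
```

In the question's code, the lambda would be something like (file, analyzer) -> { if (file.exists()) analyzer.analyze(file); }. The queue capacity is a tuning knob: a few thousand entries is usually enough to keep the workers busy while memory use stays flat, but that figure is a guess, not a measurement.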