java multithreading java-stream completable-future fileutils

Best way to parallelize thousands of downloads


I am creating an application in which I have to download thousands of images (~1 MB each) using Java.

My REST request takes a list of album URLs, and each album contains multiple images.

So my request looks something like:

[
  "www.abc.xyz/album1",
  "www.abc.xyz/album2",
  "www.abc.xyz/album3",
  "www.abc.xyz/album4",
  "www.abc.xyz/album5"
]

Suppose each of these albums has 1000 images, so I need to download 50000 images in parallel.

Right now I have implemented it using parallelStream(), but I feel that I can optimize it further.

There are two principal classes, AlbumDownloader and ImageDownloader (both Spring components).

So the main application creates a parallelStream() on the list of albums:

albumData.parallelStream().forEach(ad -> albumDownloader.downloadAlbum(ad));

There is also a parallelStream() inside the AlbumDownloader.downloadAlbum() method:

List<Boolean> downloadStatus = albumData.getImageDownloadData()
        .parallelStream()
        .map(idd -> imageDownloader.downloadImage(idd))
        .collect(Collectors.toList());

I am thinking about using CompletableFuture with an ExecutorService, but I am not sure what pool size I should use.

Should I create a separate pool for each Album?

ExecutorService executor = Executors.newFixedThreadPool(Math.min(albumData.getImageDownloadData().size(), 1000));

That would create 5 different pools of 1000 threads each, for a total of 5000 threads, which might degrade performance instead of improving it.

Could you please give me some ideas to make this as fast as possible?

By the way, I am using Apache Commons IO FileUtils to download the files, and my machine has 12 available CPU cores.


Solution

  • Suppose each of these albums has 1000 images, so I need to download 50000 images in parallel.

    It's wrong to think of your application doing 50000 things in parallel. What you are trying to do is to optimize your throughput – you are trying to download all of the images in the shortest amount of time.

    You should try a single fixed-size thread pool and then adjust the number of threads in the pool until you maximize your throughput; maybe start with double the number of processors. If your application is mostly waiting on the network or the server, then you can probably increase the number of threads in the pool, but you wouldn't want to overload the server so that it slows to a crawl, and you wouldn't want to thrash your application with a huge number of threads.
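
One way to do that tuning is to time the same batch of work at several pool sizes. The following is only a rough sketch, not a definitive benchmark: the class PoolSizeTrial and its methods timeBatch and simulatedDownloads are hypothetical names, and the "download" is simulated with Thread.sleep as a stand-in for network-bound work, so the numbers it prints say nothing about your real server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolSizeTrial {

    // Runs every task on one fixed-size pool and returns the elapsed
    // wall-clock time in milliseconds.
    static long timeBatch(int poolSize, List<Callable<Boolean>> tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long start = System.nanoTime();
        try {
            // invokeAll blocks until every task has completed.
            List<Future<Boolean>> results = pool.invokeAll(tasks);
            for (Future<Boolean> result : results) {
                result.get(); // surfaces any exception thrown by a task
            }
        } finally {
            pool.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Assumption: a stand-in for a network-bound download. It only sleeps;
    // replace it with the real download call when measuring for real.
    static List<Callable<Boolean>> simulatedDownloads(int count, long millisEach) {
        List<Callable<Boolean>> tasks = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            tasks.add(() -> {
                Thread.sleep(millisEach);
                return true;
            });
        }
        return tasks;
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        List<Callable<Boolean>> tasks = simulatedDownloads(200, 20);
        // Start at the core count and double from there, comparing timings.
        for (int size : new int[] {cores, 2 * cores, 4 * cores, 8 * cores}) {
            System.out.println(size + " threads: " + timeBatch(size, tasks) + " ms");
        }
    }
}
```

With real downloads, the elapsed time would be expected to stop improving (or get worse) once the network, the server, or the local disk becomes the bottleneck; that is roughly the pool size to settle on.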

  • That would create 5 different pools of 1000 threads each, for a total of 5000 threads, which might degrade performance instead of improving it.

    I see no point in multiple pools unless there are different servers for each album or some other reason why the downloads from each album are different.
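
To illustrate the single-pool idea with CompletableFuture, here is a minimal sketch under stated assumptions. The class SharedPoolDownload, the method downloadAll, the map of album names to image URLs, and the downloadImage function are all hypothetical stand-ins (the question's AlbumDownloader and ImageDownloader are not reproduced here); the pool size of twice the core count is only a starting point.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SharedPoolDownload {

    // Submits every image of every album to the same pool and returns
    // how many downloads reported success.
    static long downloadAll(Map<String, List<String>> imagesByAlbum,
                            Function<String, Boolean> downloadImage,
                            ExecutorService pool) {
        // Create all futures first so that every task is queued before we wait.
        List<CompletableFuture<Boolean>> futures = imagesByAlbum.values().stream()
                .flatMap(List::stream)
                .map(url -> CompletableFuture
                        .supplyAsync(() -> downloadImage.apply(url), pool)
                        .exceptionally(ex -> false)) // a failed download counts as false
                .collect(Collectors.toList());

        // join() waits for each future in turn; the work itself runs on the pool.
        return futures.stream()
                .map(CompletableFuture::join)
                .filter(ok -> ok)
                .count();
    }

    public static void main(String[] args) {
        // Starting point only; tune the size by measuring throughput.
        int poolSize = 2 * Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Map<String, List<String>> albums = Map.of(
                    "album1", List.of("a.jpg", "b.jpg"),
                    "album2", List.of("c.jpg"));
            // Hypothetical downloader that always succeeds; plug in the real one here.
            long succeeded = downloadAll(albums, url -> true, pool);
            System.out.println(succeeded + " images downloaded");
        } finally {
            pool.shutdown();
        }
    }
}
```

Because all albums feed one queue, the thread count stays fixed no matter how many albums arrive in a request, which is the main reason to prefer this over one pool per album.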