I need to make different GET queries to a server to download a bunch of json files and write each download to disk and I want to launch some threads to speed that up.
Each download and the writing of each file takes approximately 0.35 seconds.
I would like to know whether, at least under Linux (and under Windows, since we are here), it is safe to write to disk in parallel, and how many threads I can launch taking into account the waiting time of each thread.
If it changes anything (I think it does): the program doesn't write directly to disk. It just calls std::system to run wget, because that is currently easier than importing a library. So the waiting time is the time the system call takes to return.
So each write to disk is performed by a different process. I only wait for that program to finish, and I'm not actually bound by I/O but by the running time of an external process (each wget call creates and writes to a different file, so they are completely independent processes). Each thread just waits for one call to complete.
My machine has 4 CPUs.
Some kind of formula to get an ideal number of threads according to CPU concurrency and "waiting time" per thread would be welcome.
NOTE: The ideal solution would of course be to do some performance testing, but I could get banned from the server if I abuse it with too many requests.
It is safe to do concurrent file I/O from multiple threads, but if you are concurrently writing to the same file, some form of synchronization is necessary to ensure that the writes to the file don't become interleaved.
For what you describe as your problem, it is perfectly safe to fetch each JSON blob in a separate thread and write them to different, unique files (in fact, this is probably the sanest, simplest design). Given that you mention running on a 4-core machine, I would expect to see a speed-up well past the four concurrent thread mark; network and file I/O tends to do quite a bit of blocking, so you'll probably run into a bottleneck with network I/O (or on the server's ability to send) before you hit a processing bottleneck.
Write your code so that you can control the number of threads that are spawned, and benchmark different numbers of threads. I'll guess that your sweet spot will be somewhere between 8 and 16 threads.