Tags: javascript, node.js, multithreading, io, node-worker-threads

Node.js performance in file system I/O across multiple disk drives: worker threads or not?


I've read several questions and answers here about the performance benefits of Node.js's ability to handle file I/O in a non-blocking way versus using worker threads with either blocking or non-blocking requests, but none seem to answer the question I have.

I'm writing a Node.js application that will be opening, hashing, and writing very large files (multiple gigs) that are stored on multiple hard drives. I'm exploring the idea of worker threads, as they'd allow me to isolate commands to a particular hard drive. For example: assume I have a thread handling copying one file on hard drive A to hard drive B, and another thread handling copying one file from hard drive C to hard drive D.

Assuming I scale this to many more hard drives all operating at the same time, does it make more sense to just use Node.js without worker threads and let it handle all these requests, or do worker threads make more sense if I can isolate I/O by drive and handle multiple drives' worth of requests simultaneously?

Given what I've read, worker threads seem like the obvious solution, but I've also seen that just letting the single Node.js process handle a queue of file I/O is generally faster. Thanks for any guidance you can offer!


Solution

  • EDIT

    Apparently (based on a comment below), nodejs has only one thread pool shared across all the worker threads. If that's the case, then the only way to get a separate pool per disk would be to use multiple processes, not multiple threads.

    Or, you could enlarge the libuv thread pool (via the UV_THREADPOOL_SIZE environment variable, set before the pool is created) and then build your own queuing system that only puts a couple of requests for each separate disk into the pool at a time, giving you more parallelism across separate drives.
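The queuing idea above could be sketched as follows. This is a hypothetical illustration, not an existing library: `DriveQueue` and `maxPerDrive` are made-up names, and the queued tasks stand in for real `fs` calls that would land on the shared libuv pool.

```javascript
// Hypothetical per-drive queue: caps how many requests per drive are
// in flight at once, so one busy drive can't monopolize the libuv pool.
class DriveQueue {
  constructor(maxPerDrive = 2) {
    this.maxPerDrive = maxPerDrive;
    this.queues = new Map();   // drive -> waiting tasks
    this.inFlight = new Map(); // drive -> count of running tasks
  }

  // task: an async function performing the actual fs work for `drive`
  enqueue(drive, task) {
    return new Promise((resolve, reject) => {
      if (!this.queues.has(drive)) {
        this.queues.set(drive, []);
        this.inFlight.set(drive, 0);
      }
      this.queues.get(drive).push({ task, resolve, reject });
      this._drain(drive);
    });
  }

  _drain(drive) {
    const waiting = this.queues.get(drive);
    while (waiting.length > 0 && this.inFlight.get(drive) < this.maxPerDrive) {
      const { task, resolve, reject } = waiting.shift();
      this.inFlight.set(drive, this.inFlight.get(drive) + 1);
      task().then(resolve, reject).finally(() => {
        this.inFlight.set(drive, this.inFlight.get(drive) - 1);
        this._drain(drive);
      });
    }
  }
}
```

You would call something like `queue.enqueue('A', () => fs.promises.copyFile(src, dest))` for each operation, tagging it with whichever drive it touches.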

    ORIGINAL ANSWER

    (some of which still applies)

    Without worker threads, you will have a single libuv thread pool serving all disk I/O requests. They all go into the same pool, and once the threads in that pool are busy (regardless of which disk they are serving), new requests are queued in the order they arrive. This is potentially less than ideal: if you have 5 requests for drive A and 1 each for drives B and C, you don't want the pool filled with the 5 drive-A requests first, because the drive B and C requests would then wait for several drive-A requests to finish before they could even start. That loses opportunities for parallelism across the separate drives. Of course, whether you truly get parallelism on separate drives also depends on the drive controller implementation and whether the drives actually have separate SATA controllers.

    If you did use worker threads, with one nodejs worker thread per disk, you could at least guarantee a separate pool of OS threads for each disk, making it much less likely that a pile of requests for one drive keeps requests for the other drives from getting a chance to start and run in parallel.

    Now, of course, all of this discussion is theoretical. With disk drives, controller cards, an operating system on top of the controllers, libuv on top of that, and nodejs on top of that, there are lots of opportunities for the theory not to bear out in real-world measurements.

    So, the only way to really know is to implement the worker thread option and benchmark it against the non-worker-thread option under several different disk-usage scenarios, including a couple you think might be worst case. As with any important performance question, you will inevitably have to benchmark and measure to know one way or the other. And your benchmark tests will need to be constructed very carefully to be maximally useful.