I've read in the dask distributed documentation that:
Worker and Scheduler nodes operate concurrently. They serve several overlapping requests and perform several overlapping computations at the same time without blocking.
I've always thought single-threaded concurrent programming is best suited for I/O-bound jobs, not CPU-bound ones. However, I expect many dask tasks (e.g. dask.pandas, dask.array) to be CPU-intensive.
Does distributed use Tornado only for client/server communication, with separate processes/threads to run the dask tasks? dask-worker does have --nprocs and --nthreads arguments, so I expect this to be the case.
How does concurrency via Tornado coroutines coexist with the more conventional processes/threads that execute each dask task in distributed?
You are correct.
Each distributed.Worker object contains a concurrent.futures.ThreadPoolExecutor with multiple threads. Tasks are run on this ThreadPoolExecutor for parallel performance. All communication and coordination tasks are managed by the Tornado IOLoop.
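To make the division of labor concrete, here is a minimal sketch of that pattern (not dask's actual internals): a Tornado coroutine offloads CPU-bound work to a ThreadPoolExecutor via run_in_executor, so the IOLoop stays free for communication while a thread does the computing. The function names here are hypothetical.

import concurrent.futures
from tornado.ioloop import IOLoop

# Thread pool playing the role of the worker's executor.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def cpu_bound_task(n):
    # Stand-in for a CPU-heavy dask task (e.g. a numpy/pandas chunk operation).
    return sum(i * i for i in range(n))

async def handle_compute_request(n):
    # The coroutine yields to the IOLoop until the thread finishes, so other
    # coroutines (heartbeats, data transfers) keep being served meanwhile.
    return await IOLoop.current().run_in_executor(executor, cpu_bound_task, n)

if __name__ == "__main__":
    print(IOLoop.current().run_sync(lambda: handle_compute_request(1_000_000)))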
Generally, this design keeps computation separate from communication and administration: it enables parallel computing within each worker and lets workers respond to server requests even while they are busy computing tasks.
When you make the following call:
dask-worker --nprocs N --nthreads T
It starts N separate distributed.Worker objects in separate Python processes. Each of these workers has a ThreadPoolExecutor with T threads.
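For example (the scheduler address is just a placeholder):

dask-worker tcp://127.0.0.1:8786 --nprocs 3 --nthreads 2

would start three worker processes, each with a two-thread ThreadPoolExecutor, giving six tasks running in parallel on that machine. A roughly equivalent layout can be created from Python with distributed.LocalCluster(n_workers=3, threads_per_worker=2).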