I am new to using #dask for data analysis. I have some questions about how #dask works internally. For example, how does it perform I/O? With HDF5, is parallel HDF5 used in the workers, or does the reading happen somewhere else?
Generally, there are two stages for computing anything in Dask:

1. building the graph of operations, which includes inspecting the file(s) from the client side to determine the number of inputs, chunking, data types, etc., with a minimum of IO;
2. accessing the data chunks from workers, independently and in parallel.
The bulk of the IO happens in the workers.
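To illustrate the two stages, here is a minimal, self-contained sketch of the model (not Dask itself): Dask graphs really are plain dicts mapping keys to `(function, *args)` tuples, but the `read_chunk` function, the `"data.h5"` path, and the thread-pool "workers" below are all stand-ins invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical chunk reader standing in for e.g. an h5py read on a worker.
def read_chunk(path, start, stop):
    # A real worker would open the file and read only its own slice;
    # here we just simulate the data.
    return list(range(start, stop))

# Stage 1 (client side): build a Dask-style task graph -- a plain dict
# mapping keys to (function, *args) tuples. Only lightweight metadata
# inspection happens at this point; no bulk IO.
graph = {
    ("chunk", i): (read_chunk, "data.h5", i * 100, (i + 1) * 100)
    for i in range(4)
}

# Stage 2 (workers): execute the chunk tasks independently and in
# parallel; this is where the bulk of the IO happens.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {k: pool.submit(fn, *args) for k, (fn, *args) in graph.items()}
    results = {k: f.result() for k, f in futures.items()}

total_rows = sum(len(v) for v in results.values())
```

In real Dask, the scheduler decides which worker runs which chunk task, but the shape is the same: a cheap graph-building step on the client, then parallel per-chunk reads on the workers.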
Exactly what happens in your computation will depend on what you are doing and the data you work with. Note that some file formats are more readily accessed in parallel or on cloud/distributed systems than others.