I read the following sentence on Dask's website and wonder what it means. I have extracted the relevant part below for ease of reference:
A common performance problem among Dask Array users is that they have chosen a chunk size that is either too small (leading to lots of overhead) or poorly aligned with their data (leading to inefficient reading).
While optimal sizes and shapes are highly problem specific, it is rare to see chunk sizes below 100 MB in size. If you are dealing with float64 data then this is around (4000, 4000) in size for a 2D array or (100, 400, 400) for a 3D array.
You want to choose a chunk size that is large in order to reduce the number of chunks that Dask has to think about (which affects overhead) but also small enough so that many of them can fit in memory at once. Dask will often have as many chunks in memory as twice the number of active threads.
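The quoted sizes are easy to verify with a little arithmetic: a chunk's footprint is the product of its shape times the dtype's item size. Here is a small sketch (the helper name `chunk_size_mb` is my own, not from the docs) checking the shapes the documentation mentions:

```python
import numpy as np

def chunk_size_mb(shape, dtype=np.float64):
    """Size in MB of one chunk with the given shape and dtype."""
    return np.prod(shape) * np.dtype(dtype).itemsize / 1e6

# Both shapes from the docs land near the ~100 MB guideline for float64:
print(chunk_size_mb((4000, 4000)))     # 2D chunk: 128.0 MB
print(chunk_size_mb((100, 400, 400)))  # 3D chunk: 128.0 MB
```

So "around (4000, 4000)" is shorthand for "a shape whose float64 footprint is on the order of 100 MB", not an exact requirement.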
Does it mean that the same chunk will co-exist on the mother node (or process, or thread?) and the child node? Isn't it unnecessary to hold the same chunk twice?
PS: I don't quite understand the difference among node, process, and thread, so I just put all of them there.
In many cases, a Dask graph will involve many more chunks than there are threads. The quoted sentence is not about the same chunk existing twice; it is noting that multiple *distinct* chunks per worker might be in memory at the same time. For example, in the job:
import dask.array

avg = dask.array.random.random(
    size=(1000, 1000, 1000), chunks=(10, 1000, 1000)
).mean().compute()
there are 100 chunks, each of which is ~80 MB in size, and you should anticipate roughly 80 MB * 2 * (number of active threads) to be in memory at once, per the docs' "twice the number of active threads" rule of thumb.
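The figures in that example can be checked directly. This is a quick back-of-the-envelope sketch (pure NumPy arithmetic, no Dask cluster needed) of where the "100 chunks of ~80 MB" comes from:

```python
import numpy as np

# The example array: (1000, 1000, 1000) float64, chunked as (10, 1000, 1000).
shape = (1000, 1000, 1000)
chunks = (10, 1000, 1000)

# Number of chunks: the array is split only along the first axis here.
n_chunks = int(np.prod([s // c for s, c in zip(shape, chunks)]))

# Footprint of one chunk in MB (float64 is 8 bytes per element).
chunk_mb = np.prod(chunks) * np.dtype(np.float64).itemsize / 1e6

print(n_chunks)  # 100 chunks
print(chunk_mb)  # 80.0 MB each
```

With, say, 4 active threads on a worker, the rule of thumb suggests on the order of 8 of those chunks (~640 MB) resident at once, which is why chunk size matters for memory as well as for scheduler overhead.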