python numpy dask dask-distributed

Why does Dask show a smaller size than the actual size of the data (NumPy array)?


Dask shows a slightly smaller size than the actual size of a NumPy array. Here is an example of a NumPy array that is exactly 32 MB (32,000,000 bytes):

import dask as da
import dask.array
import numpy as np

shape = (1000,4000)
ones_np = np.ones(shape)
print(f"Size: {ones_np.nbytes / 1e6} MB")
>> Size: 32.0 MB

However, the Dask array display shows 30.52:

ones_da = da.array.ones(shape)
ones_da


However, ones_da.nbytes / 1e6 returns the correct size (32 MB).

Shouldn't the size that Dask displays for an array match its actual size?


Solution

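  • The function responsible is here in dask/utils (permalink), and it only supports powers of 2, not powers of 10. The displayed value is therefore 32,000,000 / 2^20 ≈ 30.52 (mebibytes), while dividing nbytes by 1e6 gives 32 (megabytes); the array itself is the same size in both cases. This is in contrast to the time units immediately below it. You could ask for this to be made configurable, but someone would have to put in a little work.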
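
To make the difference concrete, here is a minimal sketch of the two conventions using plain Python. Both helpers, `format_decimal` and `format_binary`, are hypothetical names written for illustration; they are not Dask's code. I believe the Dask function in question is `dask.utils.format_bytes`, but treat that as an assumption and check it against your installed version.

```python
# A float64 array with shape (1000, 4000) uses 8 bytes per element.
nbytes = 1000 * 4000 * 8  # 32_000_000 bytes


def format_decimal(n):
    """Hypothetical helper: format a byte count with powers of 10 (kB, MB, GB)."""
    for unit, factor in (("GB", 10**9), ("MB", 10**6), ("kB", 10**3)):
        if n >= factor:
            return f"{n / factor:.2f} {unit}"
    return f"{n} B"


def format_binary(n):
    """Hypothetical helper: format a byte count with powers of 2 (KiB, MiB, GiB)."""
    for unit, factor in (("GiB", 2**30), ("MiB", 2**20), ("KiB", 2**10)):
        if n >= factor:
            return f"{n / factor:.2f} {unit}"
    return f"{n} B"


print(format_decimal(nbytes))  # 32.00 MB
print(format_binary(nbytes))   # 30.52 MiB
```

The same 32,000,000 bytes reads as 32.00 MB in decimal units and 30.52 MiB in binary units, so nothing is lost from the array; only the unit differs.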