Search code examples
pythoncsvbigdatadask

Dask object memory size larger than the file size?


I have a csv file that is 15Gb in size according to du -sh filename.txt. However, when I load the file to dask, the dask array is almost 4 times larger at 55Gb. Is this normal? Here is how I am loading the file.

cluster = LocalCluster()  # Launches a scheduler and workers locally
client = Client(cluster)  # Connect to distributed cluster and override default@delayed
input_file_name = 'filename.txt'

@delayed
def load_file(fname, dtypes=dtypes):
    ddf = dd.read_csv(input_file_name, sep='\t', dtype=dtypes) #dytpes is dict of {colnames:bool}
    arr = ddf.to_dask_array(lengths=True)
    return arr
result = load_file(input_file_name)

arr = result.compute()
arr

Array Chunk
Bytes 54.58 GiB 245.18 MiB
Shape (1787307, 4099) (7840, 4099)
Count 456 Tasks 228 Chunks
Type object numpy.ndarray


I wasn't expecting the dask array to be so much larger than the input file size.

The file is contains binary values, so I tried passing bool dtype to see if it will shrink in size but I see no difference.


Solution

  • Found the answer- it was right there on the output of arr where it says the Type -> 'object'. Looks like converting from dask dataframe to array using arr = ddf.to_dask_array(lengths=True) does not preserve the bool object type. I was able to reduce the memory load significantly by explicitly casting it as bool type again.

    # load the dask dataframe and convert to array
    @delayed
    def load_to_arr(fname, dt):
        ddf = dd.read_csv(fname, sep='\t', dtype=dt)
        arr = unitig_ddf.to_dask_array(lengths=True)
        arr = unitig_arr.astype(bool, copy=False) # recast as bool array
        return unitig_arr
    result = load_to_arr(input_file_name, dt=dtypes)
    arr = result.compute()
    arr
    

    This now gives an object that is of type bool as size 6Gb, which is more reasonable.

    Array Chunk
    Bytes 6.82 GiB 30.65 MiB
    Shape (1787307, 4099) (7840, 4099)
    Count 456 Tasks 228 Chunks
    Type bool numpy.ndarray