I have a CSV (tab-separated) file that is 15 GB on disk according to du -sh filename.txt. However, when I load the file with dask, the resulting dask array is almost four times larger, at about 55 GB. Is this normal? Here is how I am loading the file:
from dask import delayed
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

cluster = LocalCluster()  # Launches a scheduler and workers locally
client = Client(cluster)  # Connect to distributed cluster and override default

input_file_name = 'filename.txt'

@delayed
def load_file(fname, dtypes=dtypes):
    ddf = dd.read_csv(fname, sep='\t', dtype=dtypes)  # dtypes is a dict of {colname: bool}
    arr = ddf.to_dask_array(lengths=True)
    return arr

result = load_file(input_file_name)
arr = result.compute()
arr
|  | Array | Chunk |
|---|---|---|
| Bytes | 54.58 GiB | 245.18 MiB |
| Shape | (1787307, 4099) | (7840, 4099) |
| Count | 456 Tasks | 228 Chunks |
| Type | object | numpy.ndarray |
I wasn't expecting the dask array to be so much larger than the input file. The file contains binary values, so I tried passing a bool dtype to see if it would shrink, but I see no difference.
Found the answer: it was right there in the output of arr, where the Type says 'object'. It turns out that converting the dask dataframe to an array with arr = ddf.to_dask_array(lengths=True) does not preserve the bool dtype. I was able to reduce the memory footprint significantly by explicitly casting the array back to bool.
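For scale, an object array stores an 8-byte pointer per element (on top of the Python objects those pointers reference), while a bool array stores a single byte per element, which lines up with the drop from ~55 GiB to ~7 GiB. A minimal numpy sketch, using made-up shapes rather than the real data, just to illustrate the per-element cost:

import numpy as np

# Illustrative shapes only: compare per-element memory of object vs bool dtype
obj_arr = np.zeros((1000, 4099), dtype=object)  # 8-byte pointers per element on a 64-bit build
bool_arr = np.zeros((1000, 4099), dtype=bool)   # 1 byte per element
print(obj_arr.nbytes)   # 32792000 bytes (pointers only, excludes the pointed-to Python objects)
print(bool_arr.nbytes)  # 4099000 bytes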
# load the dask dataframe and convert to an array
@delayed
def load_to_arr(fname, dt):
    ddf = dd.read_csv(fname, sep='\t', dtype=dt)
    arr = ddf.to_dask_array(lengths=True)
    arr = arr.astype(bool, copy=False)  # recast as a bool array
    return arr

result = load_to_arr(input_file_name, dt=dtypes)
arr = result.compute()
arr
This now gives an array with dtype bool and a size of about 6.8 GiB, which is much more reasonable.
|  | Array | Chunk |
|---|---|---|
| Bytes | 6.82 GiB | 30.65 MiB |
| Shape | (1787307, 4099) | (7840, 4099) |
| Count | 456 Tasks | 228 Chunks |
| Type | bool | numpy.ndarray |
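As a side note, dd.read_csv is already lazy, so the @delayed wrapper isn't strictly needed; a minimal sketch of the same load without it (assuming the same dtypes dict of {colname: bool}) could look like this:

import dask.dataframe as dd

# sketch only: dask dataframes are lazy, so no @delayed wrapper is required
ddf = dd.read_csv(input_file_name, sep='\t', dtype=dtypes)
arr = ddf.to_dask_array(lengths=True).astype(bool, copy=False)  # recast as bool
arr  # the repr should now report dtype bool and roughly 6.8 GiB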