python · blaze · bcolz

data size blows out when storing in bcolz


I have a dataset with ~7M rows and 3 columns: two numeric and one consisting of ~20M distinct string UUIDs. The data takes around 3 GB as a CSV file, and castra can store it in about 2 GB. I would like to test out bcolz with this data.

I tried

import dask.dataframe
from odo import odo

odo(dask.dataframe.from_castra('data.castra'), 'data.bcolz')

which generated ~70 GB of data before exhausting the inodes on the disk and crashing.

What is the recommended way to get such a dataset into bcolz?


Solution

  • From Killian Mie on the bcolz mailing list:

    Read the CSV in chunks via pandas.read_csv(), convert your string column from the Python object dtype to a fixed-length numpy dtype, say 'S20', then append it to the ctable as a numpy array (a chunked-read sketch follows the code below).

    Also, set chunklen=1000000 (or similar) when creating the ctable; this avoids creating hundreds of files under the /data folder (though it is probably not optimal for compression).

    These two steps worked well for me (20 million rows, 40-60 columns).

    Try this:

    import bcolz
    import dask.dataframe as ddf
    import numpy as np
    import odo
    import pandas as pd

    # Load the castra dataset into a single pandas DataFrame.
    df0 = ddf.from_castra("data.castra")
    df = odo.odo(df0, pd.DataFrame)
    names = df.columns.tolist()
    types = ['float32', 'float32', 'S20']  # adjust 'S20' to your max string length needs
    # One carray per column; the string column becomes fixed-length bytes.
    cols = [bcolz.carray(df[c].values, dtype=dt) for c, dt in zip(names, types)]

    # Empty on-disk ctable; a large chunklen keeps the number of files down.
    ct = bcolz.zeros(0, dtype=np.dtype(list(zip(names, types))),
                     mode='w', chunklen=1000000,
                     rootdir="data.bcolz")
    ct.append(cols)
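
    For the chunked pandas.read_csv() route suggested above, here is a minimal sketch; the 'data.csv' path, the column names ('a', 'b', 'uuid'), and the chunk size are illustrative assumptions, not part of the original answer:

    import bcolz
    import numpy as np
    import pandas as pd

    names = ['a', 'b', 'uuid']             # hypothetical column names
    types = ['float32', 'float32', 'S20']  # size 'S20' to your longest uuid
    dt = np.dtype(list(zip(names, types)))

    # Create the empty on-disk ctable once; a large chunklen keeps the file count down.
    ct = bcolz.zeros(0, dtype=dt, mode='w', chunklen=1000000, rootdir='data.bcolz')

    for chunk in pd.read_csv('data.csv', chunksize=1000000):
        # Build a structured array for this chunk, casting the string
        # column from object dtype to fixed-length bytes.
        rec = np.empty(len(chunk), dtype=dt)
        rec['a'] = chunk['a'].values.astype('float32')
        rec['b'] = chunk['b'].values.astype('float32')
        rec['uuid'] = chunk['uuid'].values.astype('S20')
        ct.append(rec)

    ct.flush()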