pandas dataframe parquet dask fastparquet

Dask DataFrame to_parquet return bytes instead of writing to file


Is it possible to write a dask/pandas DataFrame to parquet and then return it as a bytes string? I know that this is not possible with the to_parquet() function, which accepts a file path. Maybe you know of some other way to do it. If there is no way to do something like this, would it make sense to add such functionality? Ideally, it would look like this:

parquet_bytes = df.to_parquet() # bytes string is returned

Thanks!


Solution

  • There has been work undertaken to allow such a thing, but it's not currently a one-line thing like you suggest.

    Firstly, if you have data which can fit in memory, you can use fastparquet's write() function and supply an open_with= argument. This must be a function that creates a file-like object in binary-write mode; in your case, one that returns a BytesIO() would do.
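
As a rough illustration of the open_with approach, the sketch below collects whatever fastparquet writes in a dictionary of in-memory buffers. The names KeepOpenBytesIO, make_opener and dataframe_to_parquet_bytes are made up for this example. It assumes fastparquet opens the target inside a `with` block, in which case a plain BytesIO would be closed (and its contents discarded) before you could read them back, hence the no-op close().

```python
import io


class KeepOpenBytesIO(io.BytesIO):
    """BytesIO whose close() is a no-op, so the data survives a `with` block."""

    def close(self):
        pass


def make_opener(store):
    """Build an open_with-style callable that keeps files in the dict `store`."""

    def open_with(path, mode="rb"):
        if "w" in mode:
            # Start a fresh in-memory file for every write request.
            store[path] = KeepOpenBytesIO()
        buffer = store[path]
        buffer.seek(0)
        return buffer

    return open_with


def dataframe_to_parquet_bytes(df, name="in-memory.parquet"):
    """Write a pandas DataFrame with fastparquet and return the file as bytes."""
    from fastparquet import write  # imported lazily; requires fastparquet

    store = {}
    write(name, df, open_with=make_opener(store))
    return store[name].getvalue()
```

With this in place, `parquet_bytes = dataframe_to_parquet_bytes(df)` would give roughly the one-liner asked for in the question, as long as the default single-file layout is used.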

    To make this work directly with dask, you could make use of the MemoryFileSystem from the filesystem_spec (fsspec) project. You would need to register the class with Dask and write as follows:

    import dask.bytes.core
    import fsspec.implementations.memory
    dask.bytes.core._filesystems['memory'] = fsspec.implementations.memory.MemoryFileSystem
    df.to_parquet('memory://name.parquet')
    

    When done, MemoryFileSystem.store, which is a class attribute, will map filename-like keys to BytesIO objects containing the written data.
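
For the Dask route, a minimal sketch of pulling the bytes back out could look like the following. It assumes a Dask version that resolves memory:// URLs through fsspec on its own, so no manual registration is shown. The helper names read_memory_files and dask_parquet_to_bytes are hypothetical, and the files are read through the public fsspec calls find() and cat_file() rather than by touching MemoryFileSystem.store directly.

```python
def read_memory_files(fs, root):
    """Return {path: bytes} for every file under `root` on an fsspec-style filesystem."""
    return {path: fs.cat_file(path) for path in fs.find(root)}


def dask_parquet_to_bytes(ddf, root="memory://name.parquet"):
    """Write a Dask DataFrame to the in-memory filesystem and collect its files."""
    import fsspec  # imported lazily; requires fsspec and a parquet engine

    ddf.to_parquet(root)
    # The memory filesystem shares one global store, so this instance sees
    # the files that Dask has just written.
    fs = fsspec.filesystem("memory")
    return read_memory_files(fs, root)
```

Because Dask writes one file per partition (plus optional metadata files), the result is a dictionary of byte strings rather than a single value.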