Search code examples
pythonpandasdataframefeather

Serialize Pandas DataFrame to in-memory buffer representation


What is the fastest way to serialize a DataFrame to an in-memory representation? Based on some research, it seems to be widely acknowledged that the Apache Feather format is the fastest available format by most metrics.

My goal is to get the serialized bytes of a DataFrame - the only issue with Feather is that I would like to avoid the overhead of writing to and loading from disk, and the Feather API seems to only allow File I/O. Is there a different format I should be looking into for this, or is there perhaps a way in Python to "fake" a file, forcing Feather to write to an in-memory buffer instead?


Solution

  • pyarrow provides BufferOutputStream for writing into memory instead of files. In constrast to the docstring, read_feather and write_feather also support reading from memory / writing into a writer interface.

    With the following code, you can serialise a DataFrame into memory without going to the filesystem and then directly reconstruct it again.

    from pyarrow.feather import read_feather, write_feather
    import pandas as pd
    import pyarrow as pa
    
    df = pd.DataFrame({"column": [1, 2]})
    output_stream = pa.BufferOutputStream()
    write_feather(df, output_stream)
    df_reconstructed = read_feather(output_stream.getvalue())