What is the fastest way to serialize a DataFrame to an in-memory representation? Based on some research, it seems to be widely acknowledged that the Apache Feather format is the fastest available format by most metrics.
My goal is to get the serialized bytes of a DataFrame - the only issue with Feather is that I would like to avoid the overhead of writing to and loading from disk, and the Feather API seems to only allow File I/O. Is there a different format I should be looking into for this, or is there perhaps a way in Python to "fake" a file, forcing Feather to write to an in-memory buffer instead?
pyarrow
provides BufferOutputStream
for writing into memory instead of files. In constrast to the docstring, read_feather
and write_feather
also support reading from memory / writing into a writer interface.
With the following code, you can serialise a DataFrame into memory without going to the filesystem and then directly reconstruct it again.
from pyarrow.feather import read_feather, write_feather
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({"column": [1, 2]})
output_stream = pa.BufferOutputStream()
write_feather(df, output_stream)
df_reconstructed = read_feather(output_stream.getvalue())