Tags: python, serialization, numpy, pandas, pickle

What is the fastest way to serialize a DataFrame besides to_pickle?


I need to serialize DataFrames and send them over the wire. For security reasons, I cannot use pickle.

What would be the next fastest way to do this? I was intrigued by the msgpack support in v0.13, but unless I'm doing something wrong, its performance seems much worse than pickle's.

In [107]: from pandas.io.packers import pack

In [108]: df = pd.DataFrame(np.random.rand(1000, 100))

In [109]: %timeit buf = pack(df)
100 loops, best of 3: 15.5 ms per loop

In [110]: import pickle

In [111]: %timeit buf = pickle.dumps(df)
1000 loops, best of 3: 241 µs per loop

The best I've found so far is serializing the homogeneous NumPy blocks (df.as_blocks() was handy here) with array.tostring() and rebuilding the DataFrame from them. The performance is comparable to pickle.

However, with this approach I am forced to convert columns of dtype=object (i.e., anything containing at least one string) entirely to strings, since NumPy's fromstring() cannot deserialize dtype=object. Pickle manages to preserve mixed types in object columns (it seems to embed some function in the pickle output).
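For reference, the raw-buffer round trip described above can be sketched like this for a homogeneous numeric frame. The helper names df_to_bytes/df_from_bytes are hypothetical, and tobytes()/frombuffer() are used as the modern spellings of tostring()/fromstring():

```python
import numpy as np
import pandas as pd

def df_to_bytes(df):
    # Only valid for a frame whose values form one homogeneous block.
    values = np.ascontiguousarray(df.values)
    # Ship dtype/shape/labels alongside the raw buffer so the
    # receiver can rebuild the frame.
    meta = (str(values.dtype), values.shape,
            list(df.columns), list(df.index))
    return meta, values.tobytes()

def df_from_bytes(meta, buf):
    dtype, shape, columns, index = meta
    values = np.frombuffer(buf, dtype=dtype).reshape(shape)
    return pd.DataFrame(values, columns=columns, index=index)

df = pd.DataFrame(np.random.rand(4, 3))
meta, buf = df_to_bytes(df)
df2 = df_from_bytes(meta, buf)
```

As noted, this breaks down for dtype=object columns, where pickle's ability to store arbitrary Python objects has no NumPy-buffer equivalent.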


Solution

  • This is now pretty competitive with this PR: https://github.com/pydata/pandas/pull/5498 (going to merge for 0.13 shortly)

    In [1]: from pandas.io.packers import pack
    
    In [2]: import cPickle as pkl
    
    In [3]: df = pd.DataFrame(np.random.rand(1000, 100))
    

    Above example

    In [6]: %timeit buf = pack(df)
    1000 loops, best of 3: 492 µs per loop
    
    In [7]: %timeit buf = pkl.dumps(df,pkl.HIGHEST_PROTOCOL)
    1000 loops, best of 3: 681 µs per loop
    

    Much bigger frame

    In [8]: df = pd.DataFrame(np.random.rand(100000, 100))
    
    In [9]:  %timeit buf = pack(df)
    10 loops, best of 3: 192 ms per loop
    
    In [10]: %timeit buf = pkl.dumps(df,pkl.HIGHEST_PROTOCOL)
    10 loops, best of 3: 119 ms per loop
    

    Another option is to use an in-memory HDF5 file.

    See here: http://pytables.github.io/cookbook/inmemory_hdf5_files.html; there is no support in pandas yet for passing the driver argument (though it could be added pretty simply by monkey-patching).

    Another possibility is a ctable; see https://github.com/FrancescAlted/carray. It is not supported in pandas yet, though.
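    Since pandas can't pass the driver argument yet, the in-memory HDF5 route can be used through PyTables directly. A minimal sketch, assuming PyTables is installed (the CORE driver with no backing store keeps everything in RAM, and get_file_image() yields the file as bytes suitable for sending over the wire):

    ```python
    import numpy as np
    import tables  # PyTables

    # Sender: build an HDF5 file purely in memory (no disk I/O).
    h5 = tables.open_file("inmem.h5", "w",
                          driver="H5FD_CORE",
                          driver_core_backing_store=0)
    arr = np.random.rand(1000, 100)
    h5.create_array(h5.root, "values", arr)
    h5.flush()
    image = h5.get_file_image()  # the whole HDF5 file as a bytes object
    h5.close()

    # Receiver: reopen the file directly from the byte image.
    h5r = tables.open_file("inmem.h5", "r",
                           driver="H5FD_CORE",
                           driver_core_image=image,
                           driver_core_backing_store=0)
    restored = h5r.root.values.read()
    h5r.close()
    ```

    This only stores the array values; column labels and index would have to be shipped separately (or as additional arrays), much like the raw-buffer approach in the question.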