Search code examples
pythonpandasnumpypickle

Performance optimal way to serialise Python objects containing large Pandas DataFrames


I am dealing with Python objects containing Pandas DataFrame and Numpy Series objects. These can be large, several millions of rows.

E.g.


@dataclass
class MyWorld:
     # A lot of DataFrames with millions of rows
     samples: pd.DataFrame 
     addresses: pd.DataFrame 
     # etc.

I need to cache these objects, and I am hoping to find an efficient and painless way to serialise them, instead of standard pickle.dump(). Are there any specialised Python serialisers for such objects that would pickle Series data with some efficient codec and compression automatically? Alternatively, I need to hand construct several Parquet files, but that requires a lot of more manual code to deal with this, and I'd rather avoid that if possible.

Performance here may mean

  • Speed
  • File size (can be related, as you need to read less from the disk/network)

I am aware of joblib.dump() which does some magic for these kind of objects, but based on the documentation I am not sure if this is relevant anymore.


Solution

  • What about storing huge structures in parquet format while pickling it, this can be automated easily:

    import io
    from dataclasses import dataclass
    import pickle
    import numpy as np
    import pandas as pd
    
    @dataclass
    class MyWorld:
        
        array: np.ndarray
        series: pd.Series
        frame: pd.DataFrame
    
    @dataclass
    class MyWorldParquet:
        
        array: np.ndarray
        series: pd.Series
        frame: pd.DataFrame
            
        def __getstate__(self):
    
            for key, value in self.__annotations__.items():
                
                if value is np.ndarray:
                    self.__dict__[key] = pd.DataFrame({"_": self.__dict__[key]})
                
                if value is pd.Series:
                    self.__dict__[key] = self.__dict__[key].to_frame()
            
                stream = io.BytesIO()
                self.__dict__[key].to_parquet(stream)
                
                self.__dict__[key] = stream
            
            return self.__dict__
    
        def __setstate__(self, data):
            
            self.__dict__.update(data)
            
            for key, value in self.__annotations__.items():
            
                self.__dict__[key] = pd.read_parquet(self.__dict__[key])
                
                if value is np.ndarray:
                    self.__dict__[key] = self.__dict__[key]["_"].values
                
                if value is pd.Series:
                    self.__dict__[key] = self.__dict__[key][self.__dict__[key].columns[0]]
    

    Off course we will have some trade off between performance and volumetry as reducing the second requires format translation and compression.

    Lets create a toy dataset:

    N = 5_000_000
    data = {
        "array": np.random.normal(size=N),
        "series": pd.Series(np.random.uniform(size=N), name="w"),
        "frame": pd.DataFrame({
            "c": np.random.choice(["label-1", "label-2", "label-3"], size=N),
            "x": np.random.uniform(size=N),
            "y": np.random.normal(size=N)
        })
    }
    

    We can compare the parquet conversion trade off (about 300 ms more):

    %timeit -r 10 -n 1 pickle.dumps(MyWorld(**data))
    # 1.57 s ± 162 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
    
    %timeit -r 10 -n 1 pickle.dumps(MyWorldParquet(**data))
    # 1.9 s ± 71.3 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
    

    And the volumetry gain (about 40 Mb spared):

    len(pickle.dumps(MyWorld(**data))) / 2 ** 20
    # 200.28876972198486
    
    len(pickle.dumps(MyWorldParquet(**data))) / 2 ** 20
    # 159.13739013671875
    

    Indeed those metrics will strongly depends on the actual dataset to be serialized.