
Save/load pandas dataframe with custom attributes


I have a pandas.DataFrame to which I've attached some meta information in the form of an attribute. I'd like to save and restore df with this attribute intact, but it gets erased in the saving process:

import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

df.my_attribute = 'can I recover this attribute after saving?'
df.to_pickle('test.pkl')
new_df = pd.read_pickle('test.pkl')
new_df.my_attribute

# AttributeError: 'DataFrame' object has no attribute 'my_attribute'

Other file formats appear to be worse: csv and json discard type, index or column information if you're not careful. Maybe create a new class that extends DataFrame? Open to ideas.


Solution

  • There is no universal standard here, or anything close to one, but there are a few options:

    1) General advice: I wouldn't use pickle for anything but the shortest-term serialization (less than a day, say).

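    2) Arbitrary metadata can be packed into two of the binary formats pandas supports, msgpack and HDF5, albeit in an ad-hoc way. You could also do this with CSV, etc., but it becomes even more ad hoc. Note that msgpack support was deprecated in pandas 0.25 and removed in 1.0, so the msgpack snippet only runs on older versions.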

    # msgpack
    data = {'df': df, 'my_attribute': df.my_attribute}
    pd.to_msgpack('tmp.msg', data)
    pd.read_msgpack('tmp.msg')['my_attribute']
    # Out[70]: 'can I recover this attribute after saving?'
    
    # hdf
    with pd.HDFStore('tmp.h5') as store:
        store.put('df', df)
        store.get_storer('df').attrs.my_attribute = df.my_attribute    
    with pd.HDFStore('tmp.h5') as store:
        df = store.get('df')
        df.my_attribute = store.get_storer('df').attrs.my_attribute
    
    df.my_attribute
    # Out[79]: 'can I recover this attribute after saving?'
    
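The CSV variant mentioned in option 2 might look something like the sketch below. It assumes the metadata is JSON-serializable and keeps it in a sidecar file next to the CSV. The helper names `save_with_metadata` and `load_with_metadata`, the `.meta.json` suffix, and the small stand-in frame are hypothetical, made up for illustration rather than taken from pandas.

```python
import json
import os
import tempfile

import pandas as pd


def save_with_metadata(df, csv_path, metadata):
    """Hypothetical helper: write df to CSV and metadata to a JSON sidecar file."""
    df.to_csv(csv_path)  # the index is written as the first column
    with open(csv_path + ".meta.json", "w") as fh:
        json.dump(metadata, fh)


def load_with_metadata(csv_path):
    """Hypothetical helper: read the CSV and its JSON sidecar back."""
    df = pd.read_csv(csv_path, index_col=0)
    with open(csv_path + ".meta.json") as fh:
        metadata = json.load(fh)
    return df, metadata


# Small stand-in for the iris frame used in the question.
df = pd.DataFrame({"sepal length (cm)": [5.1, 4.9], "sepal width (cm)": [3.5, 3.0]})

path = os.path.join(tempfile.mkdtemp(), "iris.csv")
save_with_metadata(df, path, {"my_attribute": "can I recover this attribute after saving?"})

new_df, meta = load_with_metadata(path)
new_df.my_attribute = meta["my_attribute"]  # re-attach by hand after loading
print(new_df.my_attribute)
```

As with the HDF5 snippet, the attribute has to be re-attached manually after loading, and CSV still drops dtype details unless you pass them to `read_csv` yourself.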

    3) xarray, an n-dimensional extension of pandas, supports storing to the NetCDF file format, which has a more built-in notion of metadata.

    import xarray
    ds = xarray.Dataset.from_dataframe(df)
    ds.attrs['my_attribute'] = df.my_attribute
    
    ds.to_netcdf('test.cdf')
    ds = xarray.open_dataset('test.cdf')
    ds
    Out[8]: 
    <xarray.Dataset>
    Dimensions:            (index: 150)
    Coordinates:
      * index              (index) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
    Data variables:
        sepal length (cm)  (index) float64 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 ...
        sepal width (cm)   (index) float64 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 ...
        petal length (cm)  (index) float64 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 ...
        petal width (cm)   (index) float64 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 ...
    Attributes:
        my_attribute:  can I recover this attribute after saving?
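
To get back to a DataFrame from the xarray route, one possible round trip is sketched below. It assumes xarray and a NetCDF backend (netCDF4 or scipy) are installed, and it uses a small stand-in frame with simplified column names instead of the iris data. The temporary path is illustrative.

```python
import os
import tempfile

import pandas as pd
import xarray

# Small stand-in for the iris frame; column names simplified for illustration.
df = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7], "sepal_width": [3.5, 3.0, 3.2]})

# Store the metadata in the Dataset's attrs, then write everything to NetCDF.
ds = xarray.Dataset.from_dataframe(df)
ds.attrs["my_attribute"] = "can I recover this attribute after saving?"
path = os.path.join(tempfile.mkdtemp(), "test.cdf")
ds.to_netcdf(path)

# Read the file back, convert to a DataFrame, and pull the attribute out.
with xarray.open_dataset(path) as loaded:
    new_df = loaded.to_dataframe()
    recovered = loaded.attrs["my_attribute"]

new_df.my_attribute = recovered  # re-attach by hand, as in the HDF5 example
print(new_df.my_attribute)
```

The metadata lives in the file itself rather than on the DataFrame, so the last step is still manual, but nothing has to be tracked outside the NetCDF file.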