I have a pandas.DataFrame to which I've attached some meta information in the form of an attribute. I'd like to save and restore df with this attribute intact, but it gets erased in the saving process:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.my_attribute = 'can I recover this attribute after saving?'
df.to_pickle('test.pkl')
new_df = pd.read_pickle('test.pkl')
new_df.my_attribute
# AttributeError: 'DataFrame' object has no attribute 'my_attribute'
Other file formats appear to be worse: csv and json discard type, index, or column information if you're not careful. Maybe create a new class that extends DataFrame? Open to ideas.
There is no universal standard here, or anything close to one, but there are a few options:
1) General advice: I wouldn't use pickle for anything but the shortest-term serialization (say, less than a day).
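2) Arbitrary metadata can be packed into two of the binary formats pandas supports, msgpack and HDF5, albeit in an ad hoc way. You could also do this with CSV, etc., but it becomes even more ad hoc. Note that msgpack support was deprecated in pandas 0.25 and removed in 1.0, so the msgpack example only applies to older versions.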
# msgpack
data = {'df': df, 'my_attribute': df.my_attribute}
pd.to_msgpack('tmp.msg', data)
pd.read_msgpack('tmp.msg')['my_attribute']
# Out[70]: 'can I recover this attribute after saving?'
# hdf
with pd.HDFStore('tmp.h5') as store:
    store.put('df', df)
    # attach the metadata to the stored node's attributes
    store.get_storer('df').attrs.my_attribute = df.my_attribute
with pd.HDFStore('tmp.h5') as store:
    df = store.get('df')
    # copy the stored metadata back onto the restored frame
    df.my_attribute = store.get_storer('df').attrs.my_attribute
df.my_attribute
# Out[79]: 'can I recover this attribute after saving?'
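For the CSV route mentioned in option 2, one ad hoc possibility is to write the metadata as a single JSON line ahead of the CSV body. This is only a minimal sketch: it assumes the metadata is JSON-serializable, and the helper names `save_csv_with_meta` and `load_csv_with_meta` are hypothetical, not part of pandas.

```python
import json
import os
import tempfile

import pandas as pd


def save_csv_with_meta(df, path, meta):
    """Write `meta` as one JSON line, then the CSV body (hypothetical helper)."""
    with open(path, "w", newline="") as fh:
        fh.write(json.dumps(meta) + "\n")
        df.to_csv(fh)


def load_csv_with_meta(path):
    """Read back the JSON line and the CSV body written by save_csv_with_meta."""
    with open(path, newline="") as fh:
        meta = json.loads(fh.readline())  # first line holds the metadata
        df = pd.read_csv(fh, index_col=0)  # the rest is an ordinary CSV
    return df, meta


# Small stand-in for the iris frame used in the question.
df = pd.DataFrame({"sepal length (cm)": [5.1, 4.9], "petal width (cm)": [0.2, 0.2]})
meta = {"my_attribute": "can I recover this attribute after saving?"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    save_csv_with_meta(df, path, meta)
    restored_df, restored_meta = load_csv_with_meta(path)

print(restored_meta["my_attribute"])
```

The resulting file is no longer a plain CSV, so other tools would have to skip the first line, and dtypes are still inferred on read, which is part of why this route is more ad hoc than HDF5.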
3) xarray, an n-dimensional extension of pandas, supports storing to the NetCDF file format, which has a more built-in notion of metadata:
import xarray
ds = xarray.Dataset.from_dataframe(df)
ds.attrs['my_attribute'] = df.my_attribute
ds.to_netcdf('test.cdf')
ds = xarray.open_dataset('test.cdf')
ds
Out[8]:
<xarray.Dataset>
Dimensions:            (index: 150)
Coordinates:
  * index              (index) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
Data variables:
    sepal length (cm)  (index) float64 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 ...
    sepal width (cm)   (index) float64 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 ...
    petal length (cm)  (index) float64 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 ...
    petal width (cm)   (index) float64 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 ...
Attributes:
    my_attribute: can I recover this attribute after saving?
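A different technique from the three options above, and not something this answer originally covered: newer pandas releases (1.0 and later, as far as I know) expose an experimental `DataFrame.attrs` dict for this kind of metadata. Unlike an ad hoc attribute set with `df.my_attribute = ...`, it appears to survive a pickle round trip. The sketch below assumes such a pandas version; since `attrs` is documented as experimental, its behavior may change.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the iris frame used in the question.
df = pd.DataFrame({"sepal length (cm)": [5.1, 4.9], "sepal width (cm)": [3.5, 3.0]})

# Store the metadata in the attrs dict rather than as a plain attribute.
df.attrs["my_attribute"] = "can I recover this attribute after saving?"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.attrs.get("my_attribute"))
```

The advice in option 1 about pickle being suitable only for short-term storage still applies; whether `attrs` is preserved by other writers depends on the format and the pandas version.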