yaml dump of a pandas dataframe

I figured I'd share, since I searched that on SO and couldn't quite find what I needed.

I wanted to dump a pd.DataFrame into a yaml file.

Timestamps should be shown nicely, not as the default:

  date: !!python/object/apply:pandas._libs.tslibs.timestamps.Timestamp
  - 1589241600000000000
  - null
  - null

Also, the output should be correct YaML format, i.e., it should be readable back by yaml.load. The output should be reasonably concise, i.e. preferring the 'flow' format.

As an example, here is some data:

df = pd.DataFrame([
    dict(
        date=pd.Timestamp.now().normalize() - pd.Timedelta('1 day'),
        x=0,
        b='foo',
        c=[1,2,3,4],
        other_t=pd.Timestamp.now(),
    ),
    dict(
        date=pd.Timestamp.now().normalize(),
        x=1,
        b='bar',
        c=list(range(32)),
        other_t=pd.Timestamp.now(),
    ),
]).set_index('date')

Solution

Here is what I came up with. It has some customization of the Dumper to handle Timestamp. The output is more legible, and still valid yaml. Upon loading, yaml recognizes the format of a valid datetime (ISO format, I think), and re-creates those as datetime. In fact, we can read it back into a DataFrame, where these datetimes are automatically converted into Timestamp. After a minor reset of index, we observe that the new df is identical to the original.

import yaml
from yaml import CDumper
from yaml.representer import SafeRepresenter
import datetime


class TSDumper(CDumper):
    pass

def timestamp_representer(dumper, data):
    return SafeRepresenter.represent_datetime(dumper, data.to_pydatetime())

TSDumper.add_representer(datetime.datetime, SafeRepresenter.represent_datetime)
TSDumper.add_representer(pd.Timestamp, timestamp_representer)

With this, now we can do:

text = yaml.dump(
    df.reset_index().to_dict(orient='records'),
    sort_keys=False, width=72, indent=4,
    default_flow_style=None, Dumper=TSDumper,
)
print(text)

The output is relatively clean:

-   date: 2020-05-12 00:00:00
    x: 0
    b: foo
    c: [1, 2, 3, 4]
    other_t: 2020-05-13 02:30:23.422589
-   date: 2020-05-13 00:00:00
    x: 1
    b: bar
    c: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
    other_t: 2020-05-13 02:30:23.422613

Now, we can load this back:

df2 = pd.DataFrame(yaml.load(text, Loader=yaml.SafeLoader)).set_index('date')

And (drum roll, please):

df2.equals(df)
# True