Search code examples
pandaspyyaml

yaml dump of a pandas dataframe


I figured I'd share, since I searched that on SO and couldn't quite find what I needed.

I wanted to dump a pd.DataFrame into a yaml file.

Timestamps should be shown nicely, not as the default:

  date: !!python/object/apply:pandas._libs.tslibs.timestamps.Timestamp
  - 1589241600000000000
  - null
  - null

Also, the output should be correct YaML format, i.e., it should be readable back by yaml.load. The output should be reasonably concise, i.e. preferring the 'flow' format.

As an example, here is some data:

df = pd.DataFrame([
    dict(
        date=pd.Timestamp.now().normalize() - pd.Timedelta('1 day'),
        x=0,
        b='foo',
        c=[1,2,3,4],
        other_t=pd.Timestamp.now(),
    ),
    dict(
        date=pd.Timestamp.now().normalize(),
        x=1,
        b='bar',
        c=list(range(32)),
        other_t=pd.Timestamp.now(),
    ),
]).set_index('date')

Solution

  • Here is what I came up with. It has some customization of the Dumper to handle Timestamp. The output is more legible, and still valid yaml. Upon loading, yaml recognizes the format of a valid datetime (ISO format, I think), and re-creates those as datetime. In fact, we can read it back into a DataFrame, where these datetimes are automatically converted into Timestamp. After a minor reset of index, we observe that the new df is identical to the original.

    import yaml
    from yaml import CDumper
    from yaml.representer import SafeRepresenter
    import datetime
    
    
    class TSDumper(CDumper):
        pass
    
    def timestamp_representer(dumper, data):
        return SafeRepresenter.represent_datetime(dumper, data.to_pydatetime())
    
    TSDumper.add_representer(datetime.datetime, SafeRepresenter.represent_datetime)
    TSDumper.add_representer(pd.Timestamp, timestamp_representer)
    

    With this, now we can do:

    text = yaml.dump(
        df.reset_index().to_dict(orient='records'),
        sort_keys=False, width=72, indent=4,
        default_flow_style=None, Dumper=TSDumper,
    )
    print(text)
    

    The output is relatively clean:

    -   date: 2020-05-12 00:00:00
        x: 0
        b: foo
        c: [1, 2, 3, 4]
        other_t: 2020-05-13 02:30:23.422589
    -   date: 2020-05-13 00:00:00
        x: 1
        b: bar
        c: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
            19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
        other_t: 2020-05-13 02:30:23.422613
    

    Now, we can load this back:

    df2 = pd.DataFrame(yaml.load(text, Loader=yaml.SafeLoader)).set_index('date')
    

    And (drum roll, please):

    df2.equals(df)
    # True