python pandas yaml pyyaml

Making a pandas series safe YAML


I'm working with one script that dumps a pandas series to a yaml file:

with open('ex.py','w') as f:
    yaml.dump(a_series,f)

And then another script that opens the yaml file for the pandas series:

with open('ex.py','r') as f:
    yaml.safe_load(a_series,f)

I'm trying to safe_load the series but I get a constructor error. How can I specify that the pandas series is safe to load?


Solution

  • When you use PyYAML's load, you are asserting that everything in the YAML document you are loading is safe. For a document you cannot fully trust that assertion does not hold, which is why you need to use yaml.safe_load.

    In your case this leads to an error, because safe_load doesn't know how to construct pandas internals that have tags in the YAML document like:

    !!python/name:pandas.core.indexes.base.Index
    

    and

    !!python/tuple
    

    etc.

    You would need to provide constructors for all of these objects, add them to the SafeLoader and then do a_series = yaml.safe_load(f). That can be a lot of work, especially since what looks like a small change to the data in your series might require you to add further constructors.
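
    As a rough illustration of what providing a constructor involves, here is a minimal sketch for only one of the tags above, !!python/tuple. The names TupleSafeLoader and construct_tuple are made up for this example, and a subclass is used so the global SafeLoader stays untouched. A real Series dump contains many more tags than this one.

```python
import yaml


class TupleSafeLoader(yaml.SafeLoader):
    """SafeLoader subclass (hypothetical name) that also accepts !!python/tuple."""


def construct_tuple(loader, node):
    # Build the underlying sequence with the safe machinery, then convert it.
    return tuple(loader.construct_sequence(node))


# '!!python/tuple' is shorthand for this full tag.
TupleSafeLoader.add_constructor('tag:yaml.org,2002:python/tuple', construct_tuple)

document = yaml.dump((1, 2, 3))  # the plain dumper emits a !!python/tuple tag
restored = yaml.load(document, Loader=TupleSafeLoader)
print(restored)

# The unmodified safe loader still rejects the same document.
try:
    yaml.safe_load(document)
    rejected = False
except yaml.constructor.ConstructorError:
    rejected = True
print(rejected)
```

    Every other tag in the dump would need the same treatment, which is why this approach gets tedious quickly.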

    Alternatively, you could dump the dict representation of your Series and load that back. Some information is lost in this process, and I am not sure whether that is acceptable for you:

    import yaml
    from pandas import Series
    
    def series_representer(dumper, data):
        return dumper.represent_mapping(u'!pandas.series', data.to_dict())
    
    yaml.add_representer(Series, series_representer, Dumper=yaml.SafeDumper)
    
    def series_constructor(loader, node):
        d = loader.construct_mapping(node)
        return Series(d)
    
    yaml.add_constructor(u'!pandas.series', series_constructor, Loader=yaml.SafeLoader)
    
    data = Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])
    
    with open('ex.yaml', 'w') as f:
        yaml.safe_dump(data, f)
    
    with open('ex.yaml') as f:
        s = yaml.safe_load(f)
    
    print(s)
    print(type(s))
    

    which gives:

    a    1
    b    2
    c    3
    d    4
    e    5
    dtype: int64
    <class 'pandas.core.series.Series'>
    

    And the ex.yaml file contains:

    !pandas.series {a: 1, b: 2, c: 3, d: 4, e: 5}
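
    To make the information loss concrete, here is a small sketch that skips YAML entirely and only looks at the to_dict() round trip. The series name 'counts' is an arbitrary example value. The labels and values survive, but the name of the Series does not, because a plain dict has nowhere to store it.

```python
import pandas as pd

original = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='counts')

# This is the same conversion the representer/constructor pair relies on.
rebuilt = pd.Series(original.to_dict())

print(rebuilt.tolist() == original.tolist())  # values are preserved
print(list(rebuilt.index))                    # labels are preserved
print(original.name, rebuilt.name)            # the name is not
```

    If you need such attributes, you would have to store them explicitly in the mapping and restore them in the constructor.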
    

    There are a few things to note:

    • YAML documents are normally written to files with a .yaml extension. Using .py is bound to get you confused, or have you overwrite some program source files at some point.

    • yaml.load() and yaml.safe_load() take a stream as their first parameter. You use them like:

      data = yaml.safe_load(stream)
      

      and not like:

      yaml.safe_load(data, stream)
      
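    • It would be better to have a two-step constructor, which allows you to construct self-referential data structures. However, Series.append() does not work for that, because it returns a new Series instead of modifying d in place (and it has since been removed in pandas 2.0):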

      def series_constructor(loader, node):
          d = Series()
          yield d
          d.append(Series(loader.construct_mapping(node)))
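
    The calling convention from the second note can be tried without touching the file system. This is only a sketch: it uses an in-memory io.StringIO as the stream, and wrong_call_error is a name made up for the example.

```python
import io
import yaml

# safe_load takes the stream as its only argument and returns the data.
stream = io.StringIO("a: 1\nb: 2\n")
data = yaml.safe_load(stream)
print(data)

# Passing the data as an extra first argument is rejected before any parsing.
wrong_call_error = None
try:
    yaml.safe_load(data, io.StringIO("a: 1\n"))
except TypeError as exc:
    wrong_call_error = exc
    print("TypeError:", exc)
```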
      

    If dumping the Series via a dictionary is not good enough (because it simplifies the series' data), and if you don't care about the readability of the generated YAML, you can use to_pickle() instead of .to_dict(). You would have to work with temporary files, though, because (at least in the pandas version used here) that method does not handle file-like objects and expects a file name string as its argument.
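
    A minimal sketch of that pickle route, going through temporary files as described. The helper names series_to_bytes and series_from_bytes and the key series_pickle are invented for this example. Keep in mind that unpickling is not safe for untrusted input, so this only makes sense for YAML files you produced yourself.

```python
import os
import tempfile

import pandas as pd
import yaml


def series_to_bytes(series):
    # Pickle to a temporary file, then read the raw bytes back.
    fd, path = tempfile.mkstemp(suffix='.pkl')
    os.close(fd)
    try:
        series.to_pickle(path)
        with open(path, 'rb') as f:
            return f.read()
    finally:
        os.remove(path)


def series_from_bytes(raw):
    # Write the bytes to a temporary file so read_pickle gets a file name.
    fd, path = tempfile.mkstemp(suffix='.pkl')
    os.close(fd)
    try:
        with open(path, 'wb') as f:
            f.write(raw)
        return pd.read_pickle(path)
    finally:
        os.remove(path)


original = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='counts')

# safe_dump stores bytes as a base64-encoded !!binary scalar.
document = yaml.safe_dump({'series_pickle': series_to_bytes(original)})
restored = series_from_bytes(yaml.safe_load(document)['series_pickle'])

print(restored.equals(original), restored.name)
```

    The resulting YAML is not human readable, but the name, index and dtype of the Series come back intact.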