Tags: python, python-2.7, yaml, pyyaml

How to get string objects instead of Unicode from YAML efficiently while using construct_mapping and add_constructor?


I'm using PyYAML (version 5.1) with Python 2 to parse the YAML body of an incoming POST API request.

Once parsed, the body of the incoming request contains a mix of unicode objects and str objects.

I use the solution given in the link to load the YAML mapping into an OrderedDict, where stream is the YAML body of the incoming POST request.

However, I have to pass the resulting OrderedDict to a library that only accepts str objects.

I can't change or update that library, and I have to use Python 2.

My current solution is to:

  1. take the OrderedDict generated by the solution from the link
  2. walk it recursively, converting every unicode object found into a str object

Sample code for this:

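def convert(data):
    """Recursively convert unicode objects to UTF-8 encoded str (Python 2)."""
    if isinstance(data, unicode):
        return data.encode('utf-8')
    if isinstance(data, list):
        return [convert(item) for item in data]
    if isinstance(data, dict):
        # Note: this builds a plain dict, so an OrderedDict's key order is not kept.
        newData = {}
        for key, value in data.iteritems():
            newData[convert(key)] = convert(value)
        return newData
    # Anything else (int, float, str, None, ...) is returned unchanged.
    return data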

Although this works, it is not efficient, because the complete OrderedDict is traversed again after it has been created.

Is there a way to do the conversion before or while the OrderedDict is being built, so that it doesn't have to be traversed a second time?


Solution

  • You can provide a custom constructor that always loads YAML !!str scalars as Python unicode strings:

    import yaml
    from yaml.resolver import BaseResolver
    
    def unicode_constructor(loader, node):
        # construct_scalar always returns the scalar's value as a unicode string;
        # the default constructor would convert it to an ASCII-encoded str if possible.
        return loader.construct_scalar(node)
    
    yaml.add_constructor(BaseResolver.DEFAULT_SCALAR_TAG, unicode_constructor)
    

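    Afterwards, yaml.load will return unicode strings for all !!str scalars. To get str objects instead, as the question asks, encode the value returned by construct_scalar inside the constructor, e.g. with .encode('utf-8').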

    (Code untested as I don't have a Python 2 installation)
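
For the case in the question (str objects inside an OrderedDict, on Python 2), both hooks can be registered on one loader, so the conversion happens while the document is being constructed. The following is only a minimal sketch and has not been run on Python 2. The names StrOrderedLoader, construct_utf8_str, construct_ordered_mapping and load_ordered_str are made up for illustration, using yaml.SafeLoader as the base class is an assumption, and the mapping constructor follows the common construct_pairs recipe, which may differ from the one in the link.

```python
import sys
from collections import OrderedDict

import yaml
from yaml.resolver import BaseResolver

PY2 = sys.version_info[0] == 2


class StrOrderedLoader(yaml.SafeLoader):
    """Hypothetical loader: ordered mappings, byte-string scalars on Python 2."""


def construct_utf8_str(loader, node):
    value = loader.construct_scalar(node)
    # On Python 2 the scalar value is a unicode object; encode it to a UTF-8 str.
    # On Python 3 it is already a str and is returned unchanged.
    if PY2 and isinstance(value, type(u"")):
        return value.encode("utf-8")
    return value


def construct_ordered_mapping(loader, node):
    # Resolve "<<" merge keys, then keep the key/value pairs in document order.
    loader.flatten_mapping(node)
    return OrderedDict(loader.construct_pairs(node, deep=True))


# Registering on the subclass leaves the stock loaders untouched.
StrOrderedLoader.add_constructor(
    BaseResolver.DEFAULT_SCALAR_TAG, construct_utf8_str)
StrOrderedLoader.add_constructor(
    BaseResolver.DEFAULT_MAPPING_TAG, construct_ordered_mapping)


def load_ordered_str(stream):
    """Load YAML into OrderedDicts with native str keys and values."""
    return yaml.load(stream, Loader=StrOrderedLoader)


data = load_ordered_str(u"b: caf\u00e9\na:\n  - x\n  - 1\n")
print(list(data.keys()))
```

Because mapping keys, mapping values and sequence items are all scalar nodes that pass through the same constructor, no second traversal of the result should be needed. Non-ASCII text ends up as UTF-8 encoded bytes on Python 2, so the receiving library has to be able to handle that encoding.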