python python-2.7 python-3.7 python-unicode pyyaml

YAML dump and load with Unicode characters, compatible with both Python 2.7 and Python 3.7

I'm having issues loading and dumping YAML files with PyYAML that need to be compatible with both Python 2 and Python 3.

For dumping with Python 3 and loading with Python 2, I found a solution:

import yaml
data = {"d": "😋"}
with open(file_path, "w") as f:
    yaml.dump(data, f, allow_unicode=True)

This produces a yaml file with this line:

d: 😋

If I try to load this file with Python 2:

with open(file_path, "r") as f:
    y = yaml.safe_load(f)
    print(y["d"])

I get the following output:

😋

But to dump a file with Python 2, I first tried:

data = {"d": u"😋"}
with open(file_path, "w") as f:
    yaml.dump(data, f)

which produces a YAML file:

d: "\uD83D\uDE0B"

I also tried:

data = {"d": u"😋".encode("utf-8")}
with open(file_path, "w") as f:
    yaml.dump(data, f)

which produces a YAML file:

d: !!python/str "\uD83D\uDE0B"

In both cases, if I load with Python 3:

with open(file_path, "r") as f:
    y = yaml.load(f)

then y["d"] is '\ud83d\ude0b', a pair of separate surrogate code points rather than the single character 😋, so it cannot be used as is.

I found out I could do something like

y["d"].encode("utf-16", "surrogatepass").decode("utf-16")

but that seems like overkill.
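To make that workaround concrete, here is a minimal Python 3 sketch of what the round trip does to the string. The helper name join_surrogates is only illustrative and is not part of PyYAML or the standard library.

```python
def join_surrogates(text):
    # Encode to UTF-16 while letting lone surrogates through, then decode:
    # the decoder reads the two surrogate code units as one code point.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


broken = "\ud83d\ude0b"          # what Python 3 gets back from the file
fixed = join_surrogates(broken)  # the single character U+1F60B

print(len(broken), len(fixed))
print(fixed == "\U0001F60B")
```

Text that contains no surrogates comes back unchanged, so the helper can be applied to every loaded string.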

So what's the solution for dumping a file with Python 2 that is readable and properly interpreted in Python 3?

Solution

I ended up adding a constructor for this. In my code I register it on a custom loader with self.add_constructor, but it works the same way at the yaml module level, which is easier to illustrate:

def unicode_constructor(loader, node):
    # Recombine the surrogate pair into a single code point.
    scalar = loader.construct_scalar(node)
    return scalar.encode("utf-16", "surrogatepass").decode("utf-16")

# Register after the function is defined, otherwise the name does not exist yet.
yaml.add_constructor("tag:yaml.org,2002:python/str", unicode_constructor)

This works for Python 2 dump / Python 3 load, and it doesn't affect Python 3 dump / Python 2 or 3 load.
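
For reference, the custom-loader variant could look roughly like the sketch below, run under Python 3. The class and function names are made up for illustration. Registering the plain str tag as well, to cover the untagged d: "\uD83D\uDE0B" output, is an assumption on my part and not something I tested against my real files.

```python
import yaml


def join_surrogates(text):
    # Round-trip through UTF-16 so a surrogate pair becomes one code point.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


class SurrogateJoiningLoader(yaml.SafeLoader):
    """Hypothetical loader subclass; the name is illustrative only."""


def joined_str_constructor(loader, node):
    return join_surrogates(loader.construct_scalar(node))


# add_constructor is a classmethod, so only this subclass is changed.
# The second tag is an assumption added to cover untagged strings.
for tag in ("tag:yaml.org,2002:python/str", "tag:yaml.org,2002:str"):
    SurrogateJoiningLoader.add_constructor(tag, joined_str_constructor)

# The two documents Python 2 produced in the question.
tagged = 'd: !!python/str "\\uD83D\\uDE0B"\n'
untagged = 'd: "\\uD83D\\uDE0B"\n'

loaded_tagged = yaml.load(tagged, Loader=SurrogateJoiningLoader)
loaded_untagged = yaml.load(untagged, Loader=SurrogateJoiningLoader)
print(loaded_tagged["d"] == loaded_untagged["d"])
```

Keeping the constructors on a subclass leaves the stock loaders untouched, so other code that loads YAML in the same process sees no change.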