Tags: python, yaml, python-3.6, pyyaml

UnicodeDecodeError while processing accented words


I have a Python script that reads a YAML file (it runs on an embedded system). Without accents, the script runs normally on both my development machine and the embedded system. With accented words, however, it crashes with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

only in the embedded environment.

The YAML sample:

data: ã

The snippet which reads the YAML:

with open(YAML_FILE, 'r') as stream:
  try:
    data = yaml.load(stream)

Tried a bunch of solutions without success.

Versions: Python 3.6, PyYAML 3.12


Solution

  • The codec that is decoding your bytes has been set to ASCII, which restricts you to byte values between 0 and 127.

    The UTF-8 encoding of an accented character uses byte values outside this range (ã is stored as the two bytes 0xC3 0xA3), so you get a decoding error.

    A UTF-8 codec decodes ASCII as well as UTF-8, because ASCII is by design a (very small) subset of UTF-8.

    If you can change the codec to UTF-8, it should work. In Python 3, open() without an encoding argument falls back to the locale's preferred encoding, which is likely ASCII on your embedded system.

    In general, you should always specify how a byte stream is decoded to text; otherwise the stream is ambiguous and the result depends on the environment the script runs in.
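The points above can be sketched with the standard library alone. This is a minimal illustration, not the asker's actual script: the file name sample.yaml and the helper read_text_utf8 are hypothetical, and the sample bytes are the "data: ã" line from the question. It reproduces the ASCII failure, prints the encoding that open() would fall back to on the current machine, and then reads the file with an explicit UTF-8 codec.

```python
import locale
import os
import tempfile

# "data: ã" encoded as UTF-8; "ã" (U+00E3) becomes the two bytes 0xC3 0xA3.
raw = "data: \u00e3\n".encode("utf-8")

# Decoding with the ASCII codec reproduces the error from the question.
ascii_error_position = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error_position = exc.start  # index of the first undecodable byte
    print("ASCII decode failed:", exc)

# What open() uses when no encoding is given; this depends on the environment.
print("Locale default encoding:", locale.getpreferredencoding(False))


def read_text_utf8(path):
    """Hypothetical helper: read a file as text, always decoding as UTF-8."""
    with open(path, "r", encoding="utf-8") as stream:
        return stream.read()


# Write the sample bytes to a temporary file and read them back as UTF-8.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.yaml")
    with open(sample_path, "wb") as sample:
        sample.write(raw)
    text = read_text_utf8(sample_path)

# ascii() keeps the output printable even on an ASCII-only console.
print(ascii(text))
```

Applied to the snippet in the question, this would mean opening the file with open(YAML_FILE, 'r', encoding='utf-8') and passing the stream to yaml.load as before. Another option that should work is opening the file in binary mode ('rb'), since PyYAML accepts byte streams and expects them to be UTF-8 or UTF-16.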