Python 3 chokes on CP-1252/ANSI reading


I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:

  File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>

The files are opened with open() with no extra arguments. Can I pass extra arguments to open(), or use something in the codecs module, to open these files differently?

This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.

UPDATE: It turns out this is the result of feeding a zip file into the parser. The unit test actually expects this to happen: the parser should recognize the file as something that can't be parsed. So I need to change my exception handling, which I'm in the process of doing now.
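For the exception-handling change described in the update, a minimal sketch could look like the one below. It is not the asker's actual parser: `UnparseableInputError` and `read_text_or_fail` are hypothetical names, and the cp1252 default is an assumption taken from the traceback. The idea is to reject zip archives up front with `zipfile.is_zipfile()` and to turn any remaining `UnicodeDecodeError` into the parser's own "can't parse this" error.

```python
import zipfile


class UnparseableInputError(Exception):
    """Hypothetical error meaning 'this input is not parseable text'."""


def read_text_or_fail(path, encoding="cp1252"):
    """Return the decoded text of path, or raise UnparseableInputError."""
    # Reject zip archives explicitly instead of waiting for a decode failure.
    if zipfile.is_zipfile(path):
        raise UnparseableInputError(f"{path} is a zip archive, not text")
    try:
        with open(path, encoding=encoding) as handle:
            return handle.read()
    except UnicodeDecodeError as exc:
        # Binary data that is not a zip file still ends up here.
        raise UnparseableInputError(
            f"{path} is not valid {encoding} text"
        ) from exc
```

A unit test can then assert that `UnparseableInputError` is raised for the zip file, rather than depending on the codec's own exception.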


Solution

  • Byte 0x81 is unassigned in Windows-1252 (aka cp1252). In Latin-1 (aka ISO 8859-1) it is assigned to the control character U+0081 HIGH OCTET PRESET (HOP). I can reproduce your error in Python 3.1 like this:

    >>> b'\x81'.decode('cp1252')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
    

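    or with an actual file (on my machine the default encoding is UTF-8, so the error comes from the utf8 codec rather than cp1252):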

    >>> open('test.txt', 'wb').write(b'\x81\n')
    2
    >>> open('test.txt').read()
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
    

    Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:

    >>> open('test.txt', encoding='latin-1').read()
    '\x81\n'
    

    Beware that there are differences between the Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
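To make the difference between the two encodings concrete, here is a small sketch (the sample bytes are my own illustration, not taken from the asker's files). It decodes the same bytes both ways, then lists which byte values Python's cp1252 codec refuses to decode. Latin-1 maps all 256 byte values, so it never raises, but that also means it will happily "decode" binary junk.

```python
# Bytes 0x93 and 0x94 are curly quotes in cp1252 but C1 controls in Latin-1.
raw = b"\x93quoted\x94"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252))  # starts with U+201C, ends with U+201D
print(repr(as_latin1))  # starts with U+0093, ends with U+0094

# Collect every byte value that the cp1252 codec cannot decode.
undefined_in_cp1252 = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)

print([hex(value) for value in undefined_in_cp1252])

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
replaced = b"\x81".decode("cp1252", errors="replace")
print(repr(replaced))
```

The same `errors` argument is accepted by open(), so `open(path, encoding='cp1252', errors='replace')` reads such a file without raising, at the cost of silently losing the original byte.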