Python UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3


I'm reading a config file in Python, getting its sections, and creating a new config file for each section.

However, I'm getting a decode error because one of the strings contains Español=spain:

self.output_file.write( what.replace( " = ", "=", 1 ) )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

How would I adjust my code to allow for encoded characters such as these? I'm very new to this, so please excuse me if this is something simple.

import ConfigParser  # Python 2 module name

class EqualsSpaceRemover:
    output_file = None
    def __init__( self, new_output_file ):
        self.output_file = new_output_file

    def write( self, what ):
        self.output_file.write( what.replace( " = ", "=", 1 ) )

def get_sections():
    configFilePath = 'C:\\test.ini'
    config = ConfigParser.ConfigParser()
    config.optionxform = str
    config.read(configFilePath)
    for section in config.sections():
        configdata = {k:v for k,v in config.items(section)}
        confignew = ConfigParser.ConfigParser()
        cfgfile = open("C:\\" + section + ".ini", 'w')
        confignew.add_section(section)
        for x in configdata.items():
            confignew.set(section,x[0],x[1])
        confignew.write( EqualsSpaceRemover( cfgfile ) )
        cfgfile.close()

Solution

  • If you use Python 2 with from __future__ import unicode_literals, every string literal you write becomes a unicode literal, as if you had prefixed it with u"...", unless you explicitly write b"...".

    This explains why you get a UnicodeDecodeError on this line:

    what.replace(" = ", "=", 1)
    

    because what you actually do is

    what.replace(u" = ",u"=",1 )
    

    ConfigParser uses plain str for its items when it reads a file with the parser.read() method, which means what will be a str. If you pass unicode arguments to str.replace(), the string is first converted (decoded) to unicode, the replacement is applied, and the result is returned as unicode. But if what contains bytes that can't be decoded with the default encoding (ascii), you get a UnicodeDecodeError where you wouldn't expect one. In a UTF-8 encoded file the ñ in Español is stored as the two bytes 0xc3 0xb1, and 0xc3 at position 4 is exactly the byte named in the error message.
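The implicit decode described above can be made visible with a small sketch. It is written for Python 3, where the decode has to be spelled out explicitly, so it only imitates what Python 2 does behind the scenes. The sample bytes assume the file is UTF-8 encoded, and the names raw and error are illustrative only.

```python
# Python 3 sketch of the step Python 2 performs implicitly when a byte
# string meets a unicode argument: decode with the default 'ascii' codec.
# Assumption: the config file is UTF-8, so "Español" arrives as these bytes.
raw = b"Espa\xc3\xb1ol = spain"

try:
    raw.decode("ascii")  # what Python 2 attempts behind the scenes
    error = None
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    # The offending byte and its offset match the traceback in the question.
    print(error)
```

The first four bytes (E, s, p, a) are plain ASCII, so decoding fails at index 4, on the first byte of the encoded ñ.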

    So to make this work you can

    • use explicit prefixes for byte strings: what.replace(b" = ", b"=", 1)
    • or remove the unicode_literals future import.
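A minimal sketch of the first option, using the wrapper class from the question with explicit byte-string literals. It targets the Python 2 situation described here, where ConfigParser hands byte strings to write(); the wrapper is exercised directly with bytes so the snippet also runs on Python 3. The io.BytesIO buffer and the sample line stand in for the real output file and are assumptions for illustration.

```python
import io


class EqualsSpaceRemover(object):
    """File-like wrapper that rewrites the first ' = ' of each write as '='."""

    def __init__(self, output_file):
        self.output_file = output_file

    def write(self, what):
        # b"..." stays a byte string even when unicode_literals is active,
        # so no implicit ascii decode of `what` is triggered.
        self.output_file.write(what.replace(b" = ", b"=", 1))


# Stand-in for the real file: an in-memory byte buffer (illustrative only).
buf = io.BytesIO()
EqualsSpaceRemover(buf).write(b"Espa\xc3\xb1ol = spain\n")
result = buf.getvalue()
```

Because both the data and the replacement arguments are bytes, the non-ASCII bytes pass through untouched.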

    Generally you shouldn't mix unicode and str (Python 3 fixes this by making it an error in almost every case). Be aware that from __future__ import unicode_literals turns every unprefixed literal into unicode; it doesn't automatically make your code work with unicode in all cases. In many cases it does quite the opposite.
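One way to avoid mixing the two types is to decode once at the input boundary, work on unicode text throughout, and encode once on output. The sketch below is not part of the original answer; it assumes the data is UTF-8, and the io.StringIO buffer and variable names are hypothetical stand-ins for the real files.

```python
import io


class EqualsSpaceRemover(object):
    """File-like wrapper operating on unicode text only."""

    def __init__(self, output_file):
        self.output_file = output_file

    def write(self, what):
        # Text in, text out: both `what` and the literals are unicode.
        self.output_file.write(what.replace(u" = ", u"=", 1))


raw_line = b"Espa\xc3\xb1ol = spain\n"  # bytes as read from disk
text_line = raw_line.decode("utf-8")    # decode once; UTF-8 is an assumption

out = io.StringIO()                     # stand-in for a text-mode output file
EqualsSpaceRemover(out).write(text_line)
encoded = out.getvalue().encode("utf-8")  # encode once when leaving the program
```

With a real file, opening it via io.open(path, "w", encoding="utf-8") would take care of the final encoding step.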