python-2.7, utf-8, string-decoding

Python Decoding and Encoding, List Element utf-8


Just another question about encoding in Python, I think. I have this program:

regex = re.compile(ur'\b[sw]\w+', flags= re.U | re.I)
ergebnisliste = []
for line in fileobject:
  print str(line) 
  erg = regex.findall(line)
  ergebnisliste = ergebnisliste + erg
ergebnislistesortiert = sorted(ergebnisliste, key=lambda x: len(x))
print ergebnislistesortiert
fileobject.close()

I am searching a text file for words beginning with s or w. "ergebnislistesortiert" is the sorted result list. When I print it, there appears to be a problem with the encoding:

['so', 'Wer', 'sp\xc3']

The 'sp\xc3' should be printed as spät. What is wrong here? Why is the list element UTF-8?

And how can I get the right decoding to print "spät"?

Thanks a lot guys!


Solution

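  • \xc3 on its own is not valid UTF-8. It is only the first byte of the two-byte UTF-8 encoding of ä (U+00E4), which is \xc3\xa4. You are probably matching the raw bytes with something like a Latin-1 decoder (which is effectively what Python 2 does if you read bytes without specifying an encoding). Read that way, \xc3 is the letter Ã, which \w matches, but the second byte \xa4 is the currency sign ¤, which \w does not match, so the match stops after the first byte.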

    The real fix is to decode the data when you are reading it into Python in the first place. If you are writing new code, switching to Python 3 is probably the best and easiest fix.

    If you're stuck on Python 2.7, a somewhat Python 3-compatible approach is something like this:

    import io
    # io.open decodes the file while reading, so each line is a unicode string
    fileobject = io.open(filename, encoding='utf-8')

    If you have control over the input file and want to postpone the proper solution until you are older, (ask your parents for permission to) convert the UTF-8 input file to some legacy 8-bit encoding such as Latin-1, where ä is a single byte.
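
To make the explanation above concrete, here is a minimal sketch that reproduces both the truncated match and the fix. The sample sentence and the temporary file are made up for illustration. The Latin-1 decode step only simulates the byte-per-character matching described above; it is not code from the question. The script is written so that it should run on both Python 2.7 and Python 3.

```python
from __future__ import unicode_literals

import io
import os
import re
import tempfile

# Same pattern as in the question: words starting with s or w.
regex = re.compile(r'\b[sw]\w+', flags=re.U | re.I)

# Hypothetical sample input, written to a temporary file as UTF-8 bytes.
sample = 'Wer kommt so sp\u00e4t?\n'
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(sample.encode('utf-8'))

try:
    # The problem: every raw byte is treated as one Latin-1 character,
    # so the two bytes of the UTF-8 encoded a-umlaut become U+00C3 (a letter)
    # followed by U+00A4 (a currency sign, which \w does not match).
    with io.open(path, 'rb') as f:
        mis_decoded = f.read().decode('latin-1')
    broken = regex.findall(mis_decoded)
    # broken is ['Wer', 'so', 'sp' + U+00C3]: the last word is cut short.

    # The fix: let io.open decode UTF-8 while reading, then match on text.
    results = []
    with io.open(path, encoding='utf-8') as fileobject:
        for line in fileobject:
            results.extend(regex.findall(line))
    results_sorted = sorted(results, key=len)
    # results_sorted is ['so', 'Wer', 'sp' + U+00E4 + 't'], i.e. the full word.
finally:
    os.remove(path)
```

On Python 2, printing the list itself still shows escaped reprs such as u'sp\xe4t'. Printing the individual strings, for example the result of ', '.join(results_sorted), displays the umlaut as long as the terminal encoding supports it.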