just another question about encoding in python i think. I have this programm:
regex = re.compile(ur'\b[sw]\w+', flags= re.U | re.I)
ergebnisliste = []
for line in fileobject:
print str(line)
erg = regex.findall(line)
ergebnisliste = ergebnisliste + erg
ergebnislistesortiert = sorted(ergebnisliste, key=lambda x: len(x))
print ergebnislistesortiert
fileobject.close()
I am searching a textfile for words beginning with s or w. My "ergebnislistesortiert" is the sorted result list. I will print the result list and there appers to be a problem with the encoding:
['so', 'Wer', 'sp\xc3']
the 'sp\xc3' should be print as spät. What is wrong here? Why is the list element utf-8?
And how can i get the right decoding to print "spät"?
Thanks a lot guys!
\xc3
is not UTF-8. It's a fragment of the full UTF-8 encoding of U+00E4 but you're probably reading it with something like a Latin-1 decoder (which is effectively what Python 2 does if you read bytes without specifying an encoding), in which case the second byte in the UTF-8 sequence isn't matched by \w
.
The real fix is to decode the data when you are reading it into Python in the first place. If you are writing new code, switching to Python 3 is probably the best and easiest fix.
If you're stuck on Python 2.7, a somewhat Python 3-compatible approach is something like
import io
fileobject = io.open(filename, encoding='utf-8')
If you have control over the input file and want to postpone the proper solution until you are older, (ask your parents for permission to) convert the UTF-8 input file to some legacy 8-bit encoding.