I'm using a web app to retrieve data from the results of a game I play. As I'm Brazilian and my language has some accented Latin characters, most of the data I retrieve arrives garbled. For example:
Carlos López = Carlos Lã³Pez
I searched the internet and found ftfy recommended as a good fixer for broken text. I don't know much about Unicode, ASCII and so on, so I simply tried ftfy, and the output had the same errors as the input.
In[15]: ftfy.fix_text('Carlos Lã³Pez')
Out[15]: 'Carlos Lã³Pez'
In[16]: ftfy.fix_encoding('Carlos Lã³Pez')
Out[16]: 'Carlos Lã³Pez'
In[17]: ftfy.fix_text('Carlos Lã³Pez')
Out[17]: 'Carlos Lã³Pez'
print(ftfy.fix_text('Carlos Lã³Pez'))
Carlos Lã³Pez
print(ftfy.fix_encoding('Carlos Lã³Pez'))
Carlos Lã³Pez
ftfy.explain_unicode('Carlos Lã³Pez')
U+0043 C [Lu] LATIN CAPITAL LETTER C
U+0061 a [Ll] LATIN SMALL LETTER A
U+0072 r [Ll] LATIN SMALL LETTER R
U+006C l [Ll] LATIN SMALL LETTER L
U+006F o [Ll] LATIN SMALL LETTER O
U+0073 s [Ll] LATIN SMALL LETTER S
U+0020 [Zs] SPACE
U+004C L [Lu] LATIN CAPITAL LETTER L
U+00E3 ã [Ll] LATIN SMALL LETTER A WITH TILDE
U+00B3 ³ [No] SUPERSCRIPT THREE
U+0050 P [Lu] LATIN CAPITAL LETTER P
U+0065 e [Ll] LATIN SMALL LETTER E
U+007A z [Ll] LATIN SMALL LETTER Z
ftfy.explain_unicode(unidecode('Carlos Lã³Pez'))
U+0043 C [Lu] LATIN CAPITAL LETTER C
U+0061 a [Ll] LATIN SMALL LETTER A
U+0072 r [Ll] LATIN SMALL LETTER R
U+006C l [Ll] LATIN SMALL LETTER L
U+006F o [Ll] LATIN SMALL LETTER O
U+0073 s [Ll] LATIN SMALL LETTER S
U+0020 [Zs] SPACE
U+004C L [Lu] LATIN CAPITAL LETTER L
U+0061 a [Ll] LATIN SMALL LETTER A
U+0033 3 [Nd] DIGIT THREE
U+0050 P [Lu] LATIN CAPITAL LETTER P
U+0065 e [Ll] LATIN SMALL LETTER E
U+007A z [Ll] LATIN SMALL LETTER Z
print(ftfy.fix_encoding(unidecode('Carlos Lã³Pez')))
Carlos La3Pez
print(ftfy.fix_text(unidecode('Carlos Lã³Pez')))
Carlos La3Pez
I'd like to know if there's any package that fixes this kind of error, or if you could give me a lead on why Carlos López turned into Carlos Lã³Pez. I'd appreciate it.
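Wow, that was tough :) Your string was in the wrong encoding and the wrong character case, too. The ó was stored as the UTF-8 bytes C3 B3, which were then read as cp1252 and became Ã³. After that the name was title-cased, which lowercased Ã to ã and capitalized the p that follows ³. That case change is why ftfy no longer recognizes the pattern. Undo the case change first, then the encoding: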
s = 'Carlos Lã³Pez'
s.upper().encode('cp1252').decode('utf-8').title()
# 'Carlos López'
This code works in Python 3, but not in Python 2, where a plain string literal is a byte string rather than Unicode text.
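If it helps, here is a minimal sketch that reproduces the damage and wraps the fix in a function. The names break_name and repair_name are made up for this example, and the sketch assumes the data went through exactly the steps described above (UTF-8 bytes read as cp1252, then title-cased). That is a guess about what the web app does, not something I can confirm.

```python
def break_name(name: str) -> str:
    """Reproduce the assumed damage: UTF-8 bytes read as cp1252, then title-cased."""
    return name.encode('utf-8').decode('cp1252').title()


def repair_name(broken: str) -> str:
    """Reverse the assumed damage; return the input unchanged if a codec step fails."""
    try:
        # upper() restores the capital 'Ã' that title-casing lowercased,
        # so the cp1252 bytes match the original UTF-8 bytes again.
        return broken.upper().encode('cp1252').decode('utf-8').title()
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string does not fit the pattern, so leave it alone.
        return broken


print(break_name('Carlos López'))
print(repair_name('Carlos Lã³Pez'))
```

Note that the upper() step is a heuristic. It can fail for accented letters whose second byte maps to a character that itself changes under case conversion, and in that case the function simply hands back the input, so check the results on your real data.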