python, utf-8, decode, encode, latin1

Python UTF-8 Latin-1 displays wrong character


I'm writing a very small script that converts Latin-1 characters to Unicode (I'm a complete beginner in Python).

I tried a method like this:

def latin1_to_unicode(character):
    uni = character.decode('latin-1').encode("utf-8")
    return uni

It works fine for characters that are not specific to the latin-1 set, but if I try the following example:

print latin1_to_unicode('å')

It returns Ã¥ instead of å. Same goes for other letters like æ and ø.

Can anyone please explain why this is happening? Thanks

I have the # -*- coding: utf8 -*- declaration in my script, in case it matters to the problem.


Solution

  • Your source code is encoded as UTF-8, but you are decoding the data as Latin-1. Don't do that; you are creating mojibake.

    Decode from UTF-8 instead, and don't encode again. print will write to sys.stdout which will have been configured with your terminal or console codec (detected when Python starts).
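A minimal sketch of that fix, assuming the input really is a UTF-8 byte string (as it is when the literal comes from a UTF-8 source file or terminal). The name utf8_to_unicode is hypothetical, and the bytes are written as an explicit b'...' literal so the example should behave the same on Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-

def utf8_to_unicode(data):
    """Decode a UTF-8 byte string to a Unicode string (hypothetical helper)."""
    # Decode with the codec the bytes were actually written in, and do not
    # encode the result again: print encodes for the terminal on its own.
    return data.decode('utf-8')


raw = b'\xc3\xa5'              # the two UTF-8 bytes for å
text = utf8_to_unicode(raw)    # a single code point, U+00E5
```

Passing text straight to print then leaves the final encoding step to sys.stdout.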

    My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:

    >>> 'å'
    '\xc3\xa5'
    >>> 'å'.decode('latin1')
    u'\xc3\xa5'
    >>> print 'å'.decode('latin1')
    Ã¥
    

    You can see that the character uses two bytes; when saving your Python source with an editor configured to use UTF-8, Python reads the exact same bytes from disk to put into your bytestring.

    Decoding those two bytes as Latin-1 produces two Unicode code points, U+00C3 (Ã) and U+00A5 (¥), one per byte, because Latin-1 maps every byte to exactly one character.
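To make the difference concrete, here is a small sketch that decodes the same two bytes both ways and names the resulting code points. The helper describe and the variable names are illustrative only; unicodedata is part of the standard library:

```python
# -*- coding: utf-8 -*-
import unicodedata

raw = b'\xc3\xa5'                  # å as saved in a UTF-8 source file

wrong = raw.decode('latin-1')      # each byte becomes its own code point
right = raw.decode('utf-8')        # both bytes combine into one code point


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [('U+%04X' % ord(ch), unicodedata.name(ch)) for ch in text]


wrong_points = describe(wrong)     # two entries, one per original byte
right_points = describe(right)     # one entry

# What the function in the question hands to print: the two wrong code
# points re-encoded as UTF-8, four bytes that a UTF-8 terminal shows as Ã¥.
double_encoded = wrong.encode('utf-8')
```

The Latin-1 decode never fails, since every byte value is a valid Latin-1 character, which is why the mistake shows up as wrong output rather than as an exception.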

    You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python: