Tags: python-2.7, character-encoding, special-characters

Scandinavian letters (æøå) in Python 2.7


I am having a weird problem when using 'æ', 'ø' and 'å' in Python.

I have included # -*- coding: utf-8 -*- at the top of every file, and æøå prints fine, so no worries there. However, if I do len('æ') I get 2. I am making a program where I loop over and analyze Danish text, so this is a big problem. Below are some examples from the Python terminal to illustrate the problem:

In [1]: 'a'.islower()
Out[1]: True

In [2]: 'æ'.islower()
Out[2]: False

In [3]: len('a')
Out[3]: 1

In [4]: len('æ')
Out[4]: 2

In [5]: for c in 'æ': print c in "æøå"
True
True

In [6]: print "æøå are troublesome characters"
æøå are troublesome characters

I can get around the problem of islower() and isupper() not working for 'æ', 'ø' and 'å' by simply doing c.islower() or c in "æøå" to check if c is a lowercase letter, but as shown above, both parts of 'æ' will then count as lowercase letters, so the letter is counted twice.

Is there a way that I can make those letters act like any other letter?

I run Python 2.7 on Windows 10 using Canopy, as it's an easy way to get sklearn and numpy, which I need.


Solution

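  • You have stumbled across the fact that strings are byte strings by default in Python 2. With your header # -*- coding: utf-8 -*- you have only told the interpreter that your source code is UTF-8 encoded; this has no effect on how strings are handled. The literal 'æ' is therefore stored as the two bytes of its UTF-8 encoding, which is why len('æ') returns 2 and the loop in your example runs twice.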

    The solution to your problem is to convert all your strings to unicode objects with the decode method, e.g.:

    danish_text_raw = 'æ'  # here you would load your text
    print(type(danish_text_raw))  # prints <type 'str'> (a byte string)
    danish_text = danish_text_raw.decode('utf-8')
    print(type(danish_text))  # prints <type 'unicode'>
    

    This fixes the issues with islower and len: len(danish_text) is 1 and danish_text.islower() is True. Make sure that all the strings you use in your program are unicode objects and not byte strings, otherwise comparisons can lead to strange results. For example:

    danish_text_raw == danish_text  # False in Python 2, which also emits a UnicodeWarning
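
The difference between the byte string and the decoded text can be checked directly. The following is only a small sketch, not part of the original answer: it spells the letter with escape sequences and explicit b'' and u'' prefixes so that it behaves the same on Python 2.7 and Python 3, and the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: explicit b'' / u'' prefixes and escape sequences keep
# the example independent of the interpreter version and the file encoding.

raw = b'\xc3\xa6'            # the two UTF-8 bytes that encode 'æ' (U+00E6)
text = raw.decode('utf-8')   # a single code point, u'\xe6'

byte_length = len(raw)       # len() on a byte string counts bytes
char_length = len(text)      # len() on decoded text counts characters
is_lower = text.islower()    # the case check now uses the Unicode database
same_letter = (text == u'\xe6')

print('bytes: %d, characters: %d' % (byte_length, char_length))
print('islower: %s, equal to the unicode literal: %s' % (is_lower, same_letter))
```

For literals typed directly into the source file, a u prefix (for example u'æ' under the coding header) should give a unicode object without a separate decode call.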
    

    To make sure that you are working with unicode strings, you can use a helper function such as this one:

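    def to_unicode(in_string):
        """Return in_string as a unicode object, decoding UTF-8 bytes if needed."""
        if isinstance(in_string, str):
            # A Python 2 str holds raw bytes; decode them as UTF-8.
            return in_string.decode('utf-8')
        if isinstance(in_string, unicode):
            # Already unicode; nothing to do.
            return in_string
        raise TypeError('expected str or unicode, got %s' % type(in_string).__name__)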
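
If the Danish text comes from a file, one alternative to calling decode by hand is to let io.open do the decoding while reading. This is a sketch under assumptions rather than the approach from the answer above: the file name danish_sample.txt and the sample sentence are hypothetical and are written to a temporary directory only so the example is self-contained. It runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile

# Hypothetical sample text ("Æbler og pærer på øen"), spelled with escapes
# so the snippet does not depend on how this file itself is encoded.
sample = u'\xc6bler og p\xe6rer p\xe5 \xf8en'

# Write the sample to a temporary file as UTF-8 bytes.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'danish_sample.txt')
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(sample)

# io.open decodes while reading, so the result is already unicode text.
with io.open(path, 'r', encoding='utf-8') as handle:
    text = handle.read()

# Each letter is now one element, so it is counted exactly once.
lowercase_count = sum(1 for ch in text if ch.islower())
uppercase_count = sum(1 for ch in text if ch.isupper())

print('characters: %d' % len(text))
print('lowercase: %d, uppercase: %d' % (lowercase_count, uppercase_count))

shutil.rmtree(workdir)  # remove the temporary directory again
```

Decoding once at the point where text enters the program keeps byte strings out of the analysis code, so no extra check such as c in "æøå" is needed.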