python-2.7 character-encoding special-characters

Scandinavian letters (æøå) in python 2.7

I am having a weird problem when using 'æ', 'ø' and 'å' in Python.

I have included # -*- coding: utf-8 -*-
at the top of every file, and æøå prints fine, so no worries there. However, if I do len('æ') I get 2. I am writing a program that loops over and analyzes Danish text, so this is a big problem. Below are some examples from the Python terminal to illustrate the problem:

In [1]: 'a'.islower()
Out[1]: True

In [2]: 'æ'.islower()
Out[2]: False

In [3]: len('a')
Out[3]: 1

In [4]: len('æ')
Out[4]: 2

In [5]: for c in 'æ': print c in "æøå"
True
True

In [6]: print "æøå are troublesome characters"
æøå are troublesome characters

I can get around the problem of islower() and isupper() not working for 'æ', 'ø' and 'å' by checking c.islower() or c in "æøå" to decide whether c is a lower-case letter, but as shown above, both parts of 'æ' then count as lower case, so the letter is counted twice.

Is there a way that I can make those letters act like any other letter?

I run Python 2.7 on Windows 10 using Canopy, as it's an easy way to get sklearn and numpy, which I need.

Solution

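You have stumbled across the fact that string literals are byte strings by default in Python 2. With the header # -*- coding: utf-8 -*- you have only told the interpreter that your source file is encoded as UTF-8; it has no effect on how strings are handled at run time. The literal 'æ' is therefore stored as the two bytes of its UTF-8 encoding, which is why len('æ') is 2 and why your loop prints True twice.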
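To see the difference between bytes and characters directly, here is a minimal sketch. The names char and raw are only illustrative, and the byte string is built explicitly with encode so that the snippet behaves the same on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

char = u'æ'                 # unicode string: one character (U+00E6)
raw = char.encode('utf-8')  # the same letter as UTF-8 bytes, which is what
                            # a plain 'æ' literal holds in Python 2

print(len(char))        # 1
print(len(raw))         # 2
print(repr(raw))        # shows the two bytes \xc3 and \xa6
print(char.islower())   # True: unicode strings know that æ is a lower-case letter
```

Iterating over raw visits the two bytes one at a time, which matches the double True in the question.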

The solution is to convert all your strings to unicode objects with the decode method, e.g.:

danish_text_raw = 'æ'  # here you would load your text
print(type(danish_text_raw))  # prints <type 'str'>, i.e. a byte string
danish_text = danish_text_raw.decode('utf-8')
print(type(danish_text))  # prints <type 'unicode'>

The issues with islower and len should then be fixed. Make sure that all the strings in your program are unicode objects and not byte strings (str); otherwise comparisons can give strange results. For example:

danish_text_raw == danish_text  # False; Python 2 also emits a UnicodeWarning
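One way to avoid mixing the two types is to get unicode at the source, so nothing needs decoding later. The sketch below assumes you can use unicode_literals for your own literals and io.open for reading files; the temporary file and the sample word are made up for illustration:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import tempfile

vowels = 'æøå'  # a unicode string, thanks to unicode_literals

# Hypothetical sample file, written here only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'danish.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write('blåbærgrød')

# io.open with an encoding decodes while reading and returns unicode text.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

print(len(text))  # 10 characters, not 13 bytes
found = [c for c in text if c in vowels]
print(len(found))  # 3
```

With this approach each of æ, ø and å is seen exactly once when you loop over the text.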

To make sure that you are working with unicode strings, you can for example pass them through a function like this:

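def to_unicode(in_string):
    """Return in_string as a unicode object, decoding UTF-8 bytes if needed."""
    if isinstance(in_string, str):
        # Python 2 byte string: decode it, assuming it holds UTF-8 data
        out_string = in_string.decode('utf-8')
    elif isinstance(in_string, unicode):
        out_string = in_string
    else:
        raise TypeError('not stringy')
    return out_string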
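As a rough illustration of what this buys you for the original task, the following sketch counts lower-case letters in a Danish phrase once it has been decoded. count_lowercase is a hypothetical helper, not something from the question, and the bytes are produced with encode so the snippet also runs on Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def count_lowercase(text):
    """Count the lower-case letters in a unicode string."""
    return sum(1 for c in text if c.islower())


# Stand-in for text loaded as UTF-8 bytes from a file or another source.
raw = u'Rødgrød med fløde på Ærø'.encode('utf-8')
text = raw.decode('utf-8')

print(len(raw))               # 30 bytes
print(len(text))              # 24 characters
print(count_lowercase(text))  # 18: each of æ, ø, å is counted once
```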