I am having a strange problem when using 'æ', 'ø' and 'å' in Python.
I have included # -- coding: utf-8 -- at the top of every file, and æøå prints fine, so no worries there. However, if I do len('æ') I get 2. I am writing a program that loops over and analyzes Danish text, so this is a big problem.
Below are some examples from the Python terminal to illustrate the problem:
In [1]: 'a'.islower()
Out[1]: True
In [2]: 'æ'.islower()
Out[2]: False
In [3]: len('a')
Out[3]: 1
In [4]: len('æ')
Out[4]: 2
In [5]: for c in 'æ': print c in "æøå"
True
True
In [6]: print "æøå are troublesome characters"
æøå are troublesome characters
I can work around islower() and isupper() not working for 'æ', 'ø' and 'å' by checking c.islower() or c in "æøå" to decide whether c is a lowercase letter, but as shown above, both parts of 'æ' then count as lowercase, so the letter is counted twice.
Is there a way to make those letters behave like any other letter?
I run Python 2.7 on Windows 10 using Canopy, as it's an easy way to get sklearn and numpy, which I need.
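You have stumbled across the fact that plain string literals are byte strings by default in Python 2. The header # -- coding: utf-8 -- only tells the interpreter that your source code is encoded as UTF-8; it has no effect on how strings are handled at runtime. In UTF-8 each of 'æ', 'ø' and 'å' takes two bytes, which is why len('æ') gives 2 and why looping over the string yields two pieces.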
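To make the byte-versus-character distinction concrete, here is a minimal sketch. It uses an explicit u'' literal and encode, which are not part of the original question, so that the same lines behave identically on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustration only: the u'' prefix always produces a text (unicode) string.
text = u'æ'                  # one character, code point U+00E6
raw = text.encode('utf-8')   # the same character as UTF-8 encoded bytes

print(len(text))             # 1: one character
print(len(raw))              # 2: two bytes, 0xC3 0xA6
print(repr(raw))             # shows the two escaped bytes
```

The byte string is what a plain 'æ' literal holds in Python 2 when the source file is UTF-8, which matches the length of 2 seen in the question.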
The solution is to convert all your strings to unicode objects with the decode method, e.g.:
danish_text_raw = 'æ'  # here you would load your text
print(type(danish_text_raw))  # prints <type 'str'>
danish_text = danish_text_raw.decode('utf-8')
print(type(danish_text))  # prints <type 'unicode'>
This should fix the issues with islower and len. Make sure that all the strings in your program are unicode objects and not byte strings; otherwise comparisons can give surprising results. For example:
danish_text_raw == danish_text  # this yields False (Python 2 also emits a UnicodeWarning)
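As a quick check that len and islower behave once the text is decoded, here is a minimal sketch that counts lowercase letters, which is roughly what the question is trying to do. The sample bytes and the name lower_count are made up for illustration; the bytes are written out explicitly so the result does not depend on the source file encoding.

```python
# -*- coding: utf-8 -*-
# Illustration only: UTF-8 bytes for the text u'æøå ABC'.
raw = b'\xc3\xa6\xc3\xb8\xc3\xa5 ABC'
text = raw.decode('utf-8')   # now a unicode string

# Each Danish letter is now a single character, so it is counted once.
lower_count = sum(1 for ch in text if ch.islower())

print("%d bytes, %d characters, %d lowercase" % (len(raw), len(text), lower_count))
# prints: 10 bytes, 7 characters, 3 lowercase
```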
To make sure that you are working with unicode strings, you can use a function like this:
def to_unicode(in_string):
    if isinstance(in_string, str):
        out_string = in_string.decode('utf-8')
    elif isinstance(in_string, unicode):
        out_string = in_string
    else:
        raise TypeError('not stringy')
    return out_string
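If the text comes from a file, another option is to decode it while reading, so the rest of the program only ever sees unicode. The sketch below assumes the file is saved as UTF-8; read_text and the file name in the usage note are hypothetical and not from the original answer. io.open is in the standard library on Python 2.6+ and Python 3 and returns text when given an encoding.

```python
# -*- coding: utf-8 -*-
import io


def read_text(path, encoding='utf-8'):
    """Hypothetical helper: return the file's contents as a unicode string."""
    # io.open decodes the bytes on read, so no separate .decode() call is needed.
    with io.open(path, 'r', encoding=encoding) as handle:
        return handle.read()
```

For example, danish_text = read_text('danish.txt') would give a unicode string on which len and islower treat 'æ', 'ø' and 'å' as single letters.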