python unicode utf-8 isalpha

isalpha giving True for some Sinhala words


I'm trying to check whether a sentence contains only Sinhala words (they can be nonsense words, as long as they are written in Sinhala). Sometimes a sentence mixes English words with Sinhala words. The problem is that isalpha() returns True for some Sinhala words but not for others, which gives incorrect results in my classification.

For example, I ran something like this:

for i in ['මට', 'කෑම', 'කන්න', 'ඕන']:
  print(i.isalpha())

which gives:

True
False
False
True

Is there a way to overcome this?


Solution

  • str.isalpha() returns True only if every character in the string has one of the Unicode general categories Lm, Lt, Lu, Ll, or Lo. Their meanings:

    Ll    Lowercase Letter
    Lm    Modifier Letter
    Lo    Other Letter
    Lt    Titlecase Letter
    Lu    Uppercase Letter
    

    This "breaks" Python when characters are combined. In your first example, මට, both ම and ට have the category Lo (according to a Unicode character lookup). That is a letter category, so the result is True. In your second example, the first letter කෑ is actually two code points (ක and ෑ). The category of ෑ is Mc (Spacing Mark), which is not a letter category, so isalpha() returns False.
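To see this for yourself, here is a small sketch that prints the general category of every code point in the words from the question. It uses only the standard `unicodedata` module; the helper name `char_categories` is made up for illustration.

```python
import unicodedata

def char_categories(word):
    """Return the Unicode general category of each code point in word."""
    return [unicodedata.category(ch) for ch in word]

for word in ["මට", "කෑම", "කන්න", "ඕන"]:
    # Show the word, its per-code-point categories, and what isalpha() says.
    print(word, char_categories(word), word.isalpha())
```

Any word containing a vowel sign or the al-lakuna (virama) has at least one Mc or Mn code point, and that is enough to make isalpha() return False for the whole string.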

    Long story short, Python is technically right. To do what you intended, you would have to split the combined characters and then ignore the combining marks that were added on.

    So, it is complicated. There may be a library out there that does this, but I do not know of any.
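That said, a rough version can be sketched with the standard library alone. Treat the following as one possible approach rather than a complete solution. It assumes that a "Sinhala word" means: every code point lies in the Sinhala Unicode block (U+0D80 to U+0DFF) and is either a letter (Lo) or a combining mark (Mn or Mc), with at least one letter present. It also assumes you want to tolerate the zero-width joiner and non-joiner, which can appear in conjunct forms. The names `is_sinhala_word` and `sinhala_only` are hypothetical.

```python
import unicodedata

# Assumption: the Sinhala block is U+0D80..U+0DFF.
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF
# Assumption: zero-width joiner / non-joiner are allowed inside words.
JOINERS = ("\u200d", "\u200c")

def is_sinhala_word(word):
    """True if word has at least one Sinhala letter and otherwise only
    Sinhala letters, Sinhala combining marks, or joiners."""
    has_letter = False
    for ch in word:
        if ch in JOINERS:
            continue
        if not (SINHALA_START <= ord(ch) <= SINHALA_END):
            return False  # e.g. Latin letters, ASCII digits, punctuation
        category = unicodedata.category(ch)
        if category == "Lo":
            has_letter = True
        elif category not in ("Mn", "Mc"):
            return False  # Sinhala digits, punctuation, unassigned code points
    return has_letter

def sinhala_only(sentence):
    """True if the sentence has at least one word and every
    whitespace-separated word passes is_sinhala_word."""
    words = sentence.split()
    return bool(words) and all(is_sinhala_word(w) for w in words)

print([is_sinhala_word(w) for w in ["මට", "කෑම", "කන්න", "ඕන"]])
print(sinhala_only("මට කෑම කන්න ඕන"), sinhala_only("මට pizza ඕන"))
```

With this check all four words from the question are accepted, while a sentence containing an English word is rejected. Punctuation attached to a word (a trailing full stop, for instance) would also cause a rejection, so you may want to strip it first.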

    Cheers