python python-3.x unicode thai grapheme

In Python 3, count Thai character positions


First, I've already solved my problem with the Python 3 grapheme library. (For a bit more about grapheme, see this article.) But I'm surprised that Python 3 couldn't do this without a specialized library...


I resorted to grapheme because after many web searches and reading of StackOverflow questions, I couldn't get Python 3 to return the correct number of character positions in a sequence of Thai characters.

For example, here's a UTF-8 string of Thai characters:

thai_str = 'สีโชคดีเป็นสีชมพู สีโชคร้ายเป็นสีเหลืองและขาว'

I use the term character position to identify a single position in a line/string of Thai characters. That's because a character position may consist of a Thai consonant plus, in some cases, a vowel or tone marker above or below that consonant. The consonant plus the vowel or tone marker above/below occupies a single character position in the Unicode string. (Some Thai consonants may also have vowels to their left, right, or both. Those vowels occupy their own character position.)

For example, in the following sequence generated from the example string, items 2 and 7 are vowels, and item 10 is a tone marker. Each consumes its own bytes in the UTF-8 string but doesn't occupy its own character position. Items 3 and 8 are vowels that go to the left of a consonant and so do occupy character positions.

01: ส
02: ี
03: โ
04: ช
05: ค
06: ด
07: ี
08: เ
09: ป
10: ็
...
45: ว
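
As a rough sketch, a numbered listing like the one above can be produced by iterating over the string with enumerate. The name `listing` is only illustrative; iterating over a Python 3 str yields one code point at a time, which is why the vowels and marks show up as separate items.

```python
thai_str = 'สีโชคดีเป็นสีชมพู สีโชคร้ายเป็นสีเหลืองและขาว'

# One entry per code point, numbered from 1 and zero-padded to two digits.
listing = [f'{i:02d}: {ch}' for i, ch in enumerate(thai_str, start=1)]

for line in listing:
    print(line)
```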

When trying to determine the character positions in the example string, len(thai_str) returns 45, which isn't correct. The only way I've been able to get the correct number of character positions is to use grapheme.length(thai_str), which returns 35.

I've also used encode to get the following:

b'\xe0\xb8\xaa\xe0\xb8\xb5\xe0\xb9\x82\xe0\xb8\x8a\xe0\xb8\x84\xe0\xb8\x94...

(Counting the instances of xe0 that seem to precede every Thai character doesn't feel like the correct approach...)
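
To make the byte-counting idea concrete, here is a minimal sketch comparing code points, UTF-8 bytes, and occurrences of the 0xE0 lead byte. The variable names are illustrative. Every Thai code point in this string encodes to three bytes starting with 0xE0, while the space is a single byte, so counting 0xE0 counts Thai code points (marks included) rather than character positions.

```python
thai_str = 'สีโชคดีเป็นสีชมพู สีโชคร้ายเป็นสีเหลืองและขาว'

encoded = thai_str.encode('utf-8')

n_code_points = len(thai_str)         # code points in the str
n_bytes = len(encoded)                # bytes in the UTF-8 encoding
n_lead_e0 = encoded.count(b'\xe0')    # 3-byte sequences that start with 0xE0

print(n_code_points)  # 45
print(n_bytes)        # 133 (44 Thai code points * 3 bytes + 1 byte for the space)
print(n_lead_e0)      # 44 (the space has no 0xE0 byte; marks are still counted)
```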

So, is using a Python 3 library such as grapheme the only way to count character positions in my example string?


Solution

  • It's not the only way: you can implement a grapheme counter yourself, but it's complex, and you have to consult the specifications at https://unicode.org to get it right.

    thai_str is not a UTF-8 string, but a Unicode string containing Unicode code points. Code points fall into different general categories. Apart from the space (Zs), the sample text uses two categories, and they are the ones that matter for counting character positions:

    • Lo Other_Letter, other letters, including syllables and ideographs;
    • Mn Nonspacing_Mark, a nonspacing combining mark (zero advance width).

    If you skip the combining-mark code points when counting (category Mn in this text; the code below tests for any category starting with M), you can see approximately what the grapheme library is doing:

    import unicodedata as ud
    
    thai_str = 'สีโชคดีเป็นสีชมพู สีโชคร้ายเป็นสีเหลืองและขาว'
    
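    # Show each code point with its general category and Unicode name.
    for cp in thai_str:
        print(f'{cp}\t{ud.category(cp)}\t{ud.name(cp)}')
    
    # Count only code points that are not marks (Mn, Mc, Me).
    print(sum(1 for cp in thai_str if ud.category(cp)[0] != 'M'))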
    

    Output:

    ส   Lo  THAI CHARACTER SO SUA
    ี   Mn  THAI CHARACTER SARA II
    โ   Lo  THAI CHARACTER SARA O
    ช   Lo  THAI CHARACTER CHO CHANG
    ค   Lo  THAI CHARACTER KHO KHWAI
    ด   Lo  THAI CHARACTER DO DEK
    ี   Mn  THAI CHARACTER SARA II
    เ   Lo  THAI CHARACTER SARA E
    ป   Lo  THAI CHARACTER PO PLA
    ็   Mn  THAI CHARACTER MAITAIKHU
    ...
    ว   Lo  THAI CHARACTER WO WAEN
    35
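
The same idea can be sketched as a small helper that groups each base code point with the marks that follow it, so you get the positions themselves and not just their count. The function name `split_positions` is hypothetical, and this is only an approximation of grapheme clustering under the assumption that general category alone is enough: it happens to agree with the count above for this string, but the full Unicode segmentation rules cover cases it does not (for example, I believe THAI CHARACTER SARA AM is category Lo yet attaches to the preceding cluster under those rules).

```python
import unicodedata as ud


def split_positions(text):
    """Group each base code point with the combining marks that follow it.

    Approximation only: uses the general category (M*) rather than the
    full Unicode grapheme cluster rules.
    """
    positions = []
    for ch in text:
        if positions and ud.category(ch).startswith('M'):
            positions[-1] += ch      # attach the mark to the previous base
        else:
            positions.append(ch)     # start a new character position
    return positions


thai_str = 'สีโชคดีเป็นสีชมพู สีโชคร้ายเป็นสีเหลืองและขาว'
positions = split_positions(thai_str)

print(len(positions))   # 35
print(positions[:5])    # ['สี', 'โ', 'ช', 'ค', 'ดี']
```

Joining the pieces back together with ''.join(positions) returns the original string, so nothing is lost by the grouping.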