Tags: python, python-3.x, unicode, grapheme

Given a list of Unicode code points, how does one split them into a list of Unicode characters?


I'm writing a lexical analyzer for Unicode text. Many user-perceived characters (grapheme clusters) consist of multiple code points, even after canonical composition. For example, tuple(map(ord, unicodedata.normalize('NFC', 'ā́'))) evaluates to (257, 769). How can I tell where the boundary between two characters is? Additionally, I'd like to keep the unnormalized version of the text. My input is guaranteed to be valid Unicode.

So far, this is what I have:

from unicodedata import normalize

def split_into_characters(text):
    character = ""
    characters = []

    for code_point in text:
        character += code_point

        # Heuristic: as long as NFKC composition collapses the accumulated
        # code points into a single code point, treat them as one character;
        # once it no longer does, assume a boundary was crossed.
        if len(normalize('NFKC', character)) > 1:
            characters.append(character[:-1])
            character = character[-1]

    if character:
        characters.append(character)

    return characters

print(split_into_characters('Puélla in vī́llā vīcī́nā hábitat.'))

This incorrectly prints the following:

['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī', '́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī', '́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

I expect it to print the following:

['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', 'l', 'l', 'ā', ' ', 'v', 'ī', 'c', 'ī́', 'n', 'ā', ' ', 'h', 'á', 'b', 'i', 't', 'a', 't', '.']

Solution

  • The boundaries between perceived characters can be identified with Unicode's Grapheme Cluster Boundary algorithm. Python's unicodedata module doesn't expose the data the algorithm needs (the Grapheme_Cluster_Break property), but complete implementations can be found in libraries like PyICU and uniseg; a sketch using uniseg follows.
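
For illustration, here is a minimal sketch assuming uniseg's documented grapheme_clusters generator and its uniseg.graphemecluster import path; these calls are not from the answer above, so verify them against your installed version (pip install uniseg):

from uniseg.graphemecluster import grapheme_clusters

def split_into_characters(text):
    # Collect each grapheme cluster of the original, unnormalized text,
    # segmented with Unicode's Grapheme Cluster Boundary rules.
    return list(grapheme_clusters(text))

print(split_into_characters('Puélla in vī́llā vīcī́nā hábitat.'))
# ['P', 'u', 'é', 'l', 'l', 'a', ' ', 'i', 'n', ' ', 'v', 'ī́', ...]

PyICU provides the same segmentation through a character-instance break iterator (BreakIterator.createCharacterInstance); note that ICU reports boundaries as UTF-16 offsets, so slicing a Python str with them needs extra care for text containing code points outside the Basic Multilingual Plane.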