I want to obtain a list of all the unique characters in a text. The text includes characters made of a base letter plus a combining mark, such as s̈ and b̃, so when I split the text these characters are broken apart. For example, s̈ is separated into two characters, s and ¨.
This is an example of the text I want to process.
sentence = "nejon ámas̈hó T̃iqu c̈ab̃op"
print(sentence)
print(list(set(sentence)))
I want to obtain a list with the unique characters. For this sentence, this list should be
expected_list = ['a', 'á', 'b̃', 'c̈', 'e', 'h', 'i', 'j', 'm', 'n', 'o', 'ó', 'p', 'q', 's̈', 'T̃', 'u' ]
but it is
actual_list = ['j', 'p', 'c', 'n', 'a', ' ', 'i', 'á', 'o', 'T', 'u', '̃', 'h', '̈', 'q', 's', 'e', 'm', 'b', 'ó']
I was reading that I can normalize the special characters as follows
import unicodedata
# Only for the character s̈
print(ascii(unicodedata.normalize('NFC', '\u0073\u00a8'))) #prints 's\xa8'
But I don't know how to continue. Any help would be greatly appreciated.
Handling composed characters in Python can be a bit tricky because of how they are encoded. Try the grapheme library, which specifically deals with grapheme clusters (textual units that are displayed as a single character).
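To see why set(sentence) splits these characters, it can help to look at the code points involved. This is a minimal sketch using only the standard library. It assumes s̈ is stored as the letter s followed by U+0308 (COMBINING DIAERESIS), which is what the output in the question suggests:

```python
import unicodedata

# Assumption: s̈ is "s" followed by U+0308 COMBINING DIAERESIS.
s_diaeresis = "s\u0308"

# One visible character, but two code points.
code_point_names = [unicodedata.name(ch) for ch in s_diaeresis]
print(len(s_diaeresis), code_point_names)

# NFC cannot merge them: Unicode has no precomposed "s with diaeresis".
nfc_form = unicodedata.normalize("NFC", s_diaeresis)
print(len(nfc_form))

# By contrast, "a" + U+0301 COMBINING ACUTE ACCENT does compose to a single "á".
composed_a = unicodedata.normalize("NFC", "a\u0301")
print(len(composed_a))
```

So normalization alone is not enough here: the base letter and its mark have to be grouped into one cluster.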
Install the grapheme library using pip:
pip install grapheme
Or, as I prefer, run pip through the interpreter, to make sure the package is installed for the Python you are actually using:
python3 -m pip install grapheme
Then, you can use it to extract the unique grapheme clusters from the sentence:
import grapheme
sentence = "nejon ámas̈hó T̃iqu c̈ab̃op"
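# graphemes() yields every cluster in order, so use a set to drop duplicates (and the space)
unique_characters = sorted(set(grapheme.graphemes(sentence)) - {" "})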
print(unique_characters)
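If installing a package is not an option, a rough standard-library approximation may be enough for text like this. The sketch below is not the grapheme library's algorithm: split_clusters is a hypothetical helper that attaches every combining mark (Unicode category M*) to the character before it. It does not implement full Unicode grapheme segmentation, so emoji sequences, Hangul jamo and similar cases are not handled.

```python
import unicodedata

def split_clusters(text):
    """Hypothetical helper: group each base character with the marks that follow it."""
    clusters = []
    # NFC first, so that "a" + combining acute and precomposed "á" count as the same character.
    for ch in unicodedata.normalize("NFC", text):
        if unicodedata.category(ch).startswith("M") and clusters:
            clusters[-1] += ch      # combining mark: attach to the previous cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

# Same sentence as above, written with escapes so the code points are explicit (assumed encoding).
sentence = "nejon \u00e1mas\u0308h\u00f3 T\u0303iqu c\u0308ab\u0303op"

# Drop whitespace, deduplicate with a set, and sort by code point.
unique_clusters = sorted({c for c in split_clusters(sentence) if not c.isspace()})
print(unique_clusters)
```

Note that sorted() orders by code point, so T̃ comes before the lowercase letters and á, ó come last. For alphabetical order you would need a locale-aware sort key.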