I would like to split a string which contains accented characters into characters without breaking the accent and the letter apart.
A simple example is
>>> o = u"šnjiwgetit"
>>> print u" ".join(o)
s ̌ n j i w g e t i t
or
>>> print list(o)
[u's', u'\u030c', u'n', u'j', u'i', u'w', u'g', u'e', u't', u'i', u't']
Whereas I would like the result to be š n j i w g e t i t
so that the accent stays on top of the consonant.
The solution should work even with more difficult characters such as h̭ɛ̮ŋkkɐᴅ
You can use regex to group the characters. Here is a sample code for doing so:
import re
pattern = re.compile(r'(\w[\u02F3\u1D53\u0300\u2013\u032E\u208D\u203F\u0311\u0323\u035E\u031C\u02FC\u030C\u02F9\u0328\u032D:\u02F4\u032F\u0330\u035C\u0302\u0327\u03572\u0308\u0351\u0304\u02F2\u0352\u0355\u00B7\u032C\u030B\u2019\u0339\u00B4\u0301\u02F1\u0303\u0306\u030A7\u0325\u0307\u0354`\u02F0]+|\w|\W)', re.UNICODE | re.IGNORECASE)
In case you had some accents missing, add them the pattern.
Then, you can split words into characters as follows.
print(list(pattern.findall('šnjiwgetit')))
['š', 'n', 'j', 'i', 'w', 'g', 'e', 't', 'i', 't'
print(list(pattern.findall('h̭ɛ̮ŋkkɐᴅ')))
['h̭', 'ɛ̮', 'ŋ', 'k', 'k', 'ɐ', 'ᴅ']
If you are using Python2, add from __future__ import unicode_literals
at the beginning of the file.