Lets say I have the characters Ú, Ù, Ü. All of them are similar glyphically to the English U.
Is there some list or algorithm to do this:
I'm not sure if the code point of the Unicode characters is the same across all fonts? If it is, I suppose there could be some easy way and efficient to do this?
UPDATE
If you're using Ruby, there is a gem available unicode-confusable for this that may help in some cases.
This won't work for all conditions, but one way to get rid of most accents is to convert the characters to their decomposed form, then throw away the combining accents:
# coding: utf8
import unicodedata as ud
s=u'U, Ù, Ú, Û, Ü, Ũ, Ū, Ŭ, Ů, Ű, Ų, Ư, Ǔ, Ǖ, Ǘ, Ǚ, Ǜ, Ụ, Ủ, Ứ, Ừ, Ử, Ữ, Ự'
print ud.normalize('NFD',s).encode('ascii','ignore')
U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U
To find accent characters, use something like:
import unicodedata as ud
import string
def asc(unichr):
return ud.normalize('NFD',unichr).encode('ascii','ignore')
U = u''.join(unichr(i) for i in xrange(65536))
for c in string.letters:
print u''.join(u for u in U if asc(u) == c)
aàáâãäåāăąǎǟǡǻȁȃȧḁạảấầẩẫậắằẳẵặ
bḃḅḇ
cçćĉċčḉ
dďḋḍḏḑḓ
eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ
fḟ
:
etc.