unicode, glyph

Find characters that are similar glyphically in Unicode?


Let's say I have the characters Ú, Ù, Ü. All of them look similar to the English U.

Is there some list or algorithm to do this:

  • Given a Ú or Ù or Ü return the English U
  • Given an English U, return the list of all U-similar characters

I'm not sure whether the code points of Unicode characters are the same across all fonts. If they are, I suppose there could be an easy and efficient way to do this?

UPDATE

If you're using Ruby, the unicode-confusable gem may help in some cases.


Solution

  • This won't work in all cases, but one way to get rid of most accents is to convert the characters to their decomposed form (NFD), then throw away the combining accents:

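    # Python 3
    import unicodedata as ud
    s = 'U, Ù, Ú, Û, Ü, Ũ, Ū, Ŭ, Ů, Ű, Ų, Ư, Ǔ, Ǖ, Ǘ, Ǚ, Ǜ, Ụ, Ủ, Ứ, Ừ, Ử, Ữ, Ự'
    # NFD splits each letter into a base letter plus combining accents;
    # encoding to ASCII with errors='ignore' then drops the accents.
    print(ud.normalize('NFD', s).encode('ascii', 'ignore').decode('ascii'))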
    

    Output

    U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U
    

    To find the accented characters that reduce to each ASCII letter, use something like:

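    # Python 3
    import unicodedata as ud
    import string

    def asc(ch):
        # Decompose, then drop everything that is not ASCII (including the accents).
        return ud.normalize('NFD', ch).encode('ascii', 'ignore').decode('ascii')

    # Every code point in the Basic Multilingual Plane.
    U = ''.join(chr(i) for i in range(65536))
    for c in string.ascii_letters:
        print(''.join(u for u in U if asc(u) == c))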
    

    Output

    aàáâãäåāăąǎǟǡǻȁȃȧḁạảấầẩẫậắằẳẵặ
    bḃḅḇ
    cçćĉċčḉ
    dďḋḍḏḑḓ
    eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ
    fḟ
     :
    etc.
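
Both directions asked for in the question (accented character to base letter, and base letter to all its accented variants) can also be covered by building a lookup table in a single pass over the code points, instead of rescanning the whole range for every letter. The following is only a sketch under some assumptions: Python 3, the Basic Multilingual Plane only, and the helper names `base_letter` and `build_index` are made up for illustration. Unlike `encode('ascii', 'ignore')`, it strips only combining marks (checked with `unicodedata.combining`), so non-Latin letters are not silently dropped.

```python
import unicodedata
from collections import defaultdict


def base_letter(ch):
    """Return ch with its combining marks removed (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFD', ch)
    # unicodedata.combining() is non-zero for combining marks such as accents.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))


def build_index(limit=0x10000):
    """Map each base character to the characters that reduce to it."""
    index = defaultdict(list)
    for i in range(limit):
        ch = chr(i)
        base = base_letter(ch)
        # Keep only characters that changed and reduce to a single character.
        if base != ch and len(base) == 1:
            index[base].append(ch)
    return index


index = build_index()
print(base_letter('Ü'))       # accented character -> base letter
print(''.join(index['U']))    # base letter -> all variants found
```

Like the approach above, this only catches characters that Unicode defines as a base letter plus combining marks. Letters such as ø or ß have no such decomposition and come back unchanged, so they would need a separate table (for example the Unicode confusables data).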