Tags: python, python-3.x, unicode, unicode-normalization

get all unicode variations of a latin character


E.g., for the character "a", I want to get a string (list of chars) like "aàáâãäåāăą" (not sure if that example list is complete), i.e. basically all Unicode chars whose names match "Latin Small Letter A with *".

Is there a generic way to get this?

I'm asking about Python, but a more generic answer is also fine, although I would appreciate a Python code snippet in any case. Python >= 3.5 is fine. I guess you need access to the Unicode database, e.g. via the Python module unicodedata, which I would prefer over external data sources.

I could imagine some solution like this:

import unicodedata

def get_variations(char):
    name = unicodedata.name(char)
    chars = char
    for variation in ["WITH CEDILLA", "WITH MACRON", ...]:
        try:
            chars += unicodedata.lookup("%s %s" % (name, variation))
        except KeyError:
            pass
    return chars

Solution

  • To start, get a collection of the Unicode combining diacritical characters; they're contiguous, so this is pretty easy, e.g.:

    # Unicode combining diacritical marks run from 768 to 879, inclusive
    combining_chars = ''.join(map(chr, range(768, 880)))
    

    Now define a function that attempts to compose each one with a base ASCII character; when the composed normal form is length 1 (meaning the ASCII + combining became a single Unicode ordinal), save it:

    import unicodedata
    
    def get_unicode_variations(letter):
        if len(letter) != 1:
            raise ValueError("letter must be a single character to check for variations")
        variations = []
        # We could just loop over map(chr, range(768, 880)) without caching
        # in combining_chars, but that increases runtime ~20%
        for combiner in combining_chars:
            # NFKC composition merges the base letter and the combining mark into
            # a single precomposed character when one exists in Unicode
            normalized = unicodedata.normalize('NFKC', letter + combiner)
            if len(normalized) == 1:
                variations.append(normalized)
        return ''.join(variations)
    

    This has the advantage of not performing manual string lookups in the unicodedata DB, and of not needing to hardcode every possible description of the combining characters. Anything that composes to a single character gets included. Runtime for the check on my machine comes in under 50 µs, so if you're not doing this too often, the cost is reasonable (you could decorate the function with functools.lru_cache if you intend to call it repeatedly with the same arguments and want to avoid recomputing the result every time).
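
    For example, a minimal caching sketch (the get_unicode_variations_cached name here is just for illustration):

    import functools

    # Wrap the function above so repeated lookups for the same letter are served from a cache
    get_unicode_variations_cached = functools.lru_cache(maxsize=None)(get_unicode_variations)

    print(get_unicode_variations_cached('a'))  # e.g. 'àáâãäå...' (exact contents depend on your Unicode version)
    print(get_unicode_variations_cached('a'))  # second call hits the cache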

    If you want to get everything built out of one of these characters, a more exhaustive search can find it, but it'll take longer (functools.lru_cache would be nigh mandatory unless it's only ever called once per argument):

    import functools
    import sys
    import unicodedata
    
    @functools.lru_cache(maxsize=None)
    def get_unicode_variations_exhaustive(letter):
        if len(letter) != 1:
            raise ValueError("letter must be a single character to check for variations")
        variations = []
        # sys.maxunicode is the highest code point, so add 1 to include it in the scan
        for testlet in map(chr, range(sys.maxunicode + 1)):
            # keep any character whose compatibility decomposition contains the target letter
            if letter in unicodedata.normalize('NFKD', testlet) and testlet != letter:
                variations.append(testlet)
        return ''.join(variations)
    

    This looks for any character that decomposes into a form that includes the target letter; it does mean the first search for a given letter takes roughly a third of a second, and the result includes stuff that isn't really just a modified version of the character (e.g. 'L''s result will include digraphs like 'Ǉ', which isn't really a modified 'L'), but it's as exhaustive as you can get.
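
    If you want to trim those extras, one rough option is to filter the exhaustive result by Unicode character name, mirroring the name-based idea from the question (just a sketch; filter_latin_letter_variations is a made-up helper name):

    import unicodedata

    def filter_latin_letter_variations(base, candidates):
        # Keep only candidates whose Unicode name marks them as the base letter "WITH" something,
        # e.g. 'LATIN SMALL LETTER A WITH GRAVE' for 'à'
        base_name = unicodedata.name(base)
        kept = [ch for ch in candidates if unicodedata.name(ch, '').startswith(base_name + ' WITH')]
        return ''.join(kept)

    # e.g. filter_latin_letter_variations('a', get_unicode_variations_exhaustive('a'))
    # drops digraphs, superscript/squared forms, etc., keeping only "LATIN SMALL LETTER A WITH ..." characters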