
How can I normalize fonts?


Users sometimes enter unusual non-ASCII characters into a program, and I was wondering if there is a way to "normalize" them.

So basically, if the input is ᴀʙᴄᴅᴇꜰɢ, the output would be ABCDEFG. Is there an existing dictionary somewhere that does this? If not, is there a better method than calling something like str.replace("ᴀ", "A") for every one of the different "fonts"?

This isn't a language-specific question -- if nothing like this exists, then I guess the next step is to create a dictionary myself.


Solution

  • Yes.

    BTW, the technical terms are: Latin capital letters, from the C0 Controls and Basic Latin block, and Latin letter small capitals, most of which are in the Phonetic Extensions block (ʙ and ɢ are in IPA Extensions, and ꜰ is in Latin Extended-D).

    Anyway, the general topic for your question is Unicode confusables. The link is for a mapping. Unicode.org has more material on confusables and everything else Unicode.

    (Normalization is always something to consider when processing Unicode text, but it doesn't particularly relate to this issue.)
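
If you do end up building the dictionary yourself, one option is to derive it from the Unicode character names rather than typing it out by hand. The Python sketch below is only an illustration and is not taken from the confusables data. The function names `fold_small_capitals` and `normalize_fonts` are made up for this example, and it assumes that only characters named exactly "LATIN LETTER SMALL CAPITAL" plus a single letter should be mapped. It also runs NFKC first, which folds "fonts" that have compatibility decompositions (fullwidth or mathematical letters, for instance) but leaves small capitals untouched, consistent with the note above.

```python
import unicodedata

# Assumption for this sketch: only names of the form
# "LATIN LETTER SMALL CAPITAL <single letter>" are mapped.
PREFIX = "LATIN LETTER SMALL CAPITAL "


def small_capital_to_ascii(ch: str) -> str:
    """Return the ASCII capital for a small-capital letter, else ch unchanged."""
    name = unicodedata.name(ch, "")  # "" for characters without a name
    if name.startswith(PREFIX):
        rest = name[len(PREFIX):]
        # Skip names such as "... SMALL CAPITAL AE" or "... BARRED B".
        if len(rest) == 1 and "A" <= rest <= "Z":
            return rest
    return ch


def fold_small_capitals(text: str) -> str:
    """Replace every small-capital letter in text with its ASCII capital."""
    return "".join(small_capital_to_ascii(ch) for ch in text)


def normalize_fonts(text: str) -> str:
    """Hypothetical helper: NFKC first, then the small-capital fold."""
    return fold_small_capitals(unicodedata.normalize("NFKC", text))


# The escapes spell out the small capitals A to G from the question.
sample = "\u1d00\u0299\u1d04\u1d05\u1d07\ua730\u0262"
print(normalize_fonts(sample))  # ABCDEFG
```

Be aware that NFKC changes more than letter styles (it also expands ligatures, for example), so drop that step if you only want the small-capital mapping.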