unicode xterm monospace

full list of all subscripts and diacritical marks in unicode


Answered: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt is a list of all Unicode characters, and 0xcc99 # U+0319 COMBINING RIGHT TACK BELOW is somewhat like a comma for a monospaced font (example: 10̡9̡8̡7̡6̡5̡4̡3̡2̡1̡0̡ ).
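
To make the byte notation above concrete, here is a minimal Python sketch. It assumes that 0xcc99 denotes the UTF-8 encoding of U+0319; the helper name mark_digits is hypothetical and only illustrates attaching a combining mark to each digit without adding a character cell.

```python
import unicodedata

# U+0319 COMBINING RIGHT TACK BELOW (assumed to be what "0xcc99" refers to,
# since CC 99 is its UTF-8 byte sequence).
MARK = "\u0319"

def mark_digits(text, mark=MARK):
    """Append a combining mark after every ASCII digit in text (hypothetical helper)."""
    return "".join(ch + mark if ch in "0123456789" else ch for ch in text)

if __name__ == "__main__":
    print(unicodedata.name(MARK))       # official Unicode name
    print(MARK.encode("utf-8").hex())   # UTF-8 bytes as hex
    print(mark_digits("10 9 8"))        # rendering depends on font and terminal
```

Whether the result actually looks like a comma or point depends entirely on the terminal and font.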

Is there a complete list of all Unicode characters along with their verbal descriptions, e.g. a list of lines like ... 0xcc99 # U+0319 COMBINING RIGHT TACK BELOW ..?

Particularly, what diacritical mark do I use to type 1. or 2o3 ? The motivation is that I want to be able to add a point or comma in a monospace font in a terminal, without actually adding a character.


Solution

  • There is no complete list of all Unicode characters along with their verbal descriptions, not even a list of them with their Unicode names. The UnicodeData.txt file refers to large ranges of characters generically, e.g.

    4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
    9FCB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
    

    and

    AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
    D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
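
    As a rough illustration of how such generic range records could be detected, the following Python sketch pairs each "First" record with its "Last" record. The function name parse_ranges is hypothetical, and the sample input is just the four lines quoted above, not the full file.

```python
SAMPLE = """\
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""

def parse_ranges(lines):
    """Return (label, first_code_point, last_code_point) for each First/Last pair."""
    ranges = []
    pending = None  # (label, first code point) waiting for its "Last" record
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        code, name = int(fields[0], 16), fields[1]
        if name.startswith("<") and name.endswith(", First>"):
            pending = (name[1:-len(", First>")], code)
        elif name.startswith("<") and name.endswith(", Last>") and pending:
            ranges.append((pending[0], pending[1], code))
            pending = None
    return ranges

if __name__ == "__main__":
    for label, first, last in parse_ranges(SAMPLE.splitlines()):
        print(f"{label}: U+{first:04X}..U+{last:04X} ({last - first + 1} code points)")
```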
    

    It would be possible to construct a complete list with Unicode names, but what would be the purpose? The Unicode names, such as COMBINING PALATALIZED HOOK BELOW, are identifiers, not descriptions. Taken as English texts, some of them are intuitively descriptive, some are very vague, some are obscure, and some are outright wrong—and will never be changed, due to the stability principle. The principle is largely necessitated by the use of Unicode names in programs; they must not be changed, for the same reasons why the Unicode numbers must not be changed.
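
    For the narrower case of diacritics, a list in the format asked for in the question can be sketched with Python's standard unicodedata module. This is only an illustration: the function name combining_marks is hypothetical, the default range U+0300..U+036F (the Combining Diacritical Marks block) is an assumption about which characters are of interest, and the names come from whatever Unicode version the Python build ships with.

```python
import unicodedata

def combining_marks(start=0x0300, end=0x036F):
    """List named mark characters in a range as 'UTF-8 hex # U+XXXX NAME' lines."""
    lines = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, None)  # None for unnamed or unassigned code points
        if name is not None and unicodedata.category(ch).startswith("M"):
            utf8_hex = ch.encode("utf-8").hex()
            lines.append(f"0x{utf8_hex} # U+{cp:04X} {name}")
    return lines

if __name__ == "__main__":
    for line in combining_marks():
        print(line)
```

    The output gives identifiers only; as noted above, the names do not reliably describe the shapes.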

    Some of the Unicode names for diacritics, too, are misleading or at least incomplete. The shape of a diacritic cannot be inferred from the Unicode name alone, and the shape may even vary a lot (e.g., t with caron is ť in lowercase, with the diacritic looking like a comma, whereas the corresponding uppercase letter Ť has... well, a caron-like caron).

    Using characters like U+0319 and U+0321 in your text data implies that you will need a relatively extensive font and relatively advanced rendering software that displays combining diacritic marks well. Moreover, if you intend to use them in meanings and contexts they were not intended for (they are meant for use in phonetic notations, where they are associated with letters to indicate features of pronunciation), you may need poor software that implements them improperly (considering the intended use and rendering). For example, U+0319 is supposed to appear below a letter