
Different results using codepoint() with input arguments with \dot


I am trying to detect whether a symbol in Julia contains the \dot accent. Here is what I have tried.

The following two blocks return different results:

julia> [codepoint(i) for i in string(:ẋ)]
1-element Vector{UInt32}:
 0x00001e8b
julia> [codepoint(i) for i in "ẋ"]
2-element Vector{UInt32}:
 0x00000078
 0x00000307

Ideally I would start from a symbol, not a string, so I need to use the first method. But that does not return 0x307, the Unicode code point of \dot, which makes it hard to detect \dot.

So what is the mechanism behind the difference? Thank you.


Solution

  • Both results are equivalent.

    Human languages are complex, so Unicode needs complex rules.

    In your case you have two representations of the same character:

    • U+1E8B (LATIN SMALL LETTER X WITH DOT ABOVE)
    • U+0078 (LATIN SMALL LETTER X) + U+0307 (COMBINING DOT ABOVE)

    Unicode considers both canonically equivalent. Note: when comparing strings, it is good practice to normalize them first. Unfortunately there are two main normalization forms:

    • NFD: Normalization Form Canonical Decomposition, which is the second case. Whenever possible, characters are decomposed into base + modifier. This normalization is preferred by Apple, and it was the original idea in Unicode.
    • NFC: Normalization Form Canonical Composition. If there is a way to combine characters, it is done. There are rules on how to do this when there are several modifiers (i.e. which one takes precedence). This form is preferred by most other operating systems.
    • and the K versions, NFKD and NFKC (Compatibility instead of Canonical), which are trickier: there are various reasons for compatibility mappings. They are usually not used for display but for searching text.

    See https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
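
The forms listed above can be tried out directly in Julia. This is only a minimal sketch using Unicode.normalize from the Unicode standard library; the variable names are made up for illustration, and the ligature example for the K form is my own choice, not something from the question.

```julia
using Unicode

composed   = "\u1e8b"   # LATIN SMALL LETTER X WITH DOT ABOVE, one code point
decomposed = "x\u0307"  # LATIN SMALL LETTER X + COMBINING DOT ABOVE, two code points

# The raw strings differ, because == compares code point sequences.
println(composed == decomposed)                          # false
println(length(composed), " vs ", length(decomposed))    # 1 vs 2

# After normalizing both sides to the same form they compare equal.
println(Unicode.normalize(composed, :NFD) == decomposed) # true
println(Unicode.normalize(decomposed, :NFC) == composed) # true

# A compatibility (K) form also folds characters such as the "fi" ligature.
println(Unicode.normalize("\ufb01", :NFKC) == "fi")      # true
```

Which form you pick matters less than applying the same one to both sides before comparing.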

    The display engine (layout engine, text shaping, glyph display, font metadata) will probably render the same glyph for both (each font has its own preference for which normalization it expects, but the engine will then try to find a combined glyph).

    I think in your case you may have two different variants in the text file: one using two characters, and one using a single character. This happens often when copying characters (some editors prefer one normalization over the other).

    So I think you should normalize the string before inspecting its code points; see e.g. Unicode.normalize in https://docs.julialang.org/en/v1/stdlib/Unicode/
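
For the original goal of detecting \dot starting from a symbol, one possible approach is to decompose the symbol's name to NFD and then look for U+0307. The helper name has_dot_above below is hypothetical, and this is a sketch rather than a definitive solution. As far as I know, Julia's parser normalizes identifiers to NFC, which may be why string(:ẋ) comes back as the single code point U+1E8B even when it was typed as x\dot; the sketch builds the symbols with Symbol(...) so that both variants can be tested explicitly.

```julia
using Unicode

# U+0307 COMBINING DOT ABOVE, the character produced by \dot
const COMBINING_DOT_ABOVE = '\u0307'

# Hypothetical helper: true if the symbol's name contains a combining dot
# above, whether it is stored precomposed (NFC) or decomposed (NFD).
function has_dot_above(sym::Symbol)
    decomposed = Unicode.normalize(string(sym), :NFD)
    return COMBINING_DOT_ABOVE in decomposed
end

println(has_dot_above(Symbol("\u1e8b")))   # true  (precomposed ẋ)
println(has_dot_above(Symbol("x\u0307")))  # true  (x + combining dot)
println(has_dot_above(:x))                 # false (plain x)
```

Decomposing first means the check does not depend on which representation the editor or the parser happened to produce.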

    And this is only Latin script, so we are in the easy part of Unicode (apart from Latin being one of the few scripts with upper case and lower case).