
Codepoint mismatch between Java and C


I'm having trouble with the char "–" (an en dash) in a port of imgui to Kotlin.

After spending a whole day digging into charsets and encodings, I'm down to my last hope: relying on the Unicode code points.

On the JVM, that char

"–"[0].toInt() // same as codePointAt(0) for a BMP character

returns code point U+2013.
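As a quick cross-check of the JVM side, here is a minimal sketch in Java (the class and method names are made up for illustration). It assumes nothing beyond `String.codePointAt`, and writes the characters as Unicode escapes so the encoding of the source file cannot interfere.

```java
class CodePointDemo {
    /** Returns the Unicode code point that starts the given string. */
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        String enDash = "\u2013"; // EN DASH, the character from the question

        // For a BMP character the UTF-16 code unit and the code point coincide.
        System.out.printf("U+%04X%n", firstCodePoint(enDash));       // U+2013
        System.out.printf("U+%04X%n", (int) enDash.charAt(0));       // U+2013

        // Outside the BMP they differ: charAt(0) is only the high surrogate.
        String emoji = "\uD83D\uDE00";
        System.out.printf("U+%04X%n", firstCodePoint(emoji));        // U+1F600
        System.out.printf("U+%04X%n", (int) emoji.charAt(0));        // U+D83D
    }
}
```

So for the en dash, reading the char value and reading the code point both give U+2013; the two only diverge for characters that need a surrogate pair.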

On the C side I'm not sure, but this is how the glyph lookup is done:

const ImFontGlyph* ImFont::FindGlyph(ImWchar c) const
{
    if (c >= IndexLookup.Size)
        return FallbackGlyph;
    const ImWchar i = IndexLookup.Data[c];
    if (i == (ImWchar)-1)
        return FallbackGlyph;
    return &Glyphs.Data[i];
}

Where

typedef unsigned short ImWchar;

and

ImVector<ImWchar> IndexLookup; // Sparse. Index glyphs by Unicode code-point.

So, doing this

const char* a = "–";
int b = (unsigned char)a[0]; // cast so a signed char does not turn the value negative

gives 0x96 (150).
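A `char*` literal holds bytes in whatever encoding the compiler used, not code points, so `a[0]` is just the first byte. The sketch below (in Java for convenience; the class and helper names are hypothetical) prints the bytes the en dash turns into under two encodings. It assumes the runtime ships the `windows-1252` charset, which mainstream JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DashBytes {
    /** Encodes s with the given charset and returns the bytes as unsigned values. */
    static int[] unsignedBytes(String s, Charset cs) {
        byte[] raw = s.getBytes(cs);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // Java bytes are signed, so mask to 0..255
        }
        return out;
    }

    public static void main(String[] args) {
        String enDash = "\u2013";
        // Single byte 0x96 in Windows-1252
        System.out.println(Arrays.toString(
                unsignedBytes(enDash, Charset.forName("windows-1252")))); // [150]
        // Three bytes E2 80 93 in UTF-8
        System.out.println(Arrays.toString(
                unsignedBytes(enDash, StandardCharsets.UTF_8)));          // [226, 128, 147]
    }
}
```

If the C compiler had stored the literal as UTF-8, `a[0]` would be 0xE2 rather than 0x96, so seeing 0x96 is consistent with a Windows-1252 execution charset.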

As far as I have read, anything above 127 (0x7F) is "extended ASCII" territory, which is bad, because there appear to be several different versions and interpretations of it.

For example, this encoding table doesn't match my code point, but the Cp1252 encoding does, so I'm inclined to think that this is what is actually being used on the C side.

In the table at the bottom of the link just mentioned, you can see that position 150 (decimal; count from the right-hand column with the given number) does indeed correspond to 2013 (hex; mixing the two bases is a little inconsistent, but anyway).

To solve this, I tried converting my Strings in Kotlin to the same encoding (ignoring for the moment that this is of course platform-specific), so for every c: Char

"$c".toByteArray(Charset.forName("Cp1252"))[0].toUnsignedInt

This works for that char, but it breaks rendering for fonts of other scripts, such as Chinese, Japanese, etc.
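The breakage can be reproduced in isolation. Windows-1252 is a single-byte charset, and `String.getBytes(Charset)` replaces any character it cannot map with the charset's replacement byte, which for this charset is `?` (0x3F). A minimal Java sketch, with a hypothetical helper name and assuming the `windows-1252` charset is available:

```java
import java.nio.charset.Charset;

class Cp1252Lossy {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** First byte of c encoded as Windows-1252, as an unsigned value. */
    static int firstByte(char c) {
        return String.valueOf(c).getBytes(CP1252)[0] & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(firstByte('\u2013')); // en dash: 150
        System.out.println(firstByte('\u4E2D')); // a Chinese character: 63, i.e. '?'
        System.out.println(firstByte('\u3042')); // a Japanese hiragana: 63, i.e. '?'
    }
}
```

Every character outside Windows-1252 collapses to the same value, so the glyph lookup ends up on `?` for all of them.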

So, my question is: why do I get U+2013 on the JVM but 0x96 in C?

What is the right way to deal with this?


Solution

  • For the moment I solved it like this on Windows: I call the following function before looking up the char's code point. It remaps every char that Windows-1252 encodes differently from ISO-8859-1. You can see them in this table; they are the ones with the light grey border.

    internal fun Char.remapCodepointIfProblematic(): Int {
        val i = toInt()
        return when (Platform.get()) {
            /*  https://en.wikipedia.org/wiki/Windows-1252#Character_set
             *  manually remap the difference from  ISO-8859-1 */
            Platform.WINDOWS -> when (i) {
                // 8_128
                0x20AC -> 128 // €
                0x201A -> 130 // ‚
                0x0192 -> 131 // ƒ
                0x201E -> 132 // „
                0x2026 -> 133 // …
                0x2020 -> 134 // †
                0x2021 -> 135 // ‡
                0x02C6 -> 136 // ˆ
                0x2030 -> 137 // ‰
                0x0160 -> 138 // Š
                0x2039 -> 139 // ‹
                0x0152 -> 140 // Œ
                0x017D -> 142 // Ž
                // 9_144
                0x2018 -> 145 // ‘
                0x2019 -> 146 // ’
                0x201C -> 147 // “
                0x201D -> 148 // ”
                0x2022 -> 149 // •
                0x2013 -> 150 // –
                0x2014 -> 151 // —
                0x02DC -> 152 // ˜
                0x2122 -> 153 // ™
                0x0161 -> 154 // š
                0x203A -> 155 // ›
                0x0153 -> 156 // œ
                0x017E -> 158 // ž
                0x0178 -> 159 // Ÿ
                else -> i
            }
            else -> i // TODO
        }
    }
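The same remapping can probably be derived from the charset itself instead of a hand-written table. The following is only a sketch of that alternative, written in Java with hypothetical names, not the code used in the port. It assumes the runtime provides the `windows-1252` charset, and it leaves every char up to 0xFF untouched, as the table above does.

```java
import java.nio.charset.Charset;

class Cp1252Remap {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Returns the Windows-1252 byte value for chars above 0xFF that the
     * charset can encode, and the unchanged char value for everything else.
     */
    static int remap(char c) {
        if (c <= 0xFF) {
            return c; // Latin-1 range: passed through unchanged
        }
        if (!CP1252.newEncoder().canEncode(c)) {
            return c; // not in Windows-1252 (e.g. CJK): keep the code point
        }
        return String.valueOf(c).getBytes(CP1252)[0] & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(remap('\u2013')); // en dash: 150
        System.out.println(remap('\u20AC')); // euro sign: 128
        System.out.println(remap('\u4E2D')); // Chinese character: 20013 (0x4E2D, unchanged)
    }
}
```

Because unencodable chars keep their code point, this variant should not collapse Chinese or Japanese text the way converting every char to Cp1252 does.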