So, I'm having some problems with the character – (an en dash) in a port of imgui to Kotlin.
After spending the whole day digging into charsets and encodings, I'm down to my last hope: relying on the Unicode code points.
That char on the JVM
"–"[0].toInt() // same as codePointAt(0) for this character
returns codepoint u2013
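As a minimal sketch of what that value means: a JVM `Char` is a UTF-16 code unit, so its numeric value equals the code point only for characters in the Basic Multilingual Plane, which includes the en dash. `firstCodeUnit` is a hypothetical helper name, and `.code` is the newer Kotlin (1.5+) spelling of `toInt()` on a `Char`.

```kotlin
// Hypothetical helper: numeric value of the first UTF-16 code unit of a string.
fun firstCodeUnit(s: String): Int = s[0].code

fun main() {
    // EN DASH written as an escape so the source file encoding does not matter.
    val dash = "\u2013"
    println(Integer.toHexString(firstCodeUnit(dash)))  // 2013
    println(Integer.toHexString(dash.codePointAt(0)))  // 2013

    // Outside the BMP the two values differ: U+1F600 is stored as a surrogate pair.
    val emoji = "\uD83D\uDE00"
    println(Integer.toHexString(firstCodeUnit(emoji))) // d83d (high surrogate)
    println(Integer.toHexString(emoji.codePointAt(0))) // 1f600
}
```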
On the C++ side I'm not sure, but this is what imgui does:
const ImFontGlyph* ImFont::FindGlyph(ImWchar c) const
{
    if (c >= IndexLookup.Size)
        return FallbackGlyph;
    const ImWchar i = IndexLookup.Data[c];
    if (i == (ImWchar)-1)
        return FallbackGlyph;
    return &Glyphs.Data[i];
}
Where
typedef unsigned short ImWchar;
and
ImVector<ImWchar> IndexLookup; // Sparse. Index glyphs by Unicode code-point.
So, doing this
const char* a = "–";
int b = (unsigned char)a[0]; // cast so the byte is not sign-extended where char is signed
returns codepoint u0096
As far as I have read, it looks like above 127 (0x7F) we are in "Extended ASCII" territory, which is bad, because there appear to be different versions/interpretations of it.
For example, this encoding table doesn't match my codepoint, but the Cp1252 encoding does, so I'm inclined to think that this is what is actually being used on C.
In the table at the bottom of the link just mentioned, you can see that 150 (decimal; count from the right column with the given number) does indeed correspond to 2013 (hex; I find mixing the two bases a little inconsistent, but anyway).
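The correspondence between 150 and 2013 can be reproduced on the JVM. The sketch below encodes the en dash with a few charsets and prints the resulting bytes; `hexBytes` is a hypothetical helper, and it assumes the JVM in use ships the `windows-1252` charset (common, but not one of the charsets the Java specification guarantees). Whether the C++ side really stores the literal as Cp1252 depends on the source file encoding and the compiler's execution character set, which this does not verify.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: encode a string and render the bytes as unsigned hex.
fun hexBytes(s: String, charsetName: String): String =
    s.toByteArray(Charset.forName(charsetName))
        .joinToString(" ") { "%02X".format(it.toInt() and 0xFF) }

fun main() {
    val dash = "\u2013" // EN DASH
    println(hexBytes(dash, "windows-1252")) // 96 (150 decimal)
    println(hexBytes(dash, "UTF-8"))        // E2 80 93
    println(hexBytes(dash, "ISO-8859-1"))   // 3F ('?', the dash is not representable)
}
```

So 0x96 is a byte value in one particular 8-bit encoding, while 0x2013 is the Unicode code point of the same character.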
To solve this, I tried converting my Strings in Kotlin to the same encoding (ignoring for the moment that this is of course platform-specific), so for every c: Char
"$c".toByteArray(Charset.forName("Cp1252"))[0].toUnsignedInt
This works, but it breaks rendering for fonts of other scripts, such as Chinese, Japanese, etc.
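A likely reason for the breakage, sketched under the same assumption that `windows-1252` is available: characters the charset cannot represent are replaced with `?` (0x3F) when encoding, so every such character would end up looking up the same glyph index. `cp1252Byte` is a hypothetical helper that mirrors the expression above.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper mirroring the conversion above: first Cp1252 byte, unsigned.
fun cp1252Byte(c: Char): Int =
    "$c".toByteArray(Charset.forName("windows-1252"))[0].toInt() and 0xFF

fun main() {
    val encoder = Charset.forName("windows-1252").newEncoder()
    // En dash, a CJK ideograph, and a hiragana letter.
    for (c in listOf('\u2013', '\u4E2D', '\u3042')) {
        println("U+%04X -> 0x%02X (encodable: %b)".format(c.code, cp1252Byte(c), encoder.canEncode(c)))
    }
    // U+2013 -> 0x96 (encodable: true)
    // U+4E2D -> 0x3F (encodable: false)
    // U+3042 -> 0x3F (encodable: false)
}
```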
So, my question is: why the difference between u2013 on the JVM and u0096 in C?
Which is the right way to deal with this?
At the moment I have solved it like this on Windows: I call the following function before retrieving the char codepoint. It basically remaps all the chars that differ from ISO-8859-1. You can see them in this table; they are the ones with the light grey border.
internal fun Char.remapCodepointIfProblematic(): Int {
    val i = toInt()
    return when (Platform.get()) {
        /*  https://en.wikipedia.org/wiki/Windows-1252#Character_set
         *  manually remap the difference from ISO-8859-1 */
        Platform.WINDOWS -> when (i) {
            // 8_128
            0x20AC -> 128 // €
            0x201A -> 130 // ‚
            0x0192 -> 131 // ƒ
            0x201E -> 132 // „
            0x2026 -> 133 // …
            0x2020 -> 134 // †
            0x2021 -> 135 // ‡
            0x02C6 -> 136 // ˆ
            0x2030 -> 137 // ‰
            0x0160 -> 138 // Š
            0x2039 -> 139 // ‹
            0x0152 -> 140 // Œ
            0x017D -> 142 // Ž
            // 9_144
            0x2018 -> 145 // ‘
            0x2019 -> 146 // ’
            0x201C -> 147 // “
            0x201D -> 148 // ”
            0x2022 -> 149 // •
            0x2013 -> 150 // –
            0x2014 -> 151 // —
            0x02DC -> 152 // ˜
            0x2122 -> 153 // ™
            0x0161 -> 154 // š
            0x203A -> 155 // ›
            0x0153 -> 156 // œ
            0x017E -> 158 // ž
            0x0178 -> 159 // Ÿ
            else -> i
        }
        else -> i // TODO
    }
}
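For comparison, the same table could be derived from the charset instead of being typed by hand. This is only a sketch of that idea, not a claim that remapping is the right fix: `buildCp1252Remap` and `remapWith` are hypothetical names, and the code assumes the JVM provides the `windows-1252` charset.

```kotlin
import java.nio.charset.Charset

// Hypothetical alternative to the hand-written table: decode each byte in
// 0x80..0x9F with windows-1252 and record "code point -> byte value".
fun buildCp1252Remap(): Map<Int, Int> {
    val cp1252 = Charset.forName("windows-1252")
    val remap = HashMap<Int, Int>()
    for (b in 0x80..0x9F) {
        val cp = String(byteArrayOf(b.toByte()), cp1252)[0].code
        // Skip bytes with no assigned character: depending on the decoder they
        // come back as U+FFFD or as the identical C1 control code.
        if (cp != 0xFFFD && cp != b) remap[cp] = b
    }
    return remap
}

// Hypothetical lookup: remapped value if present, otherwise the code point itself.
fun remapWith(table: Map<Int, Int>, c: Char): Int = table[c.code] ?: c.code

fun main() {
    val table = buildCp1252Remap()
    println(remapWith(table, '\u2013')) // 150
    println(remapWith(table, '\u4E2D')) // 20013 (0x4E2D, left untouched)
}
```

Because characters outside the table fall through unchanged, CJK code points keep their original values here, which matches the behaviour of the `else -> i` branch above.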