java unicode codepoint

Java - convert 2 code points char to a single code point char


I am processing text that I then have to link to files. The text has ä (Unicode code points 97 + 776), but the file system has the file written as ä (Unicode code point 228). Is there a way to convert 97 + 776 to 228? I believe these should be surrogate pairs and that the text is UTF-8 encoded. I've tried getBytes with UTF-16 and other encodings, but nothing worked. I can't even paste the two-code-point character here correctly: it gets converted to the single character, but the hex representation is still "61 cc 88". What exactly is this "ä"?


Solution

  • The one with two codepoints isn't a surrogate pair, but rather an "a" with a combining diacritic "¨", resulting in the same visual appearance (in fonts that support it) as the precomposed (= character and diacritic in one) character "ä".

    To convert between the two you need a Normalizer. Java's built-in class java.text.Normalizer handles this: normalizing to form NFC composes "a" + "¨" into the single code point "ä", and form NFD does the reverse. Have a look at https://stackoverflow.com/a/58403649/12344762 for more information.
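
As a minimal sketch of the approach described above, the following uses java.text.Normalizer to compose the two code points 97 + 776 into the single code point 228. The class name ComposeDemo and the helper toComposed are hypothetical names chosen for illustration; only Normalizer.normalize and Normalizer.Form.NFC come from the JDK.

```java
import java.text.Normalizer;

public class ComposeDemo {

    // Hypothetical helper: normalizes to NFC, which replaces a base
    // character plus combining mark with the precomposed character
    // wherever Unicode defines one.
    static String toComposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "a" (U+0061, decimal 97) followed by the combining
        // diaeresis (U+0308, decimal 776); UTF-8 bytes 61 cc 88.
        String decomposed = "a\u0308";
        String composed = toComposed(decomposed);

        System.out.println(decomposed.length());       // prints 2
        System.out.println(composed.length());         // prints 1
        System.out.println((int) composed.charAt(0));  // prints 228
    }
}
```

For the file lookup in the question, it would probably make sense to normalize the name taken from the text before comparing it with the name on disk. This assumes the file system really stores the precomposed form; if it stores the decomposed form instead, Normalizer.Form.NFD would be the one to use.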