java encoding character-encoding special-characters

Umlaut Characters: How do I convert “Ã<0x9c>” to “Ü” in Java?


I have a string that contains "Ã<0x9c>". Those aren't exactly the characters; the best way I can describe it is:

  • Looks like "Ã" in Postgres DB / Datagrip Viewer.
  • Looks like "Ã<0x9c>" when copied from Datagrip into Sublime Text.

I have another program that does a String match and it expects this “Ü”. So I assume "Ã<0x9c>" is equivalent to “Ü” in some form.

My question is: how do I convert "Ã<0x9c>" to “Ü” in Java for this whole class of characters? I don't want to solve it for just this one character.

So if "Ã<0x9c>" is encoding1 and “Ü” is encoding2, how do I convert from encoding1 to encoding2?

Also, what would encoding1 and encoding2 be called? My googling says encoding2 are called umlaut characters but I don't know what encoding1 is called.


Solution

  • This is probably a mix-up of the ISO-8859-1 (Latin-1) and UTF-8 encodings. Let's see:

    In UTF-8, the Unicode character U+00DC (Ü) is encoded as 0xC3 0x9C.

    In Latin-1, 0xC3 is the letter A with tilde (Ã), but 0x9C is a C1 control code with no printable character (glyph). That's why you see Ã<0x9c>.

    So, the string (or byte array) you have is probably encoded in UTF-8 but was read as Latin-1, and should be decoded as UTF-8 instead.

    Since you asked for how to convert, here's a demonstration:

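    import java.io.ByteArrayOutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;

    public class Demo {
        public static void main(String[] args) throws Exception {
            // The two bytes that UTF-8 uses for U+00DC (Ü)
            byte[] input = {(byte) 0xc3, (byte) 0x9c};

            // Decode them as UTF-8 to get the intended one-character string
            var s = new String(input, StandardCharsets.UTF_8);
            System.out.println(s);

            // Re-encode the string as Latin-1
            var out = new ByteArrayOutputStream();
            var writer = new OutputStreamWriter(out, StandardCharsets.ISO_8859_1);

            writer.write(s);
            writer.flush();
            byte[] result = out.toByteArray();

            System.out.println(result.length);    // one byte in Latin-1
            System.out.printf("%x\n", result[0]); // that byte is 0xdc
        }
    }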
    

    which prints

    Ü
    1
    dc
    

    Oops, forgot to mention that 0xDC is the same character in both Unicode (U+00DC) and Latin-1: "Latin Capital Letter U with Diaeresis".
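
If what you hold is already a Java String containing the two characters Ã and U+009C (rather than the raw bytes), the same idea can be applied in reverse. The sketch below is only an illustration under that assumption: the class name MojibakeRepair and the methods garble and repair are made up for this example, and it assumes the wrong decoding really was ISO-8859-1. If the data went through Windows-1252 instead, the charset in repair would need to change.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the class and method names are illustrative only.
final class MojibakeRepair {

    // Reproduces the damage: encode as UTF-8, then wrongly decode those bytes as Latin-1.
    static String garble(String correct) {
        byte[] utf8 = correct.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage: recover the original bytes via Latin-1, then decode them as UTF-8.
    // Assumes every char in the input is <= U+00FF; any other char is encoded as '?' and lost.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00DC" is Ü, written as an escape so the source file encoding does not matter.
        String garbled = garble("\u00DC");

        // Print code points instead of glyphs, because console encodings vary.
        garbled.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();

        String fixed = repair(garbled);
        System.out.println(fixed.equals("\u00DC"));
    }
}
```

Because repair works on whole strings, it handles any character that was damaged the same way, not just Ü. It should only be applied to text known to be garbled: running it on a string that is already correct will corrupt its non-ASCII characters.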