java encoding character-encoding special-characters

Umlaut Characters: How do I convert “Ã<0x9c>” to “Ü” in Java?

I have a string that contains "Ã<0x9c>". It's not exactly those characters; the best way I can describe it is:

Looks like "Ã" in Postgres DB / Datagrip Viewer.
Looks like "Ã<0x9c>" when copied from Datagrip into Sublime Text.

I have another program that does a String match and it expects this “Ü”. So I assume "Ã<0x9c>" is equivalent to “Ü” in some form.

My question is how do I convert "Ã<0x9c>" to “Ü” in Java for the set of these generic characters? I don't want to solve it for just this one character.

So if "Ã<0x9c>" is encoding1 and “Ü” is encoding2, how do I convert from encoding1 to encoding2?

Also, what would encoding1 and encoding2 be called? My googling says encoding2 are called umlaut characters but I don't know what encoding1 is called.

Solution

This is probably a mix-up between the ISO-8859-1 (Latin-1) and UTF-8 encodings. Let's see:

In UTF-8, the Unicode character U+00DC (Ü) is encoded as 0xC3 0x9C.

In Latin-1, 0xC3 is the letter A with tilde (Ã), but 0x9C is a control character with no printable glyph. That's why you see Ã<0x9c>.
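To check this byte-level explanation, here is a minimal sketch that reproduces the garbling on purpose: it encodes Ü as UTF-8 and then decodes those bytes as ISO-8859-1. The class and method names (`MojibakeDemo`, `misdecode`) are made up for illustration, and the character is written as a Unicode escape so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u00dc"); // U+00DC is the letter Ü
        // Print the code point of each resulting character.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

This prints U+00C3 (Ã) followed by U+009C (the control character), which matches what you see in Sublime Text.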

So, the string (or byte array) you have is probably encoded in UTF-8, and should be decoded as such.

Since you asked for how to convert, here's a demonstration:

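import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class Demo {
    public static void main(String[] args) throws Exception {
        // The two bytes as they arrive: the UTF-8 encoding of U+00DC.
        byte[] input = {(byte) 0xc3, (byte) 0x9c};

        // Decode them as UTF-8 to get the intended character.
        var s = new String(input, StandardCharsets.UTF_8);
        System.out.println(s);

        // Re-encode the string as ISO-8859-1 to inspect the resulting bytes.
        var out = new ByteArrayOutputStream();
        var writer = new OutputStreamWriter(out, StandardCharsets.ISO_8859_1);

        writer.write(s);
        writer.flush();
        byte[] result = out.toByteArray();

        System.out.println(result.length);
        System.out.printf("%x\n", result[0]);
    }
}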

which prints

Ü
1
dc

I forgot to mention that 0xDC is the same character in both Unicode and Latin-1: "Latin Capital Letter U with Diaeresis".
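If what you have is already a Java String containing "Ã" followed by the U+009C control character (rather than the raw bytes), one possible repair is to reverse the wrong step: turn the characters back into bytes with ISO-8859-1, then decode those bytes as UTF-8. This is only a sketch and assumes the text was mis-decoded as ISO-8859-1 exactly once; the names `MojibakeRepair` and `repair` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Assumption: 'garbled' came from UTF-8 bytes that were decoded as ISO-8859-1.
    static String repair(String garbled) {
        // ISO-8859-1 maps U+0000..U+00FF one-to-one onto bytes, recovering the original bytes.
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        // Decode the recovered bytes with the encoding they were actually written in.
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = "\u00c3\u009c"; // "Ã" followed by the U+009C control character
        String fixed = repair(garbled);
        System.out.println(fixed.equals("\u00dc")); // true
        System.out.printf("U+%04X%n", (int) fixed.charAt(0)); // U+00DC
    }
}
```

Apply this only to strings you know are garbled: a string that is already correct and contains characters outside Latin-1 would have them replaced by "?" in the `getBytes` step. Where possible, it is cleaner to fix the decoding at the point where the bytes are first read.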