I have a string that contains "Ã<0x9c>". It's not exactly those characters, the best way I can describe it is:
I have another program that does a String match and it expects this “Ü”. So I assume "Ã<0x9c>" is equivalent to “Ü” in some form.
My question is how do I convert "Ã<0x9c>" to “Ü” in Java for the set of these generic characters? I don't want to solve it for just this one character.
So if "Ã<0x9c>" is encoding1 and “Ü” is encoding2, how do I convert from encoding1 to encoding2?
Also, what would encoding1 and encoding2 be called? My googling says encoding2 are called umlaut characters but I don't know what encoding1 is called.
This is probably a conflation of the ISO-8859-1 (Latin-1) and UTF-8 encodings. Let's see:
In UTF-8, the Unicode character U+00DC (Ü) is encoded as 0xC3 0x9C.
In Latin-1, 0xC3 is the letter A with tilde (Ã), but 0x9C is a C1 control code with no printable character (glyph). That's why you see Ã<0x9c>.
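To check the byte values for yourself, here is a small sketch. The class name `EncodingBytes` and the `hex` helper are made up for illustration; only the standard `String.getBytes(Charset)` and `new String(byte[], Charset)` calls are used.

```java
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Hypothetical helper: formats each byte as two hex digits, separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00DC (Ü), written as an escape so the source file's own encoding does not matter
        String u = "\u00dc";
        System.out.println(hex(u.getBytes(StandardCharsets.UTF_8)));      // c3 9c
        System.out.println(hex(u.getBytes(StandardCharsets.ISO_8859_1))); // dc

        // Decoding the UTF-8 bytes as Latin-1 gives two chars: U+00C3 (Ã) and the control char U+009C
        String wrong = new String(u.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());                               // 2
        System.out.printf("%04x %04x%n", (int) wrong.charAt(0), (int) wrong.charAt(1)); // 00c3 009c
    }
}
```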
So, the string (or byte array) you have is probably encoded in UTF-8, and should be decoded as such.
Since you asked how to convert, here's a demonstration:
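    import java.io.ByteArrayOutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;

    public class Demo {
        public static void main(String[] args) throws Exception {
            // The two bytes in question: the UTF-8 encoding of Ü (U+00DC)
            byte[] input = {(byte) 0xc3, (byte) 0x9c};
            var s = new String(input, StandardCharsets.UTF_8);
            System.out.println(s);

            // Re-encode the decoded string as ISO-8859-1 (Latin-1)
            var out = new ByteArrayOutputStream();
            var writer = new OutputStreamWriter(out, StandardCharsets.ISO_8859_1);
            writer.write(s);
            writer.flush();
            byte[] result = out.toByteArray();
            System.out.println(result.length);
            System.out.printf("%x\n", result[0]);
        }
    }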
which prints
Ü
1
dc
Oops, forgot to mention that 0xDC is the same character in both Unicode (U+00DC) and Latin-1: "Latin Capital Letter U with Diaeresis".
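If what you hold is already a Java String that was decoded with the wrong charset (rather than the raw bytes), a round trip through Latin-1 may undo the damage for all such characters at once. This is only a sketch: it assumes the text was UTF-8 that got decoded as ISO-8859-1 exactly once, and `MojibakeFix` and `repair` are names made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeFix {
    // Hypothetical helper: undoes a single "UTF-8 bytes read as Latin-1" mis-decode.
    // Assumes every char in the input is in the range U+0000..U+00FF.
    static String repair(String garbled) {
        // Latin-1 maps chars U+0000..U+00FF one-to-one onto bytes, so this recovers the original bytes
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        // Decode those bytes the way they were meant to be decoded
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C3 U+009C is what Ü looks like after the wrong decode
        String garbled = "\u00c3\u009c";
        System.out.println(repair(garbled).equals("\u00dc")); // true

        // Works for any character, not just Ü: this is a garbled "Grüße"
        System.out.println(repair("Gr\u00c3\u00bc\u00c3\u009fe").equals("Gr\u00fc\u00dfe")); // true
    }
}
```

If the wrong decode used Windows-1252 instead of Latin-1, byte 0x9C would have become œ rather than a control character, and the re-encoding step would need that charset instead. The cleaner fix, where you control it, is to decode the bytes as UTF-8 at the point where they are first read.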