java encoding utf-8 iso-8859-1

Convert combining diaereses to ISO 8859-1


This is similar to this question, but I specifically need to know how to convert to ISO-8859-1 format, not UTF-8.

Short question: I need a character followed by a combining diaeresis converted to its Latin-1 equivalent (if one exists).

Longer question: I have German strings that contain combining diaereses (UTF-8: [cc][88], i.e. Unicode code point U+0308), but my database only supports ISO-8859-1 (i.e. Latin-1). Because the characters are "decomposed" into a base letter plus a combining diaeresis, I can't just "convert" to ISO-8859-1: the byte sequence [cc][88] acts on the preceding character, and the combination may not have a corresponding character in ISO-8859-1.

I tried this code:

import java.nio.charset.Charset;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

// "ü" is written as 'u' followed by a combining diaeresis (U+0308)
String s = "fu\u0308r";
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");

ByteBuffer inputBuffer = ByteBuffer.wrap(s.getBytes(utf8charset));

// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);

// encode ISO-8859-1
ByteBuffer outputBuffer = iso88591charset.encode(data);
byte[] outputData = new byte[outputBuffer.remaining()];
outputBuffer.get(outputData);

String isoString = new String(outputData, iso88591charset);

//isoString is "fu?r"

But it just fails to encode the combining diaeresis rather than seeing that "u" plus the diaeresis could be converted to U+00FC (UTF-8: [c3][bc]). Is there a library that can detect when a character followed by a combining diaeresis maps to an existing ISO-8859-1 character? (Preferably in Java.)


Solution

  • You need to normalize before you encode.

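    Use the java.text.Normalizer class to convert the string to a composed form (NFC), so that a base letter plus combining diaeresis becomes a single precomposed character, and then encode.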
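
As a minimal sketch of that approach (the class and method names here are made up for illustration, not taken from the question), the string is first normalized to NFC so that "u" plus U+0308 is composed into U+00FC, and only then encoded as ISO-8859-1. The second helper is one possible way to check in advance whether anything would be lost, since some base letter and diaeresis pairs have no Latin-1 equivalent.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper class; the names are illustrative only.
final class Latin1Converter {

    // Compose base letter + combining mark into one code point (NFC),
    // then encode. Anything without a Latin-1 equivalent becomes '?'.
    static byte[] toLatin1(String input) {
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        return composed.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Reports whether the composed string can be encoded without loss.
    static boolean fitsLatin1(String input) {
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(composed);
    }

    public static void main(String[] args) {
        String decomposed = "fu\u0308r"; // 'u' followed by U+0308
        byte[] latin1 = toLatin1(decomposed);

        // Three bytes: 'f', 0xFC (ü in ISO-8859-1), 'r'
        System.out.println(latin1.length);
        System.out.println(Integer.toHexString(latin1[1] & 0xFF));

        // 'q' + U+0308 has no Latin-1 equivalent, so this reports false
        System.out.println(fitsLatin1("q\u0308"));
    }
}
```

If the database driver expects a String rather than raw bytes, normalizing with NFC before handing the string over should be enough; the explicit getBytes call is only needed when you write the bytes yourself.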