Tags: java, encoding, dom4j

Converting document encoding when reading with dom4j


Is there any way I can convert a document being parsed by dom4j's SAXReader from the ISO-8859-2 encoding to UTF-8? I need that to happen while parsing, so that the objects created by dom4j are already Unicode/UTF-8 and running code such as:

"some text".equals(node.getText());

returns true.


Solution

  • This is done automatically by dom4j. All String instances in Java hold decoded Unicode characters (UTF-16 internally), regardless of the encoding of the bytes they were read from; once a String is created, it isn't possible to tell what the original character encoding was (or even if the string was created from encoded bytes).

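    Just make sure that the XML document declares its character encoding, for example <?xml version="1.0" encoding="ISO-8859-2"?>. The declaration is required unless the document is UTF-8 or UTF-16. The parser reads the declaration and decodes the bytes accordingly, so node.getText() returns an ordinary Java String.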
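
To illustrate the point, here is a minimal sketch that parses the same text from ISO-8859-2 bytes and from UTF-8 bytes and compares the results with a plain Java String. It deliberately uses the JDK's built-in DOM parser rather than dom4j so that it runs without extra dependencies; dom4j's SAXReader delegates to a SAX parser that decodes bytes in the same way. The class name EncodingDemo and the helper methods encode and rootText are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EncodingDemo {

    // Builds a small XML document that declares the given encoding
    // and encodes it to bytes using that same charset.
    public static byte[] encode(String text, String encoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>"
                + "<root>" + text + "</root>";
        return xml.getBytes(Charset.forName(encoding));
    }

    // Parses raw XML bytes and returns the text of the root element.
    // The parser is given bytes (not a Reader), so it picks the charset
    // from the XML declaration itself.
    public static String rootText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Polish sample text with characters outside ASCII: "zażółć"
        String text = "za\u017c\u00f3\u0142\u0107";

        byte[] latin2 = encode(text, "ISO-8859-2");
        byte[] utf8 = encode(text, "UTF-8");

        // The byte sequences differ, but the decoded Strings should be
        // equal to the literal and to each other.
        System.out.println(text.equals(rootText(latin2)));
        System.out.println(text.equals(rootText(utf8)));
    }
}
```

With dom4j, the equivalent would be roughly new SAXReader().read(inputStream) followed by getText() on the node of interest. The important detail in either case is to hand the parser a byte source (an InputStream or a File) rather than a Reader that has already decoded the bytes with the wrong charset.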