java xml unicode encoding unicode-escapes

XML escaping ignores accented characters


I am trying to send a POST request whose body contains XML. The receiving API demands that any special characters be encoded as numeric XML entities.

Let's take this example: İlkay Gündoğan

After XML-escaping with standard libraries such as org.apache.commons.text.StringEscapeUtils, or with Jsoup and its XML parser, only the ü is converted to an entity; İ and ğ are left unchanged. I have read the documentation of both libraries and saw that they only escape a certain range of characters.

  • Why do those libraries convert only specific ranges?
  • Is there any JVM library that supports escaping accented characters such as İ and ğ?

I already tried sending a manually crafted example, with İ, ü, and ğ all written as numeric entities, to the receiving API, and it worked as expected.

All values are written and read in UTF-8.


Solution

  • If the XML encoding is UTF-8 (the default), converting special characters to numeric entities is not needed, so the receiver's requirement is dubious. escapeXml11 is indeed limited, as its Javadoc says.

    To translate all non-ASCII characters for a String xml:

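    // Needs java.util.stream.Collectors; Character.toString(int) needs Java 11+.
    xml = xml.codePoints()
        // mapToObj, not map: codePoints() is an IntStream and we want Strings
        .mapToObj(cp -> cp < 128 ? Character.toString(cp) : String.format("&#%d;", cp))
        .collect(Collectors.joining());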
    

    Since the result is then pure ASCII, you could even declare encoding="US-ASCII" in the XML declaration.
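
For reference, the approach above can be wrapped in a small self-contained class. This is only a sketch: the class name NumericEntityEscaper and the method escapeNonAscii are made up for illustration, and the method deliberately leaves markup characters such as < and & alone, on the assumption that it is applied to XML that is already well-formed. String literals use Unicode escapes so the file compiles regardless of the source encoding.

```java
import java.util.stream.Collectors;

public class NumericEntityEscaper {

    // Hypothetical helper: replaces every code point outside ASCII with a
    // decimal numeric character reference (&#NNN;). ASCII characters,
    // including '<', '>' and '&', are passed through unchanged.
    public static String escapeNonAscii(String xml) {
        return xml.codePoints()
                // codePoints() yields full code points, so characters outside
                // the BMP (surrogate pairs) become a single reference.
                .mapToObj(cp -> cp < 128
                        ? String.valueOf((char) cp)
                        : "&#" + cp + ";")
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // The name from the question: U+0130, U+00FC and U+011F are the
        // three non-ASCII letters.
        String name = "\u0130lkay G\u00fcndo\u011fan";
        System.out.println(escapeNonAscii(name));
        // prints: &#304;lkay G&#252;ndo&#287;an
    }
}
```

Iterating over code points rather than chars matters for characters beyond U+FFFF: a char-based loop would emit two references for the two surrogates, which is not valid XML.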