java xml unicode marshalling surrogate-pairs

XML marshalling of surrogate pairs

I have come across a strange behavior of the JAXB marshaller when surrogate pairs are involved. Why does the marshaller add an unnecessary (and invalid) XML entity? When I try to marshal the following:

\uD83D\uDCB3, i.e. the code points 55357 and 56499

The marshaller outputs the code point 128179 (which is valid and represents the whole surrogate pair in XML) followed by an unnecessary 56499 (which is not a valid XML entity and represents the low half of the pair). How can I configure the marshaller to produce valid XML entities in the output, or do I just need to upgrade libraries? I am using Java 8.

Sample reproducing code:

    String inputSurrogate = "\uD83D\uDCB3";
    JAXBContext jaxbContext = JAXBContext.newInstance(Customer.class);

    Marshaller jaxbMarshaller = jaxbContext.createMarshaller();
    jaxbMarshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
    jaxbMarshaller.setProperty(Marshaller.JAXB_ENCODING, "UTF-8");
    StringWriter sw = new StringWriter();

    Customer customer = new Customer();
    customer.setText(inputSurrogate);
    jaxbMarshaller.marshal(customer, sw);
    String xmlString = sw.toString();

    System.out.println(xmlString);
    for (int i = 0; i < xmlString.length(); i++) {
        int ch = xmlString.codePointAt(i);
        System.out.print(ch);
        System.out.print("|");
    }

Output (note the |128179|56499|; as I understand it, the 56499 is unnecessary and invalid):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<customer>
    <text>💳</text>
</customer>

60|63|120|109|108|32|118|101|114|115|105|111|110|61|34|49|46|48|34|32|101|110|99|111|100|105|110|103|61|34|85|84|70|45|56|34|32|115|116|97|110|100|97|108|111|110|101|61|34|121|101|115|34|63|62|10|60|99|117|115|116|111|109|101|114|62|10|32|32|32|32|60|116|101|120|116|62|128179|56499|60|47|116|101|120|116|62|10|60|47|99|117|115|116|111|109|101|114|62|10|

Solution

The unwanted "code point" in your output is an artifact of a bug in the loop that prints the code points, not something the marshaller wrote.

Java strings have an interface with a bias towards UTF-16. All offsets and all methods working with the char data type pretend that the string is an array of UTF-16 code units.

The same goes for string escapes like "\uD83D\uDCB3". This literal does not contain two Unicode code points. Rather, it contains two UTF-16 code units that together form a single code point, namely U+1F4B3 (decimal 128179), the code point for the credit card symbol.
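To make the distinction concrete, here is a small sketch that compares the two ways of counting. The class and method names (SurrogatePairDemo, codeUnitCount, codePointCount) are made up for illustration; only the java.lang.String and java.lang.Character calls are standard API.

```java
// Hypothetical demo class; the name and helpers are for illustration only.
public class SurrogatePairDemo {

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnitCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String card = "\uD83D\uDCB3";

        System.out.println(codeUnitCount(card));    // 2 code units
        System.out.println(codePointCount(card));   // 1 code point
        System.out.println(card.codePointAt(0));    // 128179

        // Combining the two surrogates by hand gives the same code point.
        System.out.println(Character.toCodePoint('\uD83D', '\uDCB3')); // 128179
    }
}
```

So the 55357 and 56499 from the question are the two halves of one code point, not two code points.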

Your output code mixes code points and UTF-16 code units by accessing code points using codePointAt() but incrementing the offset (variable i) by code units. Thus, the credit card code point is accessed twice: once correctly and the second time incorrectly (with the offset pointing into the middle of the surrogate pair).
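The effect can be reproduced without JAXB at all. The following sketch (the class name MixedIndexingDemo and the method dumpPerCodeUnit are hypothetical) runs the same kind of loop as in the question over just the two-unit string:

```java
// Hypothetical demo class reproducing the loop from the question.
public class MixedIndexingDemo {

    // Reads code points but advances one code unit at a time,
    // so every supplementary character is visited twice.
    static String dumpPerCodeUnit(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(s.codePointAt(i)).append('|');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // At index 0, codePointAt() sees a high surrogate followed by a low
        // surrogate and returns the combined code point 128179.
        // At index 1, it sees only the low surrogate and returns it as is.
        System.out.println(dumpPerCodeUnit("\uD83D\uDCB3")); // 128179|56499|
    }
}
```

The string holds one character, yet the loop prints two numbers. That is the same |128179|56499| sequence as in the question, so the XML text itself contains nothing extra.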

The correct loop advances the offset by the number of code units that each code point occupies:

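    int offset = 0;
    while (offset < xmlString.length()) {
        // Read the full code point starting at this offset.
        int codePoint = xmlString.codePointAt(offset);
        System.out.print(codePoint);
        System.out.print("|");
        // Advance by 1 code unit for BMP characters, 2 for surrogate pairs.
        offset += Character.charCount(codePoint);
    }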
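Since you are on Java 8, an alternative is to let the library do the iteration through CharSequence.codePoints(), which returns an IntStream of code points and takes care of surrogate pairs. This is only a sketch of that approach; the class name CodePointStreamDemo and the method dumpCodePoints are hypothetical.

```java
import java.util.stream.Collectors;

// Hypothetical demo class; codePoints() itself is standard since Java 8.
public class CodePointStreamDemo {

    // Formats every code point of the string as "<decimal>|".
    static String dumpCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> cp + "|")
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        String xmlFragment = "<text>\uD83D\uDCB3</text>";
        // The credit card symbol shows up exactly once, as 128179.
        System.out.println(dumpCodePoints(xmlFragment));
    }
}
```

With this version there is no offset to keep in sync, so the kind of mismatch between code points and code units described above cannot occur.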