java xml serialization utf opensaml

XML serialisation of UTF characters emits NCRs for surrogate pairs


I have a payload that may contain text such as 𠮷 (U+20BB7). When I serialize this payload to XML, I expect such characters either to be written literally or to be encoded as a single numeric character reference (NCR) for the code point, &#x20BB7;. Instead I observe two NCRs, one per UTF-16 surrogate (such as &#55362;&#57271;), which together represent the same Unicode character.
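
For reference, the relationship between the code point and its surrogates can be reproduced in plain Java. This is only an illustrative sketch; the class and method names (SurrogateDemo, codePointNcr, perCharNcr) are hypothetical and not part of the payload code. It shows that 𠮷 is one code point but two UTF-16 chars, which is why a serializer that works char by char emits two references.

```java
/** Hypothetical helper, for illustration only. */
final class SurrogateDemo {
  /** Builds one NCR for the first code point of s (the desired form). */
  static String codePointNcr(String s) {
    return "&#x" + Integer.toHexString(s.codePointAt(0)).toUpperCase() + ";";
  }

  /** Builds one NCR per UTF-16 code unit of s (the unwanted form). */
  static String perCharNcr(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      sb.append("&#x").append(Integer.toHexString(c).toUpperCase()).append(';');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // U+20BB7, built from the code point to avoid source-encoding issues.
    String s = new String(Character.toChars(0x20BB7));
    System.out.println(s.length());      // 2 (a high and a low surrogate)
    System.out.println(codePointNcr(s)); // &#x20BB7;
    System.out.println(perCharNcr(s));   // &#xD842;&#xDFB7;
  }
}
```

In decimal, the two surrogates 0xD842 and 0xDFB7 are 55362 and 57271.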

A minimal example demonstrating this problem is:

package something...;

import net.shibboleth.utilities.java.support.xml.SerializeSupport;
import org.opensaml.core.config.InitializationException;
import org.opensaml.core.config.InitializationService;
import org.opensaml.core.xml.schema.XSString;
import org.opensaml.core.xml.schema.impl.XSStringBuilder;
import org.opensaml.core.xml.util.XMLObjectSupport;
import org.opensaml.saml.saml2.core.Attribute;
import org.opensaml.saml.saml2.core.AttributeValue;
import org.w3c.dom.Document;

/** Generates Response.xml using openSAML library version 3. */
final class SamlResponseGenerator {
  public static void main(String[] args) throws Exception {
    try {
      InitializationService.initialize();
    } catch (InitializationException e) {
      throw new IllegalStateException(e);
    }

    // First Name Attribute.
    Attribute firstNameAttribute =
        (Attribute) XMLObjectSupport.buildXMLObject(Attribute.DEFAULT_ELEMENT_NAME);
    firstNameAttribute.setName("FirstName");
    XSStringBuilder firstNameTestValueStringBuilder = new XSStringBuilder();
    XSString firstNameAttributeValueXS =
        firstNameTestValueStringBuilder.buildObject(
            AttributeValue.DEFAULT_ELEMENT_NAME, XSString.TYPE_NAME);
    String myName = "𠮷";
    firstNameAttributeValueXS.setValue(myName);
    firstNameAttribute.getAttributeValues().add(firstNameAttributeValueXS);

    Document doc = XMLObjectSupport.marshall(firstNameAttribute).getOwnerDocument();
    String docString = SerializeSupport.nodeToString(doc);
    System.out.println(docString);
  }

  private SamlResponseGenerator() {}
}

List of solutions tried:

  1. Transformers (does not work)
  2. XMLHelper.writeNode with a file writer opened with UTF-8 encoding (does not work)
  3. Post-processing the XML output (works, but is hacky)
  4. OpenSAML v2 and v3 (does not work)
  5. XML marshalling of surrogate pairs

Am I using an incorrect serializer or an incorrect configuration? What can I do to get the desired output (similar to the legacy server)?

Doesn't this Wikipedia article suggest that surrogate pairs are not allowed in numeric character reference notation?
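
That reading matches the XML 1.0 rule that a character reference must denote a legal character, and the surrogate range U+D800 to U+DFFF is excluded. A small sketch of how this could be checked with the JDK's own parser follows; the class name SurrogateNcrCheck and the element name a are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: reports whether a string parses as well-formed XML. */
final class SurrogateNcrCheck {
  static boolean isWellFormed(String xml) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    // DefaultHandler rethrows fatal errors without printing them to stderr.
    builder.setErrorHandler(new DefaultHandler());
    try {
      builder.parse(new InputSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      return false; // not well-formed
    }
  }

  public static void main(String[] args) throws Exception {
    // One NCR for the code point: accepted.
    System.out.println(isWellFormed("<a>&#x20BB7;</a>"));
    // One NCR per surrogate: rejected as illegal character references.
    System.out.println(isWellFormed("<a>&#55362;&#57271;</a>"));
  }
}
```

So the surrogate-pair output is not just unexpected; a conforming parser on the receiving side should refuse it.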


Solution

  • Using the JDK's built-in transformer factory, com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl, exclusively (instead of whichever transformer implementation is otherwise picked up) gives the expected output.
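
A minimal sketch of what that could look like, assuming the DOM node has already been produced (for example by XMLObjectSupport.marshall in the question). The class name JdkTransformerSketch, the serialize helper and the stand-in document are hypothetical; only the factory class name comes from the solution above. The factory is requested by name through the standard TransformerFactory.newInstance(String, ClassLoader) overload, so the lookup does not depend on what else is on the classpath.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical sketch: serialize a DOM node with the JDK's built-in transformer. */
final class JdkTransformerSketch {
  private static final String JDK_FACTORY =
      "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";

  static String serialize(Node node) throws Exception {
    // Ask for the JDK implementation explicitly instead of relying on lookup order.
    TransformerFactory factory = TransformerFactory.newInstance(JDK_FACTORY, null);
    Transformer transformer = factory.newTransformer(); // identity transform
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    StringWriter out = new StringWriter();
    transformer.transform(new DOMSource(node), new StreamResult(out));
    return out.toString();
  }

  /** Stand-in for the marshalled SAML attribute value. */
  static Document sampleDocument() throws Exception {
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element value = doc.createElement("AttributeValue");
    value.setTextContent("\uD842\uDFB7"); // U+20BB7 as a surrogate pair
    doc.appendChild(value);
    return doc;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(serialize(sampleDocument()));
  }
}
```

If the serialization happens inside library code that cannot be changed, the same factory can usually be selected globally by setting the javax.xml.transform.TransformerFactory system property to that class name, which is probably what "exclusively" amounts to in practice.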