I have a payload that may contain text such as 𠮷 (U+20BB7). When I serialize this payload to XML, I expect such characters either to be left unencoded or to be encoded as a single numeric character reference (NCR), &#x20BB7;. Instead, I observe &#55362;&#57271;: two NCRs, one for each of the two UTF-16 surrogate code units that together represent that one Unicode character.
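For reference, the difference between the two forms can be reproduced with plain JDK calls, independent of OpenSAML. The following is only a sketch of the arithmetic; the class and method names (SurrogatePairDemo, toNcr, toPerCodeUnitNcr) are made up for illustration and are not part of any library.

```java
final class SurrogatePairDemo {

    /** Formats one Unicode code point as a single decimal NCR (the desired form). */
    static String toNcr(int codePoint) {
        return "&#" + codePoint + ";";
    }

    /** Formats every UTF-16 code unit as its own NCR (the unwanted form). */
    static String toPerCodeUnitNcr(String text) {
        StringBuilder out = new StringBuilder();
        for (char unit : text.toCharArray()) {
            out.append("&#").append((int) unit).append(';');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+20BB7 lies outside the Basic Multilingual Plane, so a Java String
        // stores it as two chars: a high and a low surrogate.
        String text = new String(Character.toChars(0x20BB7));

        System.out.println(text.length());                          // 2 code units
        System.out.println(text.codePointCount(0, text.length()));  // 1 code point
        System.out.println(toNcr(text.codePointAt(0)));             // &#134071;
        System.out.println(toPerCodeUnitNcr(text));                 // &#55362;&#57271;
    }
}
```

The decimal reference &#134071; is the same code point as the hexadecimal &#x20BB7;. A serializer that walks the string one char at a time, without recombining surrogates, produces the second form.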
A minimal example demonstrating this problem is:
    package something...;

    import net.shibboleth.utilities.java.support.xml.SerializeSupport;
    import org.opensaml.core.config.InitializationException;
    import org.opensaml.core.config.InitializationService;
    import org.opensaml.core.xml.schema.XSString;
    import org.opensaml.core.xml.schema.impl.XSStringBuilder;
    import org.opensaml.core.xml.util.XMLObjectSupport;
    import org.opensaml.saml.saml2.core.Attribute;
    import org.opensaml.saml.saml2.core.AttributeValue;
    import org.w3c.dom.Document;

    /** Generates Response.xml using openSAML library version 3. */
    final class SamlResponseGenerator {

        public static void main(String[] args) throws Exception {
            try {
                InitializationService.initialize();
            } catch (InitializationException e) {
                throw new IllegalStateException(e);
            }

            // First Name Attribute.
            Attribute firstNameAttribute =
                    (Attribute) XMLObjectSupport.buildXMLObject(Attribute.DEFAULT_ELEMENT_NAME);
            firstNameAttribute.setName("FirstName");
            XSStringBuilder firstNameTestValueStringBuilder = new XSStringBuilder();
            XSString firstNameAttributeValueXS =
                    firstNameTestValueStringBuilder.buildObject(
                            AttributeValue.DEFAULT_ELEMENT_NAME, XSString.TYPE_NAME);
            String myName = "𠮷";
            firstNameAttributeValueXS.setValue(myName);
            firstNameAttribute.getAttributeValues().add(firstNameAttributeValueXS);

            Document doc = XMLObjectSupport.marshall(firstNameAttribute).getOwnerDocument();
            String docString = SerializeSupport.nodeToString(doc);
            System.out.println(docString);
        }

        private SamlResponseGenerator() {}
    }
Am I using an incorrect serializer or an incorrect configuration? What can I do to get the desired output (similar to the legacy server)?
Doesn't this Wikipedia article suggest that surrogate pairs are not allowed in numeric character reference notation?
List of solutions tried:
Exclusively using the transformer factory com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl gives the expected output.
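A rough sketch of what serializing through that factory could look like is shown here. It bypasses SerializeSupport and goes through JAXP directly, and it assumes an Oracle/OpenJDK runtime where the JDK-internal class name above can be loaded. JdkSerializerSketch, nodeToString and sampleDocument are hypothetical names, and the stand-in document is built with plain DOM calls rather than OpenSAML.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class JdkSerializerSketch {

    // Class name taken from the question; it is JDK-internal, so its presence
    // on a given runtime is an assumption.
    static final String JDK_FACTORY =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";

    /** Serializes a DOM node with an identity transform from the named factory. */
    static String nodeToString(Node node) throws Exception {
        // Naming the factory explicitly avoids picking up another
        // TransformerFactory implementation from the classpath.
        TransformerFactory factory = TransformerFactory.newInstance(JDK_FACTORY, null);
        Transformer transformer = factory.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    /** Builds a tiny stand-in document whose text is the single character U+20BB7. */
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element value = doc.createElement("AttributeValue");
        value.setTextContent(new String(Character.toChars(0x20BB7)));
        doc.appendChild(value);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(nodeToString(sampleDocument()));
    }
}
```

The same factory can also be selected globally through the standard JAXP system property javax.xml.transform.TransformerFactory, which affects every caller of TransformerFactory.newInstance() in the JVM. Whether the character is then written literally or as a single NCR may depend on the JDK version and output encoding.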