java xml string xml-parsing sanitization

Clean string from 'unit separator' (0x1f) character for XML

I ran into the following exception when parsing XML generated from input strings:

org.xml.sax.SAXParseException: Character reference "&#
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)

I traced the problem to an input string containing the character 0x1f, the invisible "UNIT SEPARATOR" control character (see http://www.columbia.edu/kermit/ascii.html).
I had to copy the input into a text file to make it visible.
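Instead of pasting into a text file, the invisible character can also be made visible in code. The following is a minimal sketch, not part of the original setup: the class name ControlCharDump and the method revealControlChars are hypothetical, and the <U+XXXX> output format is an arbitrary choice.

```java
class ControlCharDump {

    /**
     * Hypothetical helper: replaces every ISO control character
     * (including tab and line breaks) with a visible marker such as <U+001F>.
     */
    static String revealControlChars(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isISOControl(c)) {
                sb.append(String.format("<U+%04X>", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "value" + (char) 0x1f + "next";
        System.out.println(revealControlChars(input)); // prints value<U+001F>next
    }
}
```

Logging the result of such a helper shows where in the input the offending character sits.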

I tested the input string in other places and ran into similar problems, for example with SQL Server:

Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: XML parsing: line 1, character 149, illegal xml character
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:602)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7418)

What would be the best way to strip such characters from an input string? Are there other characters that are problematic for XML and should be removed as well?

Solution

This is the solution I ended up with:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex pattern matching every character that is invalid in XML 1.0, ref: http://www.w3.org/TR/REC-xml/#charsets */
private static final Pattern INVALID_XML_CHAR_PATTERN = Pattern
        .compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]"); //$NON-NLS-1$

/**
 * Sanitizes the passed value for XML 1.0 by removing all invalid characters.
 *
 * @param input input value to sanitize
 * @return the sanitized string, or null if the input was null, empty,
 *         or contained no invalid characters (i.e. nothing was changed)
 */
public static String sanitizeXmlChars(String input) {
    if (input == null || input.isEmpty()) {
        return null;
    }
    Matcher matcher = INVALID_XML_CHAR_PATTERN.matcher(input);
    if (matcher.find()) {
        return matcher.replaceAll(""); //$NON-NLS-1$
    }
    return null;
}

Inspired by: https://www.rgagnon.com/javadetails/java-sanitize-xml-string.html
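Because sanitizeXmlChars returns null when nothing changed, callers have to fall back to the original value themselves. A minimal sketch of a wrapper that always returns a usable string could look like this; the class XmlSanitizeUsage and the method sanitizeOrOriginal are hypothetical names, and the pattern is copied from the solution above so that the example is self-contained.

```java
import java.util.regex.Pattern;

class XmlSanitizeUsage {

    // Same negated character class as in the solution above.
    private static final Pattern INVALID_XML_CHAR_PATTERN = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /**
     * Hypothetical wrapper: returns the input with invalid XML 1.0 characters
     * removed, or the input itself if it is null or empty.
     */
    static String sanitizeOrOriginal(String input) {
        if (input == null || input.isEmpty()) {
            return input;
        }
        return INVALID_XML_CHAR_PATTERN.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "Text " + (char) 31 + "Text2";
        System.out.println(sanitizeOrOriginal(dirty)); // prints Text Text2
    }
}
```

The trade-off is that the caller can no longer tell from the return value whether anything was removed, which is the information the null convention above preserves.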

With a simple JUnit test:

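import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

import org.junit.Test;

public class StringUtilTest {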

    @Test
    public void sanitizeXmlChars() {
        String goodXml = "<xml>value'<sub><![CDATA[Inhalt&auml;ä]]></sub></xml>"; //$NON-NLS-1$
        assertNull(StringUtil.sanitizeXmlChars(goodXml));

        // contains control character after <xml>
        String badXml = "<xml>" + (char) 31 + "value'<sub><![CDATA[Inhalt&auml;ä]]></sub></xml>"; //$NON-NLS-1$
        String result = StringUtil.sanitizeXmlChars(badXml);
        assertEquals(goodXml, result);

        String goodText = "This is a Text.\nWith two lines."; //$NON-NLS-1$
        assertNull(StringUtil.sanitizeXmlChars(goodText));
        // contains control character after two
        badXml = "This is a Text.\nWith two " + (char) 31 + "lines."; //$NON-NLS-1$
        result = StringUtil.sanitizeXmlChars(badXml);
        assertEquals(goodText, result);

        goodText = "Text Text2"; //$NON-NLS-1$
        assertNull(StringUtil.sanitizeXmlChars(goodText));

        badXml = "Text "; //$NON-NLS-1$
        // append control characters e.g. 30=>Record Separator 31=>Unit Separator
        for (int i = 1; i <= 31; i++) {
            // skip valid control characters: Horizontal Tab, Line Feed, Carriage Return
            if (i == 9 || i == 10 || i == 13) {
                continue;
            }
            badXml += String.valueOf((char) i);
        }
        badXml += "Text2";
        result = StringUtil.sanitizeXmlChars(badXml);
        assertEquals(goodText, result);
    }

}
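The unit test only checks the string result. As a rough end-to-end check, the sketch below feeds a document containing 0x1f to the JDK's default DOM parser before and after stripping. The class XmlParseCheck and the method parses are hypothetical names, and the pattern is repeated from the solution so the example runs on its own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmlParseCheck {

    // Same negated character class as in the solution above.
    private static final Pattern INVALID_XML_CHAR_PATTERN = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Returns true if the default DOM parser accepts the document. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Thrown for malformed input, e.g. an invalid XML character.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bad = "<xml>" + (char) 0x1f + "value</xml>";
        String cleaned = INVALID_XML_CHAR_PATTERN.matcher(bad).replaceAll("");
        System.out.println(parses(bad));     // prints false
        System.out.println(parses(cleaned)); // prints true
    }
}
```

The default error handler may additionally write a "[Fatal Error]" line to stderr for the first call.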

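An alternative is a third-party library such as Apache Commons Text (StringEscapeUtils originally lived in commons-lang, where it has been deprecated since version 3.6). Note that escapeXml10 does more than strip: it drops characters that are invalid in XML 1.0 and also escapes &, <, >, " and ', so it is suited to text content or attribute values rather than to a string that already contains markup:

String cleanInput = StringEscapeUtils.escapeXml10(input);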

https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html#escapeXml10(java.lang.String)