java, xml, string, xml-parsing, sanitization

Clean a string of the 'unit separator' (0x1f) character for XML


I ran into the following exception while parsing XML generated from inputs:

org.xml.sax.SAXParseException: Zeichenreferenz "&#
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)

I traced the problem down to an input string containing the character 0x1f, an invisible "UNIT SEPARATOR" character: http://www.columbia.edu/kermit/ascii.html
I had to copy the input into a text file to make it visible.
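
Since the character is invisible, it can help to let code report where it sits instead of inspecting the text by eye. The following is only a minimal sketch: the class and method names are hypothetical (not from any library), and the ranges are the `Char` production of XML 1.0. It lists the offset and code point of every character an XML 1.0 parser would reject.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: locates characters that are not allowed in XML 1.0. */
final class InvalidXmlCharLocator {

    /** True if the code point is inside the XML 1.0 "Char" ranges. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns one "offset: U+XXXX" entry per invalid code point in the input. */
    static List<String> locate(String input) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!isValidXml10(cp)) {
                hits.add(i + ": U+" + String.format("%04X", cp));
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Prints [5: U+001F]
        System.out.println(locate("<xml>" + (char) 0x1F + "value</xml>"));
    }
}
```

Offsets are `char` indexes into the Java string, so they may differ from the column a parser reports if the string contains supplementary characters.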

I tested the input string in other places and ran into similar problems, for example:

Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: XML parsing: line 1, character 149, illegal xml character
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:602)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7418)

What would be the best way to strip such characters from an input string? Are there other characters that are problematic for XML and should be removed?


Solution

  • This is the solution I ended up with:

    // Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
    /** RegEx pattern of invalid xml 1.0 characters, ref : http://www.w3.org/TR/REC-xml/#charsets */
    private static final Pattern INVALID_XML_CHAR_PATTERN = Pattern
            .compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]"); //$NON-NLS-1$
    
    /**
     * Sanitizes the passed value for XML 1.0 by removing all invalid characters.
     *
     * @param input input value to sanitize
     * @return the sanitized string, or null if input was null, empty, or not changed
     */
    public static String sanitizeXmlChars(String input) {
        if (input == null || ("".equals(input))) { //$NON-NLS-1$
            return null;
        }
        Matcher matcher = INVALID_XML_CHAR_PATTERN.matcher(input);
        if (matcher.find()) {
            return matcher.replaceAll(""); //$NON-NLS-1$
        }
        return null;
    }
    

    Inspired by: https://www.rgagnon.com/javadetails/java-sanitize-xml-string.html

    With a simple JUnit test:

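    // Imports assume JUnit 4; adjust them for your JUnit version.
    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertNull;
    
    import org.junit.Test;
    
    public class StringUtilTest {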
    
        @Test
        public void sanitizeXmlChars() {
            String goodXml = "<xml>value'<sub><![CDATA[Inhalt&auml;ä]]></sub></xml>"; //$NON-NLS-1$
            assertNull(StringUtil.sanitizeXmlChars(goodXml));
    
            // contains control character after <xml>
            String badXml = "<xml>" + (char) 31 + "value'<sub><![CDATA[Inhalt&auml;ä]]></sub></xml>"; //$NON-NLS-1$
            String result = StringUtil.sanitizeXmlChars(badXml);
            assertEquals(goodXml, result);
    
            String goodText = "This is a Text.\nWith two lines."; //$NON-NLS-1$
            assertNull(StringUtil.sanitizeXmlChars(goodText));
            // contains control character after two
            badXml = "This is a Text.\nWith two " + (char) 31 + "lines."; //$NON-NLS-1$
            result = StringUtil.sanitizeXmlChars(badXml);
            assertEquals(goodText, result);
    
            goodText = "Text Text2"; //$NON-NLS-1$
            assertNull(StringUtil.sanitizeXmlChars(goodText));
    
            badXml = "Text "; //$NON-NLS-1$
            // append control characters e.g. 30=>Record Separator 31=>Unit Separator
            for (int i = 1; i <= 31; i++) {
                // skip valid control characters: Horizontal Tab, Line Feed, Carriage Return
                if (i == 9 || i == 10 || i == 13) {
                    continue;
                }
                badXml += String.valueOf((char) i);
            }
            badXml += "Text2";
            result = StringUtil.sanitizeXmlChars(badXml);
            assertEquals(goodText, result);
        }
    
    }
    

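    Alternative solution, using a third-party library such as Apache Commons Text (the same method also exists in Apache Commons Lang 3, where the class is deprecated in favor of Commons Text). escapeXml10 removes characters that are not valid in XML 1.0, but it also escapes &, <, >, " and ', so it is meant for text values that are placed into XML, not for a complete XML document:

    String cleanInput = StringEscapeUtils.escapeXml10(input);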
    

    https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html#escapeXml10(java.lang.String)
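
Coming back to the regex approach: returning null for "unchanged" means every caller has to handle that case. Below is a minimal sketch of a variant that always returns a usable string, together with a check against the JDK's own DOM parser. The class and method names are hypothetical and not the original author's code; the character class is the same as in the pattern above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch, not the original author's class. */
final class XmlSanitizerSketch {

    /** Matches everything outside the XML 1.0 Char ranges. */
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Removes invalid characters; returns null only for null input. */
    static String stripInvalidXmlChars(String input) {
        return input == null ? null : INVALID_XML_CHARS.matcher(input).replaceAll("");
    }

    /** Returns true if the JDK's default DOM parser accepts the string as XML. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. an illegal character
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String value = "val" + (char) 0x1F + "ue";
        System.out.println(parses("<xml>" + value + "</xml>"));                       // false
        System.out.println(parses("<xml>" + stripInvalidXmlChars(value) + "</xml>")); // true
        System.out.println(parses("<xml>&#31;</xml>"));                               // false
    }
}
```

The last line shows a limit of this approach: the regex works on raw characters, so it does not touch a numeric character reference such as `&#31;` that an earlier escaping step has already written into the XML, and XML 1.0 parsers reject that reference as well. Sanitizing the raw value before it is escaped or serialized avoids this.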