I ran into the following exception while parsing XML generated from input strings:
org.xml.sax.SAXParseException: Zeichenreferenz "&#
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
I traced the problem down to an input string containing the character 0x1f, an invisible "UNIT SEPARATOR" control character: http://www.columbia.edu/kermit/ascii.html
I had to copy the input into a text file to make it visible.
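The offending character can also be located in code rather than by eye. The following is only a minimal sketch and not part of the original post: the class `XmlCharScanner` and its method names are hypothetical, and the validity check simply transcribes the `Char` production of the XML 1.0 specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; names are not from the original post.
final class XmlCharScanner {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Lists every invalid code point as "index N: U+XXXX" (N is the char index). */
    static List<String> findInvalid(String input) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!isValidXml10(cp)) {
                hits.add(String.format(Locale.ROOT, "index %d: U+%04X", i, cp));
            }
            // advance by 1 or 2 chars so surrogate pairs are read as one code point
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // (char) 31 is the UNIT SEPARATOR from the question
        System.out.println(findInvalid("<xml>" + (char) 31 + "value</xml>"));
        // prints [index 5: U+001F]
    }
}
```

Logging the result of `findInvalid` before parsing shows which positions would trip the parser, without having to paste the input into an editor.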
I tested the input string in other places and ran into similar problems, for example:
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: XML parsing: line 1, character 149, illegal xml character
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1632)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:602)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:524)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7418)
What would be the best way to strip such characters from an input string? Are there other characters that are problematic for XML and should be removed as well?
This is the solution I ended up with:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StringUtil {

    /** RegEx pattern of invalid XML 1.0 characters, ref: http://www.w3.org/TR/REC-xml/#charsets */
    private static final Pattern INVALID_XML_CHAR_PATTERN = Pattern
            .compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]"); //$NON-NLS-1$

    /**
     * Sanitizes the passed value for XML 1.0.
     *
     * @param input input value to sanitize
     * @return the sanitized value, or null if the input was null, empty or not changed
     */
    public static String sanitizeXmlChars(String input) {
        if (input == null || "".equals(input)) { //$NON-NLS-1$
            return null;
        }
        Matcher matcher = INVALID_XML_CHAR_PATTERN.matcher(input);
        if (matcher.find()) {
            // at least one invalid character found: remove all of them
            return matcher.replaceAll(""); //$NON-NLS-1$
        }
        return null;
    }
}
Inspired by: https://www.rgagnon.com/javadetails/java-sanitize-xml-string.html
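Because `sanitizeXmlChars` returns null for unchanged input, every caller has to fall back to the original string itself. A variant that always returns a usable string may be easier to call. The sketch below is an assumption about how that could look, not the code from the post: `XmlSanitizeExample`, `sanitizeOrOriginal` and `parses` are hypothetical names, and the pattern is repeated only so the block compiles on its own. The `parses` helper uses the JDK's `DocumentBuilder` to check that the cleaned string is accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; names are illustrative, not from the original post.
final class XmlSanitizeExample {

    // Same character class as the pattern above, repeated so the sketch compiles alone.
    private static final Pattern INVALID_XML_CHAR_PATTERN = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Returns the input without invalid XML 1.0 characters; null stays null. */
    static String sanitizeOrOriginal(String input) {
        if (input == null) {
            return null;
        }
        // replaceAll yields an equal string when nothing matches
        return INVALID_XML_CHAR_PATTERN.matcher(input).replaceAll("");
    }

    /** True if the JDK DOM parser accepts the string as a well-formed document. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // includes SAXParseException for illegal characters
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bad = "<xml>" + (char) 31 + "value</xml>";
        System.out.println(parses(bad));                     // rejected by the parser
        System.out.println(parses(sanitizeOrOriginal(bad))); // accepted after cleaning
    }
}
```

The trade-off is that the caller can no longer tell from the return value whether anything was removed; comparing the result with the input restores that information if it is needed.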
With a simple JUnit test:
// JUnit 4 imports (assumed; adjust for JUnit 5)
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

import org.junit.Test;

public class StringUtilTest {

    @Test
    public void sanitizeXmlChars() {
        String goodXml = "<xml>value'<sub><![CDATA[Inhaltää]]></sub></xml>"; //$NON-NLS-1$
        assertNull(StringUtil.sanitizeXmlChars(goodXml));

        // contains control character after <xml>
        String badXml = "<xml>" + (char) 31 + "value'<sub><![CDATA[Inhaltää]]></sub></xml>"; //$NON-NLS-1$
        String result = StringUtil.sanitizeXmlChars(badXml);
        assertEquals(goodXml, result);

        String goodText = "This is a Text.\nWith two lines."; //$NON-NLS-1$
        assertNull(StringUtil.sanitizeXmlChars(goodText));

        // contains control character after two
        badXml = "This is a Text.\nWith two " + (char) 31 + "lines."; //$NON-NLS-1$
        result = StringUtil.sanitizeXmlChars(badXml);
        assertEquals(goodText, result);

        goodText = "Text Text2"; //$NON-NLS-1$
        assertNull(StringUtil.sanitizeXmlChars(goodText));

        badXml = "Text "; //$NON-NLS-1$
        // append control characters e.g. 30=>Record Separator 31=>Unit Separator
        for (int i = 1; i <= 31; i++) {
            // skip valid control characters: Horizontal Tab, Line Feed, Carriage Return
            if (i == 9 || i == 10 || i == 13) {
                continue;
            }
            badXml += String.valueOf((char) i);
        }
        badXml += "Text2";
        result = StringUtil.sanitizeXmlChars(badXml);
        assertEquals(goodText, result);
    }
}
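The test above exercises control characters 1 to 31 only. The pattern also excludes a few other code points (NUL, U+FFFE, U+FFFF and unpaired surrogates) while keeping valid supplementary characters. The following sketch checks those cases with plain Java and no test framework; `XmlCharEdgeCases`, `strip` and `check` are hypothetical names, and the pattern is copied from the post so the block is self-contained.

```java
import java.util.regex.Pattern;

// Hypothetical edge-case checks; not part of the original test class.
final class XmlCharEdgeCases {

    // Copy of the pattern from the post, repeated so the sketch compiles alone.
    static final Pattern INVALID = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Removes every character that XML 1.0 does not allow. */
    static String strip(String s) {
        return INVALID.matcher(s).replaceAll("");
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        // U+1F600 is a valid supplementary code point (stored as a surrogate pair)
        String emoji = new String(Character.toChars(0x1F600));
        check(strip("a" + emoji + "b").equals("a" + emoji + "b"), "emoji must be kept");

        // NUL is not covered by the loop in the test above, which starts at 1
        check(strip("a\u0000b").equals("ab"), "NUL must be removed");

        // U+FFFE and U+FFFF lie outside the allowed BMP ranges
        check(strip("a\uFFFEb\uFFFFc").equals("abc"), "non-characters must be removed");

        // a high surrogate without its low partner is not a valid character
        check(strip("a\uD800b").equals("ab"), "lone surrogate must be removed");

        System.out.println("all edge cases passed");
    }
}
```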
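Alternative solution using a third-party library, e.g. Apache Commons Lang. Note that escapeXml10 removes characters that are invalid in XML 1.0 but also escapes the XML metacharacters (<, >, &, ", '), so it is suited to text values rather than to a complete XML document:
String cleanInput = StringEscapeUtils.escapeXml10(input);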