java, xml, malformed

How to detect "Invalid character found in text content"


I'm validating XML in Java with SAX, and I'd like to detect the following kind of error: "An invalid character was found in text content".

At the moment I validate with SAX, but for some documents the corrupted characters are not reported as errors. When I open the resulting XML file in Internet Explorer, for example, I get the error message "An invalid character was found in text content".

This is an example of XML data:

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<!DOCTYPE blabla SYSTEM 'blabla.dtd'>
<blabla type='type' num='num'>
<...>... corrupted character </...>
</blabla>

And this is how the parser is instantiated:

// Validating, namespace-aware SAX parser.
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
factory.setNamespaceAware(true);

// Validate against the versioned XSD stored under the configuration root.
parser = factory.newSAXParser();
parser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
parser.setProperty(JAXP_SCHEMA_SOURCE, new File(theConfig.getRoot()
        + File.separator + theConfig.getXsdFileName()
        + "-v" + theConfig.getXsdFileVersion() + XSD_EXTENSION));

reader = parser.getXMLReader();
reader.setErrorHandler(getHandler());
reader.setEntityResolver(new MyEntityResolver(theConfig.getRoot(), theConfig));

// The document is already a String here, so the parser reads characters, not bytes.
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(theDataToParse));
reader.parse(is);

The error handler implements the 'warning', 'error' and 'fatalError' methods, but nothing is reported. The entity resolver loads a custom entity file stored in a configuration directory.

Does anyone have an idea why this malformed-character error is not detected? Is it because my stream comes from a String and not from a file?

Thanks in advance for your help.

Regards.


Solution

  • Yes. Apparently the byte-to-character conversion has already happened, since you are holding the data as a String, so the parser never sees the original bytes and cannot tell that they were malformed. If you want to detect the invalid character, you need to let the parser read the bytes itself. In general, it's not a good idea to hold XML data as String data, because you risk corrupting it through incorrect character encoding. The best way to treat XML is as binary data.
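
A minimal sketch of the difference, assuming the JDK's default SAX parser. The class name ByteLevelXmlCheck, its helper methods, and the sample document are made up for illustration and do not come from the question. It parses the same broken input twice: once as raw bytes, and once after it has been decoded into a String, where the lenient decoder has already replaced the bad byte with U+FFFD (a legal XML character).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the question's code base.
final class ByteLevelXmlCheck {

    // Returns true if the parser accepts the source, false if it rejects it.
    private static boolean parses(InputSource source) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler rethrows fatal errors as SAXParseException.
            factory.newSAXParser().parse(source, new DefaultHandler());
            return true;
        } catch (SAXException | IOException e) {
            // Depending on the parser implementation, a malformed byte sequence
            // surfaces as a SAXParseException or as a CharConversionException
            // (a subclass of IOException), so both are treated as rejection.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // The parser decodes the bytes itself and can reject malformed sequences.
    static boolean parsesFromBytes(byte[] xml) {
        return parses(new InputSource(new ByteArrayInputStream(xml)));
    }

    // The bytes were decoded earlier; the parser only ever sees characters.
    static boolean parsesFromString(String xml) {
        return parses(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) {
        // Well-formed UTF-8 document.
        byte[] good = "<r>caf\u00e9</r>".getBytes(StandardCharsets.UTF_8);
        // Same text encoded as ISO-8859-1: the lone 0xE9 byte is not valid UTF-8.
        byte[] bad = "<r>caf\u00e9</r>".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("good bytes accepted:    " + parsesFromBytes(good));
        System.out.println("bad bytes accepted:     " + parsesFromBytes(bad));
        // new String(...) silently substitutes U+FFFD, hiding the corruption.
        String decoded = new String(bad, StandardCharsets.UTF_8);
        System.out.println("decoded String accepted: " + parsesFromString(decoded));
    }
}
```

In the question's code this would mean building the InputSource with setByteStream on the original file or network stream instead of setCharacterStream on a StringReader, assuming the original bytes are still available at that point.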