I am parsing and validating (against an XSD) a long XML file that is always well-formed, and reporting all validation problems.
My parser reports errors and continues, as it should, with one strange exception: when a parent node that consists of several child nodes fails validation on any of its children, parsing properly continues for all the children, but validation stops until the next parent node starts.
Consider this simple XSD:
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="customerDataFile">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="customerList"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="customerList">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="customerData" minOccurs="1" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="customerData">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="NameField1"/>
        <xsd:element ref="NameField2"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element type="name_field" name="NameField1"/>
  <xsd:element type="name_field" name="NameField2"/>
  <xsd:simpleType name="name_field">
    <xsd:restriction base="xsd:string">
      <xsd:maxLength value="45"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
and these five example documents:
<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
  <customerList>
    <customerData>
      <NameField1>Somecompany</NameField1>
      <NameField2>Somefirstname</NameField2>
    </customerData>
    <customerData>
      <NameField1>Somecompany</NameField1>
      <NameField2>Somefirstname</NameField2>
    </customerData>
  </customerList>
</customerDataFile>

<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
  <customerList>
    <customerData>
      <Unknown1>Somecompany</Unknown1>
      <NameField2>Somefirstname</NameField2>
    </customerData>
    <customerData>
      <Unknown1>Somecompany</Unknown1>
      <NameField2>Somefirstname</NameField2>
    </customerData>
  </customerList>
</customerDataFile>

<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
  <customerList>
    <customerData>
      <NameField1>Somecompany</NameField1>
      <Unknown2>Somefirstname</Unknown2>
    </customerData>
    <customerData>
      <NameField1>Somecompany</NameField1>
      <Unknown2>Somefirstname</Unknown2>
    </customerData>
  </customerList>
</customerDataFile>

<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
  <customerList>
    <customerData>
      <Unknown1>Somecompany</Unknown1>
      <Unknown2>Somefirstname</Unknown2>
    </customerData>
    <customerData>
      <Unknown1>Somecompany</Unknown1>
      <Unknown2>Somefirstname</Unknown2>
    </customerData>
  </customerList>
</customerDataFile>

<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
  <customerList>
    <customerData>
      <Unknown2>Somefirstname</Unknown2>
    </customerData>
    <customerData>
      <Unknown1>Somecompany</Unknown1>
    </customerData>
  </customerList>
</customerDataFile>
These produce the following output:
This is ridiculous; I could not find any reference for anything similar (and it does look like a major issue).
The relevant code is:
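public void process(String schemaLocation, String xmlLocation)
        throws SAXException, ParserConfigurationException, IOException {
    // Compile the XSD into a reusable Schema object.
    Source source = new StreamSource(new File(schemaLocation));
    SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema schema = schemaFactory.newSchema(source);
    // Build a namespace-aware SAX parser that validates against the schema.
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setSchema(schema);
    spf.setNamespaceAware(true);
    SAXParser saxParser = spf.newSAXParser();
    CustomerHandler handler = new CustomerHandler();
    CustomerErrorHandler errorHandler = new CustomerErrorHandler();
    // try-with-resources closes the stream even if parsing throws.
    try (InputStream inputStream = new FileInputStream(new File(xmlLocation))) {
        Reader reader = new InputStreamReader(inputStream, "UTF-8");
        InputSource is = new InputSource(reader);
        is.setEncoding("UTF-8");
        // SAXParser itself has no setContentHandler/setErrorHandler;
        // both handlers are registered on the underlying XMLReader.
        XMLReader xmlReader = saxParser.getXMLReader();
        xmlReader.setContentHandler(handler);
        xmlReader.setErrorHandler(errorHandler);
        xmlReader.parse(is);
    }
}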
where CustomerErrorHandler is simply:
public class CustomerErrorHandler implements ErrorHandler {
    @Override
    public void error(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }

    @Override
    public void warning(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }
}
Does anyone have any pointers on why this happens and what I am doing wrong? And, most importantly, how does one properly do full validation of an XML document if this approach does not work?
This is not really an answer; it is more of a long comment:
Continue-on-error is an extended feature and is not part of any standard. The exact implementation is surely available in the Xerces code base, but it may not be easy to figure out. At a minimum, what can be gathered from your tests above is that once Xerces encounters a validation error on an element, it ignores further validation errors until the end of that element (though I am sure it would still detect a well-formedness error; you could try). There is no point in validating that element any further, since it is already invalid w.r.t. the schema, so in effect Xerces skips the rest of the element, moves on to the next element, and resumes validation there. This would be the probable behaviour: as continue-on-error is not a standard, I guess the implementation has been done on a best-effort basis, so if something cannot be validated, ignore it and try to validate the next element.
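To check this behaviour in isolation, something like the following minimal sketch might help. It is not the code from the question: it uses javax.xml.validation.Validator instead of a SAXParser, a cut-down schema containing only customerData, and a made-up class name (ValidationSketch) with illustrative XSD, VALID and INVALID constants. The error handler collects every recoverable problem instead of stopping at the first one, and rethrows fatal errors, since a parser is not required to continue after a well-formedness error.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class; the name and constants are illustrative only.
final class ValidationSketch {

    // Cut-down version of the schema from the question (assumption: one root element).
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:element name='customerData'><xsd:complexType><xsd:sequence>"
        + "<xsd:element name='NameField1' type='xsd:string'/>"
        + "<xsd:element name='NameField2' type='xsd:string'/>"
        + "</xsd:sequence></xsd:complexType></xsd:element>"
        + "</xsd:schema>";

    static final String VALID =
        "<customerData><NameField1>a</NameField1><NameField2>b</NameField2></customerData>";

    static final String INVALID =
        "<customerData><Unknown1>a</Unknown1><NameField2>b</NameField2></customerData>";

    // Validates xml against xsd and returns every warning and error that was reported.
    static List<String> validate(String xsd, String xml) throws SAXException, IOException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        Validator validator = schema.newValidator();
        final List<String> problems = new ArrayList<>();
        validator.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                problems.add(format("warning", e));
            }

            @Override
            public void error(SAXParseException e) {
                // Recoverable: record it and let validation continue.
                problems.add(format("error", e));
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                // Not recoverable (e.g. not well-formed): stop here.
                throw e;
            }
        });
        validator.validate(new StreamSource(new StringReader(xml)));
        return problems;
    }

    // Prefixes the message with its severity and position in the input.
    static String format(String level, SAXParseException e) {
        return level + " at " + e.getLineNumber() + ":" + e.getColumnNumber()
            + " " + e.getMessage();
    }

    public static void main(String[] args) throws Exception {
        for (String problem : validate(XSD, INVALID)) {
            System.out.println(problem);
        }
    }
}
```

Running it with documents that have more than one bad child, and comparing the number of reported problems with the number of bad children, should show whether the validator really goes quiet for the rest of the parent element.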