Given this example schema ("big.xsd"):
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="root">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="A"/>
<xsd:element name="B"/>
<xsd:element name="C1" minOccurs="0"/>
<xsd:element name="C2" minOccurs="0"/>
<xsd:element name="C3" minOccurs="0"/>
<xsd:element name="C4" minOccurs="0"/>
<xsd:element name="C5" minOccurs="0"/>
<xsd:element name="C6" minOccurs="0"/>
<xsd:element name="C7" minOccurs="0"/>
<xsd:element name="C8" minOccurs="0"/>
<xsd:element name="C9" minOccurs="0"/>
<xsd:element name="C10" minOccurs="0"/>
<xsd:element name="C11" minOccurs="0"/>
<xsd:element name="C12" minOccurs="0"/>
<xsd:element name="C13" minOccurs="0"/>
<xsd:element name="C14" minOccurs="0"/>
<xsd:element name="C15" minOccurs="0"/>
<xsd:element name="D"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
and this example document ("big.xml"):
<?xml version="1.0" ?>
<root>
<A/>
<B/>
</root>
Validating the document against the schema with lxml reports only the first ten "missing" children (line break inserted for readability):
>>> from lxml import etree
>>> schema_doc = etree.parse('big.xsd')
>>> schema = etree.XMLSchema(schema_doc)
>>>
>>> doc = etree.parse('big.xml')
>>> schema.assertValid(doc)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/lxml/etree.pyx", line 3643, in lxml.etree._Validator.assertValid
lxml.etree.DocumentInvalid: Element 'root': Missing child element(s).
Expected is one of ( C1, C2, C3, C4, C5, C6, C7, C8, C9, C10 )., line 2
This is consistent with xmllint's output, since I believe lxml delegates validation to libxml2 (line break inserted for readability):
$ xmllint --noout --schema big.xsd big.xml
big.xml:2: element root: Schemas validity error : Element 'root':
Missing child element(s). Expected is one of ( C1, C2, C3, C4, C5, C6, C7, C8, C9, C10 ).
big.xml fails to validate
Is there a way to make lxml report all the missing children, in particular the D
element which is required to conform to the schema?
Notes
"Is there a way to make lxml report all the missing children?"
I don't know, but I think it's very unlikely that a schema processor would be customisable in this way.
I thought I would try this one on Saxon. It outputs:
Validation error on line 5 column 8 of test.xml: FORG0001: In content of element <root>: The content is incomplete. It would be valid if followed by <Q{}D>.
Not a perfect error message (I wonder how widely users understand the notation <Q{}D>, for example), but it seems to capture what you are looking for.
Saxon goes to a lot of effort to analyse the situation. It gets to the end of the list of children, and finds that the state of the finite state machine is not a legitimate "final state". Rather than just reporting this blandly, it looks at all the possible transitions from this state to see if there is one that would lead to a legitimate final state, and finds that there is only one, namely a D element. On this particular occasion, that strategy works well. libxml2, by contrast, contents itself with listing the elements that could have occurred next, and truncating that list so it doesn't get ridiculously long.
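To make the difference between the two reporting strategies concrete, here is a minimal sketch in plain Python. It is not how Saxon or libxml2 are actually implemented. It assumes a hypothetical, much simplified content model: a flat sequence of (name, min_occurs) pairs, each occurring at most once, mirroring big.xsd. The names MODEL, position_after, expected_next and required_to_finish are made up for illustration, and the limit of 10 is chosen only to mirror the output shown in the question.

```python
# Hypothetical model of the flat xsd:sequence in big.xsd:
# (element name, min_occurs) pairs, each allowed at most once.
MODEL = [("A", 1), ("B", 1)] + [(f"C{i}", 0) for i in range(1, 16)] + [("D", 1)]


def position_after(model, children):
    """Consume the children and return the index of the next unconsumed particle."""
    pos = 0
    for child in children:
        # Skip optional particles until one matches the child.
        while pos < len(model) and model[pos][0] != child:
            if model[pos][1] > 0:
                raise ValueError(f"unexpected {child!r}, expected {model[pos][0]!r}")
            pos += 1
        if pos == len(model):
            raise ValueError(f"unexpected {child!r} at end of content")
        pos += 1
    return pos


def expected_next(model, pos, limit=10):
    """List the names that could legally occur next, truncated to `limit` entries."""
    names = []
    for name, min_occurs in model[pos:]:
        names.append(name)
        if min_occurs > 0:
            break  # nothing beyond a required particle can come next
    return names[:limit]


def required_to_finish(model, pos):
    """List the names that must still appear to reach a valid final state."""
    return [name for name, min_occurs in model[pos:] if min_occurs > 0]


pos = position_after(MODEL, ["A", "B"])
print(expected_next(MODEL, pos))       # C1 .. C10; C11 .. C15 and D are cut off
print(required_to_finish(MODEL, pos))  # ['D']
```

With this toy model, the first list reproduces the truncated "expected is one of" style of message, while the second corresponds to the "would be valid if followed by D" style. Real content models (choices, repetition, nesting) make the second computation considerably harder.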
In general, it's fairly easy for a validator to work out that the content is invalid; it's much harder to explain what's wrong. Explaining essentially means finding the minimum or most likely change to the document that would turn it from an invalid document into a valid one, and no strategy is going to do that successfully all of the time.
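For the specific schema in the question, one possible workaround is a separate pre-check outside the validator. The following is only a sketch under strong assumptions: the element is declared globally with an anonymous complexType containing a single flat xsd:sequence of local element declarations, as in big.xsd. It uses only the standard library's xml.etree.ElementTree, and the helper names required_children and missing_required are hypothetical. It does not check ordering and is no substitute for real validation.

```python
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"


def required_children(schema_root, element_name):
    """Names of children with minOccurs >= 1 in the flat sequence of `element_name`.

    Assumption: a global element with an anonymous complexType holding one
    xsd:sequence of local declarations. Choices, groups, refs and named
    types are not handled.
    """
    path = (f"{XSD_NS}element[@name='{element_name}']/"
            f"{XSD_NS}complexType/{XSD_NS}sequence/{XSD_NS}element")
    return [decl.get("name") for decl in schema_root.findall(path)
            if int(decl.get("minOccurs", "1")) >= 1]


def missing_required(schema_root, instance_root):
    """Required child names that do not appear under the instance root."""
    present = {child.tag for child in instance_root}
    return [name for name in required_children(schema_root, instance_root.tag)
            if name not in present]


# Inline stand-ins for big.xsd and big.xml so the sketch is self-contained.
optional = "".join(f'<xsd:element name="C{i}" minOccurs="0"/>' for i in range(1, 16))
BIG_XSD = f"""<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="root"><xsd:complexType><xsd:sequence>
<xsd:element name="A"/><xsd:element name="B"/>{optional}<xsd:element name="D"/>
</xsd:sequence></xsd:complexType></xsd:element></xsd:schema>"""
BIG_XML = "<root><A/><B/></root>"

print(missing_required(ET.fromstring(BIG_XSD), ET.fromstring(BIG_XML)))  # ['D']
```

Running the real validator first and this check second would give both the standard error message and the list of absent required children, at the cost of duplicating a small part of the schema logic.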