I've encountered a problem using the Xerces DOM library.
When you add comments to the XML tree like this:
DOMDocument* doc = impl->createDocument(0, L"root", 0);
DOMElement* root = doc->getDocumentElement();
DOMComment* com1 = doc->createComment(L"SetA -- DataA");
DOMComment* com2 = doc->createComment(L"SetB -- DataB");
doc->insertBefore(com1, root);
doc->insertBefore(com2, root);
That will create the following xml-tree:
<?xml version="1.0" encoding="UTF-8" standalone="false"?>
<!--SetA -- DataA-->
<!--SetB -- DataB-->
<root/>
which is indeed not well-formed XML, because "--" may not appear inside a comment.
The same can be done with processing instructions by using ?> as data:
DOMProcessingInstruction* procInstr = doc->createProcessingInstruction(L"target", L"?>");
My question:
Is there a way I can configure Xerces to not create this kind of comment, or do I have to check for these things myself?
And my other question: Why isn't it possible to just always escape characters like <>&'", even in comments and processing instructions, in order to avoid this kind of problem?
A DOMDocument is not an XML document. It is supposed to represent one, but it is conceivable that a valid DOM may not be serializable into a valid XML document (the converse should be less likely). Indeed this appears to be the case here:
Neither the Level 1 nor the Level 2 spec says anything about this, but the Level 3 DOM specification added this sentence about the DOMComment interface:
No lexical check is done on the content of a comment and it is therefore possible to have the character sequence "--" (double-hyphen) in the content, which is illegal in a comment per section 2.5 of [XML 1.0]. The presence of this character sequence must generate a fatal error during serialization.
So Xerces is operating within the DOM Level 3 specification even if it accepts a comment with '--' in it, as long as it bombs if you go to serialize it.
Not a great situation, but it makes sense because DOM was originally intended to represent XML documents that have been read in, not to create new ones. So it is liberal in what it can represent. That is fine for reading, since a DOMComment can represent anything (and more) that an XML document can, but it is a bit annoying that it doesn't catch the invalid string when you call createComment().
Checking DOMDocumentImpl.cpp we see:
DOMComment *DOMDocumentImpl::createComment(const XMLCh *data)
{
return new (this, DOMMemoryManager::COMMENT_OBJECT) DOMCommentImpl(this, data);
}
And in DOMCommentImpl.cpp we have just:
DOMCommentImpl::DOMCommentImpl(DOMDocument *ownerDoc, const XMLCh *dat)
: fNode(ownerDoc), fCharacterData(ownerDoc, dat)
{
fNode.setIsLeafNode(true);
}
Finally we see in DOMCharacterDataImpl.cpp that there is no chance of validation up front. It just saves the user-provided string without checking it:
DOMCharacterDataImpl::DOMCharacterDataImpl(DOMDocument *doc, const XMLCh *dat)
{
fDoc = (DOMDocumentImpl*)doc;
XMLSize_t len=XMLString::stringLen(dat);
fDataBuf = fDoc->popBuffer(len+1);
if (!fDataBuf)
fDataBuf = new (fDoc) DOMBuffer(fDoc, len+15);
fDataBuf->set(dat, len);
}
Sadly, no: Xerces does not have an option or even a nice hook to check this for you. And because the Level 3 spec seems to demand that "No lexical check is done", it probably isn't even legal to add one.
Your second question is simpler to answer: because that's the way they wanted it defined. See the XML 1.1 spec for example:
Comments
[15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
It is similar for PIs.
The grammar simply does not allow for escapes. Seems about right: baroque and broke.
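To make the production concrete, here is a minimal sketch of what it implies for the data you hand to the DOM. Production [15] boils down to two rules: the content must not contain "--" and must not end with '-'. For PI data, the equivalent restriction is that it must not contain "?>". The function names isLegalCommentData and isLegalPIData are made up for this example and are not part of Xerces. They are templates so they should work with whatever character type XMLCh happens to be on your platform, and they do not check that each character is a legal XML Char.

```cpp
// Hypothetical helpers, not part of Xerces.

// True if `data` (null-terminated) matches
//   ((Char - '-') | ('-' (Char - '-')))*
// from production [15]: no "--" anywhere and no trailing '-'.
template <typename CharT>
bool isLegalCommentData(const CharT* data)
{
    if (!data)
        return true; // assumption: treat a null pointer as empty content
    bool prevHyphen = false;
    for (const CharT* p = data; *p != CharT(0); ++p) {
        const bool hyphen = (*p == CharT('-'));
        if (hyphen && prevHyphen)
            return false; // found "--"
        prevHyphen = hyphen;
    }
    return !prevHyphen; // a trailing '-' would produce "--->"
}

// True if `data` (null-terminated) does not contain "?>",
// which would terminate the processing instruction early.
template <typename CharT>
bool isLegalPIData(const CharT* data)
{
    if (!data)
        return true; // assumption: treat a null pointer as empty content
    for (const CharT* p = data; *p != CharT(0); ++p) {
        // p[1] is at worst the terminating null, so this read is in bounds.
        if (p[0] == CharT('?') && p[1] == CharT('>'))
            return false;
    }
    return true;
}
```

With these, "SetA - DataA" passes the comment check, while "SetA -- DataA" and the PI data "?>" from the question are rejected.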
Maybe there is a way to catch the error on serialization or normalization, but I wasn't able to confirm whether Xerces 3.1 can. To be safe I think the best way is to wrap createComment() and check for it before creating the node, or walk the tree and check it yourself.
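A wrapper along those lines could look roughly like the sketch below. The names createCheckedComment and hasIllegalCommentData are hypothetical and not part of Xerces. The wrapper is a template over the document type, so it only assumes that doc->createComment(data) compiles, which is the case for a DOMDocument* when data is a const XMLCh*. Throwing std::invalid_argument is my choice for the example; you may prefer to return a null pointer or sanitize the string instead.

```cpp
#include <stdexcept>

// Hypothetical helper: true if `data` contains "--" or ends with '-',
// either of which makes the serialized comment ill-formed.
template <typename CharT>
bool hasIllegalCommentData(const CharT* data)
{
    if (!data)
        return false; // assumption: treat a null pointer as empty content
    bool prevHyphen = false;
    for (const CharT* p = data; *p != CharT(0); ++p) {
        const bool hyphen = (*p == CharT('-'));
        if (hyphen && prevHyphen)
            return true;
        prevHyphen = hyphen;
    }
    return prevHyphen;
}

// Hypothetical wrapper: checks the data first, then forwards to the
// document's own createComment(). Doc is any type providing
// createComment(const CharT*), for example a Xerces DOMDocument.
template <typename Doc, typename CharT>
auto createCheckedComment(Doc* doc, const CharT* data)
    -> decltype(doc->createComment(data))
{
    if (hasIllegalCommentData(data))
        throw std::invalid_argument(
            "comment data contains \"--\" or ends with '-'");
    return doc->createComment(data);
}
```

With the code from the question, createCheckedComment(doc, L"SetA -- DataA") would throw before any node is created, assuming XMLCh matches the literal type as it does in the question. A processing instruction wrapper would follow the same pattern, rejecting data that contains "?>".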