Please consider this kind of XHTML document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head></head>
<body>
<!--- Some comment with 3 dashes that causes parsing error --->
<!-- Rest of XHTML -->
</body>
</html>
and this partial VBScript code that I'm using to parse it:
With CreateObject("MSXML2.DOMDocument.6.0")
    .async = False
    .setProperty "ProhibitDTD", False
    .validateOnParse = False
    .setProperty "SelectionLanguage", "XPath"
    .setProperty "SelectionNamespaces", "xmlns:xhtml='http://www.w3.org/1999/xhtml'"
    .load(url)
End With
I get this error report:
Incorrect syntax was used in a comment
apparently because the comment is delimited with three dashes.
Does anyone know how to resolve this (without using an HTTP request and correcting the XHTML source myself)?
As the standard clearly states:
For compatibility, the string "--" (double-hyphen) MUST NOT occur within comments.
no decent parser should accept your 'XML' as well-formed. You could look for a more lenient parser (there are indications that some version of BeautifulSoup (3.08) may accept -- in comments), but the real solution is either to clean the data before calling .loadXML or, better, to take a big stick to the author.
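For the "clean the data first" option, here is a minimal VBScript sketch. SanitizeComments and the sample string are names made up for illustration, not part of MSXML. It assumes you already have the document text in a string; how you obtain it is left out. The regular expression would also rewrite comment-like text inside CDATA sections, so treat this as a starting point rather than a complete fix.

```vbscript
Option Explicit

' Hypothetical helper (not part of MSXML): rewrites every comment so that its
' body contains no "--" and does not end with "-", as XML 1.0 requires.
Function SanitizeComments(text)
    Dim reComment, reDashes, m, body, result, pos

    Set reComment = New RegExp
    reComment.Global = True
    reComment.Pattern = "<!--([\s\S]*?)-->"

    Set reDashes = New RegExp
    reDashes.Global = True
    reDashes.Pattern = "-{2,}"

    result = ""
    pos = 1   ' 1-based index of the next character not yet copied
    For Each m In reComment.Execute(text)
        ' Copy the text between the previous comment and this one unchanged.
        result = result & Mid(text, pos, m.FirstIndex + 1 - pos)
        ' Collapse each run of hyphens inside the comment to a single hyphen.
        body = reDashes.Replace(m.SubMatches(0), "-")
        ' A body ending in "-" would produce "--->" again, so pad it.
        If Right(body, 1) = "-" Then body = body & " "
        result = result & "<!--" & body & "-->"
        pos = m.FirstIndex + m.Length + 1
    Next
    SanitizeComments = result & Mid(text, pos)
End Function

' Usage with a made-up fragment that has the same three-dash comment.
Dim doc, sample
sample = "<root><!--- three dashes ---><child/></root>"
Set doc = CreateObject("MSXML2.DOMDocument.6.0")
doc.async = False
If doc.loadXML(SanitizeComments(sample)) Then
    WScript.Echo doc.documentElement.nodeName
Else
    WScript.Echo doc.parseError.reason
End If
```

Note that this changes the comment text (runs of hyphens are shortened), which is usually harmless, but the markup outside comments is copied through untouched.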