Any good tutorial on parsing online HTML pages using msxml/IXMLDOMDocument?
I need to parse HTML pages using XPath expressions.
Most probably some of the HTML pages will not be 100% valid, so I need to configure the parser to be more "friendly", or less strict, for such pages.
Any ideas?
You can tidy up invalid HTML using Tidy or a Tidy wrapper library. After that, you can load the resulting XHTML with MSXML and query it with XPath, specifying the XHTML namespace.
EfTidy is a good, up-to-date, open source Tidy wrapper for cleaning up HTML.
Here is an example written in VBScript that uses XPath to get the title of this question.
'EfTidy constants
Const XhtmlOut = 1
Const DoctypeLoose = 3 'for transitional

Dim EfTidy, sInvalidHTML, sValidHTML

'Download the raw page; False makes the request synchronous,
'so responseText is complete when .send returns
With CreateObject("MSXML2.XMLHTTP.6.0")
    .open "GET", "http://stackoverflow.com/q/12027205/", False
    .send
    sInvalidHTML = .responseText
End With

'Convert the (possibly invalid) HTML to well-formed XHTML
Set EfTidy = CreateObject("EfTidy.tidyCom")
With EfTidy.Option 'config
    .Clean = True
    .OutputType = XhtmlOut
    .DoctypeMode = DoctypeLoose
End With
sValidHTML = EfTidy.TidyMemToMem(sInvalidHTML)

'Load the XHTML and query it with a namespace-aware XPath expression
With CreateObject("MSXML2.DomDocument.6.0")
    .async = False
    .validateOnParse = False
    .resolveExternals = True
    .setProperty "ProhibitDTD", False 'the tidied output carries a DOCTYPE
    If .LoadXml(sValidHTML) Then
        .setProperty "SelectionLanguage", "XPath"
        .setProperty "SelectionNamespaces", "xmlns:xhtml='http://www.w3.org/1999/xhtml'"
        WScript.Echo .SelectSingleNode("//xhtml:div[@id='question-header']/xhtml:h1").Text
    Else
        'Report why the document could not be parsed
        WScript.Echo "Parse error: " & .parseError.reason
    End If
End With
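The xhtml: prefix in the XPath expression matters because unprefixed names in XPath 1.0 only match elements that are in no namespace. Here is a minimal sketch of that behaviour which needs neither a network request nor EfTidy. The inline markup and the variable names (sXhtml, oDoc, oPlain, oPrefixed) are made up for illustration and are not taken from the real page.

```vbscript
' Minimal sketch: why the XPath expression needs the xhtml: prefix.
' The markup below is a hypothetical sample, not the real page.
Dim sXhtml, oDoc, oPlain, oPrefixed

sXhtml = "<html xmlns='http://www.w3.org/1999/xhtml'><body>" & _
         "<div id='question-header'><h1>Sample title</h1></div>" & _
         "</body></html>"

Set oDoc = CreateObject("MSXML2.DOMDocument.6.0")
oDoc.async = False

If oDoc.loadXML(sXhtml) Then
    ' Bind the xhtml prefix to the XHTML namespace for XPath queries
    oDoc.setProperty "SelectionNamespaces", "xmlns:xhtml='http://www.w3.org/1999/xhtml'"

    ' Unprefixed names match only elements in no namespace
    Set oPlain = oDoc.selectSingleNode("//div[@id='question-header']/h1")

    ' Prefixed names match elements in the XHTML namespace
    Set oPrefixed = oDoc.selectSingleNode("//xhtml:div[@id='question-header']/xhtml:h1")

    If oPlain Is Nothing Then
        WScript.Echo "Unprefixed XPath: no match"
    End If
    If Not oPrefixed Is Nothing Then
        WScript.Echo "Prefixed XPath: " & oPrefixed.text
    End If
Else
    WScript.Echo "Parse error: " & oDoc.parseError.reason
End If
```

Run it with cscript.exe. If a query against tidied XHTML unexpectedly returns Nothing, a missing namespace prefix is a likely cause worth checking first.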
Hope it helps.