Parsing online HTML pages using msxml/IXMLDOMDocument

Is there a good tutorial on parsing online HTML pages using msxml/IXMLDOMDocument?

I need to query HTML pages using XPath expressions.

Some of the pages will most likely not be 100% valid, so I need to configure the parser to be more lenient with them.

Any ideas?

Solution

You can clean up invalid HTML with Tidy or a Tidy wrapper library. Once the markup has been converted to well-formed XHTML, you can load it with MSXML and query it with XPath, as long as you register the XHTML namespace.
EfTidy is a good, up-to-date, open source Tidy wrapper for cleaning up HTML.
Here is an example written in VBScript that uses XPath to get the title of this question.

'EfTidy constants
Const XhtmlOut = 1
Const DoctypeLoose = 3 'for transitional

Dim EfTidy, sInvalidHTML, sValidHTML

With CreateObject("MSXML2.XMLHTTP.6.0")
    'Third argument False = synchronous, so responseText is complete after .send
    .open "GET", "http://stackoverflow.com/q/12027205/", False
    .send
    sInvalidHTML = .responseText
End With

Set EfTidy = CreateObject("EfTidy.tidyCom")
With EfTidy.Option 'config
    .Clean = True
    .OutputType = XhtmlOut
    .DoctypeMode = DoctypeLoose
End With
sValidHTML = EfTidy.TidyMemToMem(sInvalidHTML)

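With CreateObject("MSXML2.DomDocument.6.0")
    .async = False
    .validateOnParse = False
    'Allow the XHTML DOCTYPE and let MSXML load the DTD, so that named
    'entities declared there (for example &nbsp;) can be resolved
    .resolveExternals = True
    .setProperty "ProhibitDTD", False
    If .LoadXml(sValidHTML) Then
        .setProperty "SelectionLanguage", "XPath"
        'XPath 1.0 has no default namespace, so bind a prefix to the XHTML namespace
        .setProperty "SelectionNamespaces", "xmlns:xhtml='http://www.w3.org/1999/xhtml'"
        WScript.Echo .SelectSingleNode("//xhtml:div[@id='question-header']/xhtml:h1").Text
    Else
        'Report why the tidied markup still failed to load
        WScript.Echo "Parse error: " & .parseError.reason
    End If
End With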
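
The xhtml: prefix in the XPath above is easy to overlook. XPath 1.0 has no concept of a default namespace, so an unprefixed step such as //h1 does not match elements that live in the XHTML namespace. The following is a minimal sketch of that behaviour using a small hand-written XHTML string (the sample markup and the variable names sXhtml, doc and node are made up for illustration), so it can be tried without network access or EfTidy. It also shows selectNodes for collecting several matches.

```vbscript
'Hypothetical sample markup, standing in for the tidied page
Dim sXhtml, doc, node
sXhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>" & _
         "<head><title>Sample</title></head>" & _
         "<body><div id='question-header'><h1>Sample title</h1></div>" & _
         "<p><a href='/a'>A</a> <a href='/b'>B</a></p></body></html>"

Set doc = CreateObject("MSXML2.DOMDocument.6.0")
doc.async = False

If doc.loadXML(sXhtml) Then
    'Bind the prefix "xhtml" to the XHTML namespace for XPath queries
    doc.setProperty "SelectionNamespaces", "xmlns:xhtml='http://www.w3.org/1999/xhtml'"

    'An unprefixed name looks for elements in no namespace, so nothing matches
    If doc.selectSingleNode("//h1") Is Nothing Then
        WScript.Echo "//h1 matched nothing"
    End If

    'The prefixed name finds the heading
    WScript.Echo doc.selectSingleNode("//xhtml:div[@id='question-header']/xhtml:h1").text

    'selectNodes returns every match; here, the href of each link
    For Each node In doc.selectNodes("//xhtml:a")
        WScript.Echo node.getAttribute("href")
    Next
Else
    WScript.Echo "Parse error: " & doc.parseError.reason
End If
```

Save it as a .vbs file and run it with cscript. The prefix name itself is arbitrary; only the namespace URI it is bound to has to match the one in the document.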

Hope it helps.