Tags: visual-c++, xhtml, html-parsing, msxml

Parsing online HTML pages using msxml/IXMLDOMDocument


Any good tutorial on parsing online HTML pages using msxml/IXMLDOMDocument?

I need to parse HTML pages using XPath expressions.

Some of the HTML pages will most probably not be 100% valid, so I need to configure the parser to be more "friendly", that is, less strict with such pages.

Any ideas?


Solution

  • You can clean up invalid HTML with Tidy or a Tidy wrapper library. Once the markup has been converted to XHTML, you can load it with MSXML and query it with XPath, as long as you specify the XHTML namespace.
    EfTidy is a good, up-to-date open source Tidy wrapper project for tidying up HTML.
    Here is an example written in VBScript that uses XPath to get the title of this question.

    'EfTidy constants
    Const XhtmlOut = 1
    Const DoctypeLoose = 3 'for transitional
    
    Dim EfTidy, sInvalidHTML, sValidHTML
    
    'Download the page; the third argument (False) makes the request synchronous,
    'so responseText is complete when .send returns
    With CreateObject("MSXML2.XMLHTTP.6.0")
        .open "GET", "http://stackoverflow.com/q/12027205/", False
        .send
        sInvalidHTML = .responseText
    End With
    
    'Convert the possibly invalid HTML to XHTML
    Set EfTidy = CreateObject("EfTidy.tidyCom")
    With EfTidy.Option 'config
        .Clean = True
        .OutputType = XhtmlOut
        .DoctypeMode = DoctypeLoose
    End With
    sValidHTML = EfTidy.TidyMemToMem(sInvalidHTML)
    
    'Load the XHTML; the DTD is allowed so that entities such as &nbsp; can be resolved
    With CreateObject("MSXML2.DomDocument.6.0")
        .async = False
        .validateOnParse = False
        .resolveExternals = True
        .setProperty "ProhibitDTD", False
        If .LoadXml(sValidHTML) Then
            'Bind the xhtml prefix; unprefixed names do not match XHTML elements
            .setProperty "SelectionLanguage", "XPath"
            .setProperty "SelectionNamespaces", "xmlns:xhtml='http://www.w3.org/1999/xhtml'"
            WScript.Echo .SelectSingleNode("//xhtml:div[@id='question-header']/xhtml:h1").Text
        Else
            WScript.Echo "Parse error: " & .parseError.reason
        End If
    End With
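
    The namespace step is the part that most often trips people up, so here is a minimal sketch of it in isolation. It assumes only MSXML 6 is installed; the inline XHTML string and the "Sample title" text are made up for illustration and stand in for the Tidy output. An unprefixed XPath such as //h1 should find nothing, because the elements live in the XHTML namespace, while the prefixed form should find the heading.

    ```vbscript
    ' Hypothetical sample document standing in for the tidied page
    Dim doc, sXhtml
    sXhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>" & _
             "<body><div id='question-header'><h1>Sample title</h1></div></body></html>"

    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False

    If doc.loadXML(sXhtml) Then
        ' Bind the prefix "xhtml" to the XHTML namespace for XPath queries
        doc.setProperty "SelectionNamespaces", "xmlns:xhtml='http://www.w3.org/1999/xhtml'"

        ' Unprefixed names only match elements that are in no namespace
        If doc.selectSingleNode("//h1") Is Nothing Then
            WScript.Echo "//h1 matched nothing"
        End If

        ' The prefixed name matches the element in the XHTML namespace
        WScript.Echo doc.selectSingleNode("//xhtml:h1").text
    Else
        WScript.Echo "Parse error: " & doc.parseError.reason
    End If
    ```

    The prefix itself is arbitrary (xhtml here, but x would work equally well); what matters is that the namespace URI in SelectionNamespaces matches the one declared in the document.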
    

    Hope it helps.