Search code examples
vbaparsingexcelhtml-parsingweb-crawler

Parse HTML content in VBA


I have a question relating to HTML parsing. I have a website with some products and I would like to catch text within page into my current spreadsheet. This spreadsheet is quite big but contains ItemNbr in 3rd column, I expect the text in the 14th column and one row corresponds to one product (item).

My idea is to fetch the 'Material' on the webpage which is inside the Innertext after tag. The id number changes from one page to page (sometimes ).

Here is the structure of the website:

<div style="position:relative;">
    <div></div>
    <table id="list-table" width="100%" tabindex="1" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_list-table" class="ui-jqgrid-btable" style="width: 930px;">
        <tbody>
            <tr class="jqgfirstrow" role="row" style="height:auto">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="1" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="2" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="3" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="4" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="5" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="6" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td ...</td>
                <td ...</td>
            </tr>
            <tr role="row" id="7" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">
                <td role="gridcell" style="padding-left:10px" title="Material" aria-describedby="list-table_">Material</td>
                <td role="gridcell" style="" title="600D polyester." aria-describedby="list-table_">600D polyester.</td>
            </tr>           
            <tr ...>
            </tr>
        </tbody>
    </table> </div>

I would like to get "600D Polyester" as a result.

My (not working) code snippet is as is:

Sub ParseMaterial()

    Dim Cell As Integer
    Dim ItemNbr As String

    Dim AElement As Object
    Dim AElements As IHTMLElementCollection
Dim IE As MSXML2.XMLHTTP60
Set IE = New MSXML2.XMLHTTP60

Dim HTMLDoc As MSHTML.HTMLDocument
Dim HTMLBody As MSHTML.HTMLBody

Set HTMLDoc = New MSHTML.HTMLDocument
Set HTMLBody = HTMLDoc.body

For Cell = 1 To 5                            'I iterate through the file row by row

    ItemNbr = Cells(Cell, 3).Value           'ItemNbr isin the 3rd Column of my spreadsheet

    IE.Open "GET", "http://www.example.com/?item=" & ItemNbr, False
    IE.send

    While IE.ReadyState <> 4
        DoEvents
    Wend

    HTMLBody.innerHTML = IE.responseText

    Set AElements = HTMLDoc.getElementById("list-table").getElementsByTagName("tr")
    For Each AElement In AElements
        If AElement.Title = "Material" Then
            Cells(Cell, 14) = AElement.nextNode.value     'I write the material in the 14th column
        End If
    Next AElement

        Application.Wait (Now + TimeValue("0:00:2"))

Next Cell

Thanks for your help !


Solution

  • Just a couple things that hopefully will get you in the right direction:

    • clean up a bit: remove the readystate property testing loop. The value returned by the readystate property will never change in this context - code will pause after the send instruction, to resume only once the server response is received, or has failed to do so. The readystate property will be set accordingly, and the code will resume execution. You should still test for the ready state, but the loop is just unnecessary

    • target the right HTML elements: you are searching through the tr elements - while the logic of how you use these elements in your code actually looks to point to td elements

    • make sure the properties are actually available for the objects you are using them on: to help you with this, try and declare all your variable as specific objects instead of the generic Object. This will activate intellisense. If you have a difficult time finding the actual name of your object as defined in the relevant library in a first place, declare it as the generic Object, run your code, and then inspect the type of the object - by printing typename(your_object) to the debug window for instance. This should put you on your way

    I have also included some code below that may help. If you still can't get this to work and you can share your urls - plz do that.

    Sub getInfoWeb()
    
        Dim cell As Integer
        Dim xhr As MSXML2.XMLHTTP60
        Dim doc As MSHTML.HTMLDocument
        Dim table As MSHTML.HTMLTable
        Dim tableCells As MSHTML.IHTMLElementCollection
        
        Set xhr = New MSXML2.XMLHTTP60
       
        For cell = 1 To 5
            
            ItemNbr = Cells(cell, 3).Value
            
            With xhr
            
                .Open "GET", "http://www.example.com/?item=" & ItemNbr, False
                .send
                
                If .readyState = 4 And .Status = 200 Then
                    Set doc = New MSHTML.HTMLDocument
                    doc.body.innerHTML = .responseText
                Else
                    MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
                    vbNewLine & "HTTP request status: " & .Status
                End If
                
            End With
            
            Set table = doc.getElementById("list-table")
            Set tableCells = table.getElementsByTagName("td")
            
            For Each tableCell In tableCells
                If tableCell.getAttribute("title") = "Material" Then
                    Cells(cell, 14).Value = tableCell.NextSibling.innerHTML
                End If
            Next tableCell
        
        Next cell
        
    End Sub
    

    EDIT: as a follow-up to the further information you provided in the comment below - and the additional comments I have added

    'Determine your product number
        'Open an xhr for your source url, and retrieve the product number from there - search for the tag which
        'text include the "productnummer:" substring, and extract the product number from the outerstring
        'OR
        'if the product number consistently consists of the fctkeywords you are entering in your source url
        'with two "0" appended - just build the product number like that
    'Open an new xhr for this url "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=" & product_number & "&_search=false&rows=-1&page=1&sidx=&sord=asc"
    'Load the response in an XML document, and retrieve the material information
    
    Sub getInfoWeb()
    
        Dim xhr As MSXML2.XMLHTTP60
        Dim doc As MSXML2.DOMDocument60
        Dim xmlCell As MSXML2.IXMLDOMElement
        Dim xmlCells As MSXML2.IXMLDOMNodeList
        Dim materialValueElement As MSXML2.IXMLDOMElement
        
        Set xhr = New MSXML2.XMLHTTP60
            
            With xhr
                
                .Open "GET", "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=10031700&_search=false&rows=-1&page=1&sidx=&sord=asc", False
                .send
                
                If .readyState = 4 And .Status = 200 Then
                    Set doc = New MSXML2.DOMDocument60
                    doc.LoadXML .responseText
                Else
                    MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _
                    vbNewLine & "HTTP request status: " & .Status
                End If
                
            End With
            
            Set xmlCells = doc.getElementsByTagName("cell")
    
            For Each xmlCell In xmlCells
                If xmlCell.Text = "Materiaal" Then
                    Set materialValueElement = xmlCell.NextSibling
                End If
            Next
            
            MsgBox materialValueElement.Text
        
    End Sub
    

    EDIT2: an alternative automating IE

    Sub searchWebViaIE()
        Dim ie As SHDocVw.InternetExplorer
        Dim doc As MSHTML.HTMLDocument
        Dim anchors As MSHTML.IHTMLElementCollection
        Dim anchor As MSHTML.HTMLAnchorElement
        Dim prodSpec As MSHTML.HTMLAnchorElement
        Dim tableCells As MSHTML.IHTMLElementCollection
        Dim materialValueElement As MSHTML.HTMLTableCell
        Dim tableCell As MSHTML.HTMLTableCell
        
        Set ie = New SHDocVw.InternetExplorer
        
        With ie
            .navigate "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2facetmain.p?fctkeywords=100317&world=general#tabs-4"
            .Visible = True
            
            Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
                DoEvents
            Loop
            
            Set doc = .document
            
            Set anchors = doc.getElementsByTagName("a")
            
            For Each anchor In anchors
                If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
                    anchor.Click
                    Exit For
                End If
            Next anchor
            
            Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True
                DoEvents
            Loop
        
        End With
            
        For Each anchor In anchors
            If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then
                Set prodSpec = anchor
            End If
        Next anchor
        
        Set tableCells = doc.getElementById("list-table").getElementsByTagName("td")
        
        If Not tableCells Is Nothing Then
            For Each tableCell In tableCells
                If tableCell.innerHTML = "Materiaal" Then
                    Set materialValueElement = tableCell.NextSibling
                End If
            Next tableCell
        End If
        
        MsgBox materialValueElement.innerHTML
        
    End Sub