Search code examples
excelvbaweb-scrapingxmlhttprequest

Excel VBA Web Scraping - Ignore One of Multiple Tables via XML HTTP Request


I could really use some help figuring out a piece of a web scraping code I can't seem to get to work:

  • Short version of my question: Is there a way to write into a XML HTTP Request code to ignore a table on a webpage?

Long version of my question: The page has 10 tables of football players (some with a couple rows, some with several...each "small" table represents a tier). The last table on the page - with table id = "table_10" - is a large, comprehensive table of all positions...not just quarterbacks (which the page and smaller tables are dedicated to)

With the below code, I only get "table_10" in my Excel sheet:

Option Explicit

Sub ETR_QB_Tiers_XMLHTTP()

   Dim XMLPage As New MSXML2.XMLHTTP60
   Dim HTMLDoc As New MSHTML.HTMLDocument

   XMLPage.Open "GET", "https://establishtherun.com/2020-tiers-of-evan-quarterbacks/", False
   XMLPage.send

   If XMLPage.Status <> 200 Then
      MsgBox XMLPage.Status & " - " & XMLPage.statusText
      Exit Sub
   End If

   HTMLDoc.body.innerHTML = XMLPage.responseText

   ProcessHTMLPage HTMLDoc

End Sub

Option Explicit

Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

   Dim HTMLTable As MSHTML.IHTMLElement
   Dim HTMLTables As MSHTML.IHTMLElementCollection
   Dim HTMLRow As MSHTML.IHTMLElement
   Dim HTMLCell As MSHTML.IHTMLElement
   Dim RowNum As Long, ColNum As Integer

   Set HTMLTables = HTMLPage.getElementsByTagName("table")

   For Each HTMLTable In HTMLTables
      'Debug.Print HTMLTable.ID

      Sheets("XMLHTTP").Select

      RowNum = 1
      For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
         'Debug.Print vbTab & HTMLRow.innerText

         ColNum = 1
         For Each HTMLCell In HTMLRow.Children
            'Debug.Print vbTab & HTMLCell.innerText
            Cells(RowNum, ColNum) = HTMLCell.innerText
            ColNum = ColNum + 1
         Next HTMLCell

         RowNum = RowNum + 1

      Next HTML Row

   Next HTMLTable

End Sub

When I set the code to Debug.Print HTMLTable.ID with the For Each HTMLTable In HTMLTables line of the ProcessHTMLPage sub, I am presented all 10 table IDs in the Immediate Window:

table_1
table_2
table_3
. . .
table_10

When I set the code to Debug.Print vbTab & HTMLRow.innertext with the For Each HTMLTable In HTMLTables line, I am presented results for both smaller tables (tables 1 through 9) and the large table (table 10) in the Immediate Window:

table_1
   TierOne
   Patrick Mahomes (QB1)Lamar Jackson (QB2)
table_2
   TierTwo
   Dak Prescott (QB3)Josh Allen (QB4)
   Deshaun Watson (QB5)Russell Wilson (QB6)
   Kyler Murray (QB7)
. . .
table_10
   RankWRRBTEQB
   1Michael Thomas (1)Christian McCaffrey (1)Travis Kelce (1)Patrick Mahomes (1)
   2Davante Adams (1)Ezekiel Elliott (1)George Kittle (1)Lamar Jackson (1)
   3Tyreek Hill (1)Saquon Barkley (1)Zach Ertz (1)Dak Prescott (2)
   ...

So - I know those "smaller" table are there and accessible, but the code spits out only the comprehensive "table_10" (below) while I'm really wanting the separate tables 1 through 9 - not table 10 at all:

Again...is there a way to ignore "table_10" and ensure I'm given tables 1 through 9 (instead of just "table_10")? I've tried to incorporate 'If' statements so many ways now that I've lost count.


Bonus question - the "smaller" tables are set up in a Z-pattern (i.e. - cell A1 is Player #1 > cell B1 is Player #2 > cell A2 is Player #3 > cell B2 is Player #4, etc.). Is there a way to get players in Column B to come over to Column A in their ranked order? Basically, convert two columns to one?


Solution

  • I think you are overwriting the tables so you only see the last table (being the largest it likely overwrites everything from before).

    Try moving RowNum = 1 to before the loop over tables otherwise I think you reset for each table and overwrite.

    You might also want to add an r + 1 before Next HTMLTable so as to have some whitespace between tables.

    Here is a trivial example to demonstrate:

    Your logic:

    Option Explicit
    
    Public Sub Demo_XMLHTTP()
    
       Dim XMLPage As New MSXML2.XMLHTTP60
       Dim HTMLDoc As New MSHTML.HTMLDocument
    
       XMLPage.Open "GET", "https://www.w3schools.com/html/html_tables.asp", False
       XMLPage.send
    
       If XMLPage.Status <> 200 Then
          MsgBox XMLPage.Status & " - " & XMLPage.statusText
          Exit Sub
       End If
    
       HTMLDoc.body.innerHTML = XMLPage.responseText
    
       ProcessHTMLPage HTMLDoc
    
    End Sub
    
    
    Public Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)
    
       Dim HTMLTable As MSHTML.IHTMLElement
       Dim HTMLTables()
       Dim HTMLRow As MSHTML.IHTMLElement
       Dim HTMLCell As MSHTML.IHTMLElement
       Dim RowNum As Long, ColNum As Long
       Dim hTable As Variant
    
       HTMLTables = Array(1, 2, 3)
    
       For Each hTable In HTMLTables
    
          Set HTMLTable = HTMLPage.getElementById("customers") '<== yeah same table but imagine it is different
    
          ThisWorkbook.Worksheets("XMLHTTP").Select
    
          RowNum = 1
    
          With ActiveSheet
    
          For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
    
             ColNum = 1
    
             For Each HTMLCell In HTMLRow.Children
                .Cells(RowNum, ColNum) = HTMLCell.innerText
                ColNum = ColNum + 1
             Next HTMLCell
    
             RowNum = RowNum + 1
    
          Next HTMLRow
          Set HTMLTable = Nothing
          .Cells(RowNum, ColNum + 1) = hTable '< note which iteration we are viewing
          End With
    
       Next hTable
    
    End Sub
    

    Versus:

    Option Explicit
    
    Public Sub Demo_XMLHTTP()
    
       Dim XMLPage As New MSXML2.XMLHTTP60
       Dim HTMLDoc As New MSHTML.HTMLDocument
    
       XMLPage.Open "GET", "https://www.w3schools.com/html/html_tables.asp", False
       XMLPage.send
    
       If XMLPage.Status <> 200 Then
          MsgBox XMLPage.Status & " - " & XMLPage.statusText
          Exit Sub
       End If
    
       HTMLDoc.body.innerHTML = XMLPage.responseText
    
       ProcessHTMLPage HTMLDoc
    
    End Sub
    
    
    Public Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)
    
       Dim HTMLTable As MSHTML.IHTMLElement
       Dim HTMLTables()
       Dim HTMLRow As MSHTML.IHTMLElement
       Dim HTMLCell As MSHTML.IHTMLElement
       Dim RowNum As Long, ColNum As Long
       Dim hTable As Variant
    
       HTMLTables = Array(1, 2, 3)
    
       RowNum = 1
    
       For Each hTable In HTMLTables
    
          Set HTMLTable = HTMLPage.getElementById("customers") '<== yeah same table but imagine it is different
    
          ThisWorkbook.Worksheets("XMLHTTP").Select
    
          With ActiveSheet
    
          For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
    
             ColNum = 1
    
             For Each HTMLCell In HTMLRow.Children
                .Cells(RowNum, ColNum) = HTMLCell.innerText
                ColNum = ColNum + 1
             Next HTMLCell
    
             RowNum = RowNum + 1
    
          Next HTMLRow
          Set HTMLTable = Nothing
          .Cells(RowNum, ColNum + 1) = hTable '< note which iteration we are viewing
          End With
    
       Next hTable
    
    End Sub
    

    Ignore table 10:

    You can use a For i = 0 To HTMLTables.Length - 2, rather than For Each, to ignore last table. Access any given table with HTMLTables.item(i). Otherwise, you could test the id and ignore based on that, or even based on index (remember to -1). I would probably use id as more reliable. Normally, you would run to .Length-1.


    Bonus:

    I can't run against your test page but if you do a For Loop, you can adjust the column to write out to, depending on whether i is odd or even (use MOD for example); Odd number MOD 2 = 1; Even MOD 2 = 0 then adjust ColNum with -1, or as appropriate.