I could really use some help figuring out a piece of a web scraping code I can't seem to get to work:
Long version of my question: The page has 10 tables of football players (some with a couple rows, some with several...each "small" table represents a tier). The last table on the page - with table id = "table_10" - is a large, comprehensive table of all positions...not just quarterbacks (which the page and smaller tables are dedicated to)
With the below code, I only get "table_10" in my Excel sheet:
Option Explicit
Sub ETR_QB_Tiers_XMLHTTP()
Dim XMLPage As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLPage.Open "GET", "https://establishtherun.com/2020-tiers-of-evan-quarterbacks/", False
XMLPage.send
If XMLPage.Status <> 200 Then
MsgBox XMLPage.Status & " - " & XMLPage.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLPage.responseText
ProcessHTMLPage HTMLDoc
End Sub
Option Explicit
Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)
Dim HTMLTable As MSHTML.IHTMLElement
Dim HTMLTables As MSHTML.IHTMLElementCollection
Dim HTMLRow As MSHTML.IHTMLElement
Dim HTMLCell As MSHTML.IHTMLElement
Dim RowNum As Long, ColNum As Integer
Set HTMLTables = HTMLPage.getElementsByTagName("table")
For Each HTMLTable In HTMLTables
'Debug.Print HTMLTable.ID
Sheets("XMLHTTP").Select
RowNum = 1
For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
'Debug.Print vbTab & HTMLRow.innerText
ColNum = 1
For Each HTMLCell In HTMLRow.Children
'Debug.Print vbTab & HTMLCell.innerText
Cells(RowNum, ColNum) = HTMLCell.innerText
ColNum = ColNum + 1
Next HTMLCell
RowNum = RowNum + 1
Next HTML Row
Next HTMLTable
End Sub
When I set the code to Debug.Print HTMLTable.ID
with the For Each HTMLTable In HTMLTables
line of the ProcessHTMLPage sub, I am presented all 10 table IDs in the Immediate Window:
table_1
table_2
table_3
. . .
table_10
When I set the code to Debug.Print vbTab & HTMLRow.innertext
with the For Each HTMLTable In HTMLTables
line, I am presented results for both smaller tables (tables 1 through 9) and the large table (table 10) in the Immediate Window:
table_1
TierOne
Patrick Mahomes (QB1)Lamar Jackson (QB2)
table_2
TierTwo
Dak Prescott (QB3)Josh Allen (QB4)
Deshaun Watson (QB5)Russell Wilson (QB6)
Kyler Murray (QB7)
. . .
table_10
RankWRRBTEQB
1Michael Thomas (1)Christian McCaffrey (1)Travis Kelce (1)Patrick Mahomes (1)
2Davante Adams (1)Ezekiel Elliott (1)George Kittle (1)Lamar Jackson (1)
3Tyreek Hill (1)Saquon Barkley (1)Zach Ertz (1)Dak Prescott (2)
...
So - I know those "smaller" table are there and accessible, but the code spits out only the comprehensive "table_10" (below) while I'm really wanting the separate tables 1 through 9 - not table 10 at all:
Again...is there a way to ignore "table_10" and ensure I'm given tables 1 through 9 (instead of just "table_10")? I've tried to incorporate 'If' statements so many ways now that I've lost count.
Bonus question - the "smaller" tables are set up in a Z-pattern (i.e. - cell A1 is Player #1 > cell B1 is Player #2 > cell A2 is Player #3 > cell B2 is Player #4, etc.). Is there a way to get players in Column B to come over to Column A in their ranked order? Basically, convert two columns to one?
I think you are overwriting the tables so you only see the last table (being the largest it likely overwrites everything from before).
Try moving RowNum = 1
to before the loop over tables otherwise I think you reset for each table and overwrite.
You might also want to add an r + 1 before Next HTMLTable
so as to have some whitespace between tables.
Here is a trivial example to demonstrate:
Your logic:
Option Explicit
Public Sub Demo_XMLHTTP()
Dim XMLPage As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLPage.Open "GET", "https://www.w3schools.com/html/html_tables.asp", False
XMLPage.send
If XMLPage.Status <> 200 Then
MsgBox XMLPage.Status & " - " & XMLPage.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLPage.responseText
ProcessHTMLPage HTMLDoc
End Sub
Public Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)
Dim HTMLTable As MSHTML.IHTMLElement
Dim HTMLTables()
Dim HTMLRow As MSHTML.IHTMLElement
Dim HTMLCell As MSHTML.IHTMLElement
Dim RowNum As Long, ColNum As Long
Dim hTable As Variant
HTMLTables = Array(1, 2, 3)
For Each hTable In HTMLTables
Set HTMLTable = HTMLPage.getElementById("customers") '<== yeah same table but imagine it is different
ThisWorkbook.Worksheets("XMLHTTP").Select
RowNum = 1
With ActiveSheet
For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
ColNum = 1
For Each HTMLCell In HTMLRow.Children
.Cells(RowNum, ColNum) = HTMLCell.innerText
ColNum = ColNum + 1
Next HTMLCell
RowNum = RowNum + 1
Next HTMLRow
Set HTMLTable = Nothing
.Cells(RowNum, ColNum + 1) = hTable '< note which iteration we are viewing
End With
Next hTable
End Sub
Versus:
Option Explicit
Public Sub Demo_XMLHTTP()
Dim XMLPage As New MSXML2.XMLHTTP60
Dim HTMLDoc As New MSHTML.HTMLDocument
XMLPage.Open "GET", "https://www.w3schools.com/html/html_tables.asp", False
XMLPage.send
If XMLPage.Status <> 200 Then
MsgBox XMLPage.Status & " - " & XMLPage.statusText
Exit Sub
End If
HTMLDoc.body.innerHTML = XMLPage.responseText
ProcessHTMLPage HTMLDoc
End Sub
Public Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)
Dim HTMLTable As MSHTML.IHTMLElement
Dim HTMLTables()
Dim HTMLRow As MSHTML.IHTMLElement
Dim HTMLCell As MSHTML.IHTMLElement
Dim RowNum As Long, ColNum As Long
Dim hTable As Variant
HTMLTables = Array(1, 2, 3)
RowNum = 1
For Each hTable In HTMLTables
Set HTMLTable = HTMLPage.getElementById("customers") '<== yeah same table but imagine it is different
ThisWorkbook.Worksheets("XMLHTTP").Select
With ActiveSheet
For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
ColNum = 1
For Each HTMLCell In HTMLRow.Children
.Cells(RowNum, ColNum) = HTMLCell.innerText
ColNum = ColNum + 1
Next HTMLCell
RowNum = RowNum + 1
Next HTMLRow
Set HTMLTable = Nothing
.Cells(RowNum, ColNum + 1) = hTable '< note which iteration we are viewing
End With
Next hTable
End Sub
Ignore table 10:
You can use a For i = 0 To HTMLTables.Length - 2
, rather than For Each
, to ignore last table. Access any given table with HTMLTables.item(i)
. Otherwise, you could test the id and ignore based on that, or even based on index (remember to -1). I would probably use id as more reliable. Normally, you would run to .Length-1
.
Bonus:
I can't run against your test page but if you do a For Loop
, you can adjust the column to write out to, depending on whether i
is odd or even (use MOD
for example); Odd number MOD 2 = 1
; Even MOD 2 = 0
then adjust ColNum
with -1, or as appropriate.