Search code examples
excelvbaweb-scrapinggetelementsbytagname

Web Scraping - Problems with tags


I am new scraping web data and also using For...Next. I am trying to get data (all pages) from a website but it seems the code is wrong, since I get error 91. This is the code:

Dim ie As Object

Sub connect()
Set ie = CreateObject("INTERNETEXPLORER.APPLICATION")
ie.NAVIGATE "https://www.worldathletics.org/world-rankings/100m/men"
ie.Visible = True
End Sub

Sub id_tr_td_for()

Range("a1:z10000").ClearContents

For i = 0 To 10
For j = 0 To 5
Cells(i + 1, j + 1) = ie.document.getElementById("toplists").getElementsByTagName("tr")(i).getElementsByTagName("td")(j).innerText
Next
Next
End Sub

Can somebody help me with it and also to let me know who can I list all pages?

Thank you.


Solution

  • I'm not sure where the error comes from, I got it too.

    The following code should be helpful, it will print the contents of the table for the specified page(s) to the debug window.

    The following code should copy all the data for selected pages to sheet1

    You will need to Add a couple of references in the VBA Editor to be able to use it. (Tools Menu, References and then find and select them) Microsoft HTML Object Library and Microsoft Internet Controls

    Const MaxPage = 2 ' set to 26 (or however many there are) - at 2 for testing purposes
    Dim Browser As InternetExplorer
    
    Sub Start()
    Dim Page As Integer: Page = 1 ' start at page 1
    Dim PageDocument As IHTMLDocument
    Dim RecordRow As IHTMLElementCollection
    Dim RecordItem As IHTMLElement
    
    Dim Sheet As Worksheet: Set Sheet = ThisWorkbook.Worksheets("Sheet1") ' output sheet
    If Browser Is Nothing Then
        Set Browser = New InternetExplorer
    End If
    Dim oRow As Integer: oRow = 2 ' begin output at row 2 (account for header)
    Dim Record As Integer
    For Page = 1 To MaxPage
        LoadPage Page
            For Record = 0 To 99 ' zero index, 100 items (1-99)
            Set PageDocument = Browser.Document
            Set RecordRow = PageDocument.getElementById("toplists").getElementsByTagName("table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")(Record).getElementsByTagName("td")
            Sheet.Cells(oRow, 1).Value = Trim(RecordRow(0).innerText)
            Sheet.Cells(oRow, 2).Value = Trim(RecordRow(1).innerText)
            Sheet.Cells(oRow, 3).Value = Trim(RecordRow(2).innerText)
            Sheet.Cells(oRow, 4).Value = Trim(RecordRow(3).innerText)
            Sheet.Cells(oRow, 5).Value = Trim(RecordRow(4).innerText)
            Sheet.Cells(oRow, 6).Value = Trim(RecordRow(5).innerText)
            oRow = oRow + 1
        Next Record
    Next Page
    Browser.Quit
    End Sub
    
    Sub LoadPage(ByVal PageNumber As Integer)
    Debug.Print "Navigating to Page #" & CStr(PageNumber)
    Browser.navigate "https://www.worldathletics.org/world-rankings/100m/men?page=" & CStr(PageNumber)
    While Browser.readyState <> 4 Or Browser.Busy: DoEvents: Wend
    Debug.Print "Navigation Complete"
    End Sub
    

    Updated Code

    The Index Out-of-Bound error likely occurred due to the hard-coded indexes, if a page does not have 99 records it will fail, if a record doesn't have 5 fields, it will fail. The following code does away with indexes and just scrapes every row and cell it finds. You shouldn't get index errors but the output could be jagged.

    Further Update

    The 462 error was caused by the Browser.Quit. This closes the browser but does not set the reference to Nothing so when you run the code again it is trying to use a non-existent browser. Explicitly setting it to nothing at the end fixes this.

    There is no link in the competitor column, the whole row has a data-url which is handled by something else. That URL can easily be accessed though.

    Sub NewStart()
    Dim PageDocument As IHTMLDocument
    
    Dim Records As IHTMLElementCollection
    Dim Record As IHTMLElement
    Dim RecordItems As IHTMLElementCollection
    Dim RecordItem As IHTMLElement
    
    Dim OutputRow As Integer: OutputRow = 2
    Dim OutputColumn As Integer
    
    Dim Page As Integer
    
    Dim Sheet As Worksheet: Set Sheet = ThisWorkbook.Worksheets("Sheet1")
    If Browser Is Nothing Then
        Set Browser = New InternetExplorer
        Browser.Visible = True
    End If
    For Page = 1 To MaxPage
        LoadPage Page
        Set PageDocument = Browser.Document
        Set Records = PageDocument.getElementById("toplists").getElementsByTagName("table")(0).getElementsByTagName("tbody")(0).getElementsByTagName("tr")
        For Each Record In Records
            Set RecordItems = Record.getElementsByTagName("td")
            OutputColumn = 1
            For Each RecordItem In RecordItems
                Sheet.Cells(OutputRow, OutputColumn).Value = Trim(RecordItem.innerText)
                OutputColumn = OutputColumn + 1
            Next RecordItem
            Sheet.Cells(OutputRow, OutputColumn).Value = "http://worldathletics.org/" & Record.getAttribute("data-athlete-url") ' This will add the link after the last column
            OutputRow = OutputRow + 1
        Next Record
    Next Page
    Browser.Quit
    Set Browser = Nothing ' This will fix the 462 error
    End Sub