Search code examples
excelvbaweb-scrapingexport-to-excel

Excel VBA - Get all href links from a website


Example

Hi there, hoping that someone can help me. On this example link: https://www.academiadasapostas.com/stats/competition/brasil/26

I want to get all the href links which are the target of the "VS". I'm trying examples like this one:

Sub ScrapeScores()

Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As MSHTML.HTMLDocument
Dim HTMLTables As MSHTML.IHTMLElementCollection
Dim HTMLTable As MSHTML.IHTMLElement
Dim HTMLDiv As MSHTML.IHTMLElement
Dim TableSection As MSHTML.IHTMLElement
Dim TableRow As MSHTML.IHTMLElement
Dim TableCell As MSHTML.IHTMLElement
Dim RowText As String



IE.Visible = True
IE.navigate "https://www.academiadasapostas.com/stats/competition/brasil/26"

Do While IE.readyState <> READYSTATE_COMPLETE Or IE.Busy
Loop

Set HTMLDoc = IE.document
Set HTMLDiv = HTMLDoc.getElementById("competition-round-group-0")
Set HTMLTables = HTMLDiv.getElementsByTagName("a")

For Each HTMLTable In HTMLTables
    Debug.Print HTMLTable.ID, "&", HTMLTable.className
    
    For Each TableSection In HTMLTable.Children
        Debug.Print , TableSection.tagName
        
    Next TableSection
    
Next HTMLTable


End Sub

But with no success. I think I could use CSS with a SelectorAll, right? Since IE is going to get extinted, would be nice to use the CSS instead.

Thank you in advance for any answer.


Solution

  • You can use the following css pattern with querySelectorAll .competition-rounds td:nth-child(4) > a. Loop the returned nodeList and extract the href from each node. This selects for the 4th column within that table, then the child a tag, from which the href attribute is extracted during the loop.


    Required references:

    1. Microsoft Internet Controls
    2. Microsoft HTML Object Library

    Option Explicit
    
    Public Sub PrintLinks()
        Dim ie As SHDocVw.InternetExplorer, nodeList As MSHTML.IHTMLDOMChildrenCollection
    
        Set ie = New SHDocVw.InternetExplorer
    
        With ie
    
            .Visible = True
            .Navigate2 "https://www.academiadasapostas.com/stats/competition/brasil/26"
            
            While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend
            
            Set nodeList = ie.Document.querySelectorAll(".competition-rounds td:nth-child(4) > a")
            
            Dim i As Long
            
            For i = 0 To nodeList.length - 1
              
                Debug.Print nodeList.Item(i).href
    
            Next
            
            Stop
    
            .Quit
        End With
    End Sub
    

    Reading:

    1. :nth-child()
    2. Child combinator