Search code examples
excelvbaweb-scrapingdata-extraction

How can I extract data from a website and fill an excel sheet using VBA?


I would like to extract data from betexplorer.com. I want to extract two different pieces of data from the following URL:

https://www.betexplorer.com/soccer/s...eague-1/stats/

I would like to extract Matches Played and Matches Remaining I would like to extract Home Goals and Away Goals (per match)

I have the the code to do that and it is as follows:

Option Explicit

Sub GetSoccerStats()


'Set a reference (VBE > Tools > References) to the following libraries:
'   1) Microsoft XML, v6.0
'   2) Microsoft HTML Object Library

Dim xmlReq As New MSXML2.XMLHTTP60
Dim objDoc As New MSHTML.HTMLDocument
Dim objTable As MSHTML.htmlTable
Dim objTableRow As MSHTML.htmlTableRow
Dim strURL As String
Dim strResp As String
Dim strText As String
Dim rw As Long

strURL = "https://www.betexplorer.com/soccer/south-korea/k-league-1/stats/"

With xmlReq
    .Open "GET", strURL, False
    .send
    If .Status <> 200 Then
        MsgBox "Error " & .Status & ":  " & .statusText
        Exit Sub
    End If
    strResp = .responseText
End With

Worksheets.Add

objDoc.body.innerHTML = strResp

Set objTable = objDoc.getElementsByClassName("table-main leaguestats")(0)

If Not objTable Is Nothing Then
    rw = 1
    For Each objTableRow In objTable.Rows
        strText = objTableRow.Cells(0).innerText
        Select Case strText
            Case "Matches played", "Matches remaining", "Home goals", "Away goals"
                Cells(rw, "a").Value = objTableRow.Cells(0).innerText
                Cells(rw, "b").Value = objTableRow.Cells(1).innerText
                Cells(rw, "c").Value = objTableRow.Cells(2).innerText
                rw = rw + 1
        End Select
    Next objTableRow
    Columns("a").AutoFit
End If

Set xmlReq = Nothing
Set objDoc = Nothing
Set objTable = Nothing
Set objTableRow = Nothing


End Sub

This code works however i want to take it a step further.

I actually want to run this macro for many different URL's on the same site. I have a worksheet already created that has a list of Football Leagues (in the rows), the columns hold the data.

You can find the file here : https://www.dropbox.com/s/77sol24sty75w5z/Avg%20Goals.xlsm?dl=0

This is a file where i will add leagues to the rows as i go. Is it possible to adapt the code that extracts the data so that it can populate the columns in my sheet? I do not need to pull in the names of the data (matches remaining, home goals, away goals etc) as this code does, i only need the figures. The extracted figures would have to populate the columns as per the sheet (so each row contains the data for each league. As you can see there are a few leagues so it would need to loop through each row and then use the corresponding URL for that row.

You will notice that there is a column that contains the word CURRENT. This is to indicate that it should use the URL in the Current URL column. If I change the value to LAST i would like it to use the URL in the Last URL column.

For each league it will be different if I use CURRENT or LAST.

Here is a picture of expected output:

expectedoutput

Any help is greatly appreciated.


Solution

  • Keeping in line with your code this will output the data for those items in columns M:T. I have a helper function, GetLinks, which generates an array of final urls to used based on the value in column K:

    inputArray = GetLinks(inputArray)
    

    This array is looped and xhr requests are issued for the information. All the results information is stored in an array, results, which is written out in one go to the sheet at the end.

    I work with array throughout as you don't want to keep reading from the sheet; that is an expensive operation which slows your code. For the same reason, if <> 200 occurs, I print to the immediate window the message and the url so as to not slow the code. You effectively have a log then you can review at the end.

    The retrieved results are written out from column M, but as the data is in array, you can easily write out to where ever you want; simply change the start cell for pasting from M4 to which ever top leftmost cell you want. Your existing columns do not have percentages in, so I felt safe to assume you expected the written out data to be in new columns (possibly even in a different sheet).

    Option Explicit   
    Public Sub GetSoccerStats()
        Dim xmlReq As New MSXML2.XMLHTTP60, response As String
        Dim objDoc As New MSHTML.HTMLDocument, text As String
        Dim lastRow As Long, dataSheet As Worksheet, inputArray(), i As Long
    
        Set dataSheet = ThisWorkbook.Worksheets("AVG GOAL DATA")
    
        With dataSheet
            lastRow = .Cells(.Rows.Count, "B").End(xlUp).Row
        End With
    
        inputArray = dataSheet.Range("J4:L" & lastRow).Value
        inputArray = GetLinks(inputArray)
    
        Dim results(), r As Long, c As Long
        ReDim results(1 To UBound(inputArray, 1), 1 To 8)
    
        With xmlReq
    
            For i = LBound(inputArray, 1) To UBound(inputArray, 1)
                r = r + 1
                .Open "GET", inputArray(i, 4), False
                .send
                If .Status <> 200 Then
                    Debug.Print inputArray(i, 4), vbTab, "Error " & .Status & ":  " & .statusText
                Else
                    response = .responseText
                    objDoc.body.innerHTML = response
    
                    Dim objTable As MSHTML.HTMLTable, objTableRow As MSHTML.HTMLTableRow
    
                    Set objTable = objDoc.getElementsByClassName("table-main leaguestats")(0)
    
                    If Not objTable Is Nothing Then
                        c = 1
                        For Each objTableRow In objTable.Rows
                            text = objTableRow.Cells(0).innerText
                            Select Case text
                            Case "Matches played", "Matches remaining", "Home goals", "Away goals"
                                results(r, c) = objTableRow.Cells(1).innerText
                                results(r, c + 1) = objTableRow.Cells(2).innerText
                                c = c + 2
                            End Select
                        Next objTableRow
                    End If
                End If
                Set objTable = Nothing
            Next
        End With
        dataSheet.Range("M4").Resize(UBound(results, 1), UBound(results, 2)) = results
    End Sub
    
    Public Function GetLinks(ByRef inputArray As Variant) As Variant
        Dim i As Long
        ReDim Preserve inputArray(1 To UBound(inputArray, 1), 1 To UBound(inputArray, 2) + 1)
    
        For i = LBound(inputArray, 1) To UBound(inputArray, 1)
            inputArray(i, 4) = IIf(inputArray(i, 1) = "CURRENT", inputArray(i, 2), inputArray(i, 3))
        Next
        GetLinks = inputArray
    End Function
    

    Layout of file:

    enter image description here


    Given large number of requests led to blocking here is IE version:

    'VBE > Tools > References:
    '1: Microsoft HTML Object library  2: Microsoft Internet Controls
    Public Sub GetSoccerStats()
        Dim ie As Object, t As Date
        Dim objDoc As New MSHTML.HTMLDocument, text As String
        Dim lastRow As Long, dataSheet As Worksheet, inputArray(), i As Long
    
        Const MAX_WAIT_SEC As Long = 10
    
        Set dataSheet = ThisWorkbook.Worksheets("AVG GOAL DATA")
        Set ie = CreateObject("InternetExplorer.Application")
        With dataSheet
            lastRow = .Cells(.Rows.Count, "B").End(xlUp).Row
        End With
    
        inputArray = dataSheet.Range("C4:E" & lastRow).Value
        inputArray = GetLinks(inputArray)
    
        Dim results(), r As Long, c As Long
        ReDim results(1 To UBound(inputArray, 1), 1 To 8)
    
        With ie
            .Visible = True
            For i = LBound(inputArray, 1) To UBound(inputArray, 1)
                r = r + 1
                .navigate2 inputArray(i, 4)
    
                While .Busy Or .readyState < 4: DoEvents: Wend
    
                Dim objTable As MSHTML.HTMLTable, objTableRow As MSHTML.HTMLTableRow
                t = timer
                Do
                    DoEvents
                    On Error Resume Next
                    Set objTable = .document.getElementsByClassName("table-main leaguestats")(0)
                    On Error GoTo 0
                    If Timer - t > MAX_WAIT_SEC Then Exit Do
                Loop While objTable Is Nothing
    
                If Not objTable Is Nothing Then
                    c = 1
                    For Each objTableRow In objTable.Rows
                        text = objTableRow.Cells(0).innerText
                        Select Case text
                        Case "Matches played", "Matches remaining", "Home goals", "Away goals"
                            results(r, c) = objTableRow.Cells(1).innerText
                            results(r, c + 1) = objTableRow.Cells(2).innerText
                            c = c + 2
                        End Select
                    Next objTableRow
                End If
                Set objTable = Nothing
            Next
            .Quit
        End With
        dataSheet.Range("F4").Resize(UBound(results, 1), UBound(results, 2)) = results
    End Sub