Search code examples
excelvbaweb-scrapingyahoo-finance

Web Scraping ETFs Daily Data VBA


I'm trying to web scrape some daily info of differents ETFs. I found that https://www.marketwatch.com/ have a accurate info. The most relevant info is the open Price, outstanding shares, NAV, total assets of the ETF. Here is the link for IVV US Equity: https://www.marketwatch.com/investing/fund/ivv

I have web scraped with VBA before but the HTML of the pages I had used are different, I don't know if this is because some values of the ETFs (such as Price and Taded Volume) change constantly. The idea is to create a code to extract relevant info and create a data base to analyze Macroeconomics factor using the ETFs as market indicators of flows between countries, regions, etc...

Mi first approach would be with VBA but after I get more into the data I would like to try with Python (after I get more conffident with it) to automate the webscraping process on a daily basis.

I am open to any suggestion or any other website that could be useful (I have tried with Yahoo Finance and Morningstar and I get the same problema with the HTML code).

This is my poor code:

Sub Get_Data()
    
    Dim ticker As String, enlace As String
    
    ticker = ThisWorkbook.Worksheets("ETFs").Cells(2, 2).Value 'IVV
    'link = "https://www.morningstar.com/etfs/arcx/" & ticker & "/quote.html"
    'link = "https://finance.yahoo.com/quote/" & ticker & "?p=" & ticker
    link = "https://www.marketwatch.com/investing/fund/" & ticker
        
    Application.ScreenUpdating = False
        
    Dim x As Integer
    x = ThisWorkbook.Worksheets("ETFs").Cells(Rows.Count, 1).End(xlUp).Row
    
    'Dim i As Integer
    'For i = 2 To x
    
    Dim total_net_assets As Variant, open_price As Variant, NAV As Variant, shares_out
            
    Set ie = CreateObject("InternetExplorer.application")
    With ie
        .Visible = False
        .navigate link
        While .Busy Or .readyState < 4: DoEvents: Wend
            Do
                DoEvents
                On Error Resume Next
                ' Here is where I get the problem of not knowing how to reference the values I need because the class name appears repeatedly
                total_net_assets = .document.getElementsByClassName("").Value
                open_price = .document.getElementByClassName("price").Value
                NAV = .document.getElementByClassName("").Value
                shares_out = .document.getElementByClassName("kv__value kv__primary ").Value
                On Error GoTo 0
            Loop
    End With
    ThisWorkbook.Worksheets("ETFs").Cells(2, 13).Value = total_net_assets
    ThisWorkbook.Worksheets("ETFs").Cells(2, 14).Value = NAV
    ThisWorkbook.Worksheets("ETFs").Cells(2, 15).Value = open_price
    ThisWorkbook.Worksheets("ETFs").Cells(2, 16).Value = shares_out
    ie.Quit
    'Next i
    Application.ScreenUpdating = True

End Sub

Solution

  • Access method:

    I use XMLHTTP requests as much faster than opening IE.

    Code notes:

    The following reads in fund short codes from Sheet1 column A, starting in A2, into an array. You can easily extend this adding more funds into column A.

    This array is looped issuing XMLHTTP requests by concatenating the fund code into the BASE_URL variable.

    I use a class, clsHTTP, to hold the XMLHTTP object to be efficient - no need to keep creating and destroying the object.

    I provide this class with two methods. One to retrieve the target page innerHTML (GetString), and the other to extract the required info if available (GetInfo). I use a dictionary to test if the searched for labels are present. If present I grab the associated value. If not, I have a placeholder vbNullString in the dictionary.

    I add each scraped result into a collection called results. At the end I loop this writing out to the sheet. By keeping most of the work in memory this provides for much faster scraping.


    Retrieving info from HTML:

    The labels e.g. Open, and values come in pairs.

    You can generate a nodeList (think collection as with getElementsByClassName) by using querySelectorAll method to apply a class CSS selector to gather the label elements by their class name kv__label. The "." is the class selector.

    Set labels = .querySelectorAll(".kv__label") '<== nodeList of labels
    

    You do the same to get the associated values:

    Set values = .querySelectorAll(".kv__value.kv__primary") '<== nodeList of associated values. Same length as labels nodeList so can use same index to retrieve associated label/value pairs from each nodeList.
    

    You loop the labels using the dictionary in the clsHTTP method .GetInfo to see if you searched for labels are present, if they are, the associated value is retrieved from values by using the same index as where the label was found in the nodeList labels, and the dictionary vbNullString value for that label is updated with the actual retrieved value, else it is left as vbNullString.


    Sample results:

    enter image description here

    VBA:

    Class module clsHTTP:

    Option Explicit
    Private http As Object
    
    Private Sub Class_Initialize()
        Set http = CreateObject("MSXML2.XMLHTTP")
    End Sub
    
    Public Function GetString(ByVal url As String) As String
        Dim sResponse As String
        With http
            .Open "GET", url, False
            .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
            .send
            sResponse = StrConv(.responseBody, vbUnicode)
            GetString = sResponse
        End With
    End Function
    
    Public Function GetInfo(ByVal html As HTMLDocument) As Object
        Dim dict As Object, i As Long
        Set dict = CreateObject("Scripting.Dictionary")
        dict.Add "Open", vbNullString
        dict.Add "Shares Outstanding", vbNullString
        dict.Add "Total Net Assets", vbNullString
        dict.Add "NAV", vbNullString
    
        Dim values As Object, labels As Object
    
        With html
            Set values = .querySelectorAll(".kv__value.kv__primary")
            Set labels = .querySelectorAll(".kv__label")
    
            For i = 0 To labels.Length - 1
                If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
            Next
        End With
        Set GetInfo = dict
    End Function
    

    Standard module 1:

    Option Explicit   
    Public Sub GetFundInfo()
        Dim sResponse As String, html As HTMLDocument, http As clsHTTP, i As Long
        Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
        Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
    
        Application.ScreenUpdating = False
    
        headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
        Set results = New Collection
        Set http = New clsHTTP
        Set ws = ThisWorkbook.Worksheets("Sheet1")
        Set html = New HTMLDocument
    
        funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
    
        For i = LBound(funds) To UBound(funds)
            If Not IsEmpty(funds(i)) Then
                url = BASE_URL & funds(i)
                html.body.innerHTML = http.GetString(url)
                results.Add http.GetInfo(html).Items
            End If
        Next
    
        If results.Count > 0 Then
            Dim item As Variant, r As Long, c As Long
            r = 2: c = 2
            With ws
                .Cells(1, c).Resize(1, UBound(headers) + 1) = headers
                For Each item In results
                    .Cells(r, c).Resize(1, UBound(item) + 1) = item
                    r = r + 1
                Next
            End With
        End If
        Application.ScreenUpdating = True
    End Sub
    

    Set-up:

    enter image description here


    Without using a class:

    Option Explicit
    
    Public Sub GetFundInfo()
        Dim sResponse As String, html As HTMLDocument,  i As Long
        Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
        Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
    
        Application.ScreenUpdating = False
    
        headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
        Set results = New Collection
        Set ws = ThisWorkbook.Worksheets("Sheet1")
        Set html = New HTMLDocument
    
        funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
    
        For i = LBound(funds) To UBound(funds)
            If Not IsEmpty(funds(i)) Then
                url = BASE_URL & funds(i)
                html.body.innerHTML = GetString(url)
                results.Add GetInfo(html).Items
            End If
        Next
    
        If results.Count > 0 Then
            Dim item As Variant, r As Long, c As Long
            r = 2: c = 2
            With ws
                .Cells(1, c).Resize(1, UBound(headers) + 1) = headers
                For Each item In results
                    .Cells(r, c).Resize(1, UBound(item) + 1) = item
                    r = r + 1
                Next
            End With
        End If
        Application.ScreenUpdating = True
    End Sub
    
    
    Public Function GetString(ByVal url As String) As String
        Dim http As Object
        Set http = CreateObject("MSXML2.XMLHTTP")
        Dim sResponse As String
        With http
            .Open "GET", url, False
            .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
            .send
            sResponse = StrConv(.responseBody, vbUnicode)
            GetString = sResponse
        End With
    End Function
    
    Public Function GetInfo(ByVal html As HTMLDocument) As Object
        Dim dict As Object, i As Long
        Set dict = CreateObject("Scripting.Dictionary")
        dict.Add "Open", vbNullString
        dict.Add "Shares Outstanding", vbNullString
        dict.Add "Total Net Assets", vbNullString
        dict.Add "NAV", vbNullString
    
        Dim values As Object, labels As Object
    
        With html
            Set values = .querySelectorAll(".kv__value.kv__primary")
            Set labels = .querySelectorAll(".kv__label")
    
            For i = 0 To labels.Length - 1
                If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
            Next
        End With
        Set GetInfo = dict
    End Function