Search code examples
javascripthtmlvbaparsingxmlhttprequest

How to get data generated by script on HTML page using VBA?


I am trying to parse a Website using VBA. My goal is to extract the data of the company (The name, website, location, investors, date of investment, amount of investment etc.) and to store it in an excel sheet. My function to extract the HTML source is the following (it works, at least on my computer...). I use late binding to have a better portability.

Private Sub getHTMLFromURL()

    Dim objHTTP As Object
    Dim html As Object
    Dim implTmp As Object

    Set objHTTP = VBA.CreateObject("MSXML2.XMLHTTP")
    Set html = CreateObject("htmlfile")

    on error GoTo endProgram

    With objHTTP
        .Open "GET", URLStr, False
        .send

        If .READYSTATE = 4 And .Status = 200 Then
            html.Open
            html.write .responseText
            html.Close
        Else
            Debug.Print "Error" & vbNewLine & "Ready state: " & .READYSTATE & _
            vbNewLine & "HTTP request status: " & .Status
            GoTo endProgram
        End If

    End With

endProgram:
    Set html = Nothing
    Set objHTTP = Nothing
    If Err <> 0 Then
        Debug.Print "Error in getHTMLFromURL " & Err.Number & " - " & Err.Description
    End If

End Sub

My problem is that the HTML is one huge SCRIPT tag in which the data are hidden. For example, I guess the Headquarter location of the company is in this line of code:

{"key":"hqLocations","name":"Location","sortable":false,"getText":"function getText(c) {\n    return getHQCity(c.hqLocations);\n  }"}

I humbly admit that I have absolutely no clue on how to get these data. I have searched a lot in multiple forums without finding a fitting answer. I tried to adapt this method without success. Thus I have several questions linked to my issue:

  • Is using MSXML2.XMLHTTP the best way to do so?
  • Do I have to fish for each variable or is there a way to directly interpret the script to have an HTML with all the data (which would be much easier then)?
  • Otherwise, how can I extract all the data (single data and arrays)?

Thanks a lot


Solution

  • Try the below script. When you run it, you should get the required data you have requested in your post:

    Sub Fetch_Data()
        Dim IE As New InternetExplorer, HTML As HTMLDocument
        Dim posts As Object, post As Object, hdata As Object
        Dim elem As Object, trow As Object
    
        With IE
            .Visible = False
            .navigate "http://app.startupeuropeclub.eu/companies/pld_space"
            While .Busy = True Or .readyState < 4: DoEvents: Wend
            Set HTML = .document
        End With
    
        ''the following line is let the scraper wait until the data is completely loaded
        Do: Set hdata = HTML.getElementsByClassName("field"): DoEvents: Loop Until hdata.Length > 1
    
        For Each post In hdata
            With post.getElementsByClassName("title")
                If .Length Then R = R + 1: Cells(R, 1) = .Item(0).innerText
            End With
            With post.getElementsByClassName("description")
                If .Length Then Cells(R, 2) = .Item(0).innerText
            End With
        Next post
    
        ''avoiding hardcoded delay and wait until the data is completely loaded
        Do: Set posts = HTML.querySelector(".card-content.with-padding table.simple-table"): DoEvents: Loop While posts Is Nothing
    
        For Each elem In posts.Rows
            For Each trow In elem.Cells
                C = C + 1: Cells(I + 10, C) = trow.innerText
            Next trow
            C = 0
            I = I + 1
        Next elem
        IE.Quit
    End Sub
    

    Reference to add to the library:

    Microsoft Internet Controls
    Microsoft HTML Object Library