I am trying to parse a Website using VBA. My goal is to extract the data of the company (The name, website, location, investors, date of investment, amount of investment etc.) and to store it in an excel sheet. My function to extract the HTML source is the following (it works, at least on my computer...). I use late binding to have a better portability.
Private Sub getHTMLFromURL()
Dim objHTTP As Object
Dim html As Object
Dim implTmp As Object
Set objHTTP = VBA.CreateObject("MSXML2.XMLHTTP")
Set html = CreateObject("htmlfile")
on error GoTo endProgram
With objHTTP
.Open "GET", URLStr, False
.send
If .READYSTATE = 4 And .Status = 200 Then
html.Open
html.write .responseText
html.Close
Else
Debug.Print "Error" & vbNewLine & "Ready state: " & .READYSTATE & _
vbNewLine & "HTTP request status: " & .Status
GoTo endProgram
End If
End With
endProgram:
Set html = Nothing
Set objHTTP = Nothing
If Err <> 0 Then
Debug.Print "Error in getHTMLFromURL " & Err.Number & " - " & Err.Description
End If
End Sub
My problem is that the HTML is one huge SCRIPT tag in which the data are hidden. For example, I guess the Headquarter location of the company is in this line of code:
{"key":"hqLocations","name":"Location","sortable":false,"getText":"function getText(c) {\n return getHQCity(c.hqLocations);\n }"}
I humbly admit that I have absolutely no clue on how to get these data. I have searched a lot in multiple forums without finding a fitting answer. I tried to adapt this method without success. Thus I have several questions linked to my issue:
Thanks a lot
Try the below script. When you run it, you should get the required data you have requested in your post:
Sub Fetch_Data()
Dim IE As New InternetExplorer, HTML As HTMLDocument
Dim posts As Object, post As Object, hdata As Object
Dim elem As Object, trow As Object
With IE
.Visible = False
.navigate "http://app.startupeuropeclub.eu/companies/pld_space"
While .Busy = True Or .readyState < 4: DoEvents: Wend
Set HTML = .document
End With
''the following line is let the scraper wait until the data is completely loaded
Do: Set hdata = HTML.getElementsByClassName("field"): DoEvents: Loop Until hdata.Length > 1
For Each post In hdata
With post.getElementsByClassName("title")
If .Length Then R = R + 1: Cells(R, 1) = .Item(0).innerText
End With
With post.getElementsByClassName("description")
If .Length Then Cells(R, 2) = .Item(0).innerText
End With
Next post
''avoiding hardcoded delay and wait until the data is completely loaded
Do: Set posts = HTML.querySelector(".card-content.with-padding table.simple-table"): DoEvents: Loop While posts Is Nothing
For Each elem In posts.Rows
For Each trow In elem.Cells
C = C + 1: Cells(I + 10, C) = trow.innerText
Next trow
C = 0
I = I + 1
Next elem
IE.Quit
End Sub
Reference to add to the library:
Microsoft Internet Controls
Microsoft HTML Object Library