Search code examples
excelvbaweb-scrapinggetelementsbyclassname

Excel vba getElementsByClassName


I am trying to scrape IPO date from crunchbase. Unfortunately I get Runtime Error 1004 “Application-defined or Object-defined error”. My goal is to save IPO date in A1 cell.

Sub GetIE()
  Dim IE As Object
  Dim URL As String
  Dim myValue As IHTMLElement
  URL = "https://www.crunchbase.com/organization/verastem"
  Set IE = CreateObject("InternetExplorer.Application")
  IE.Visible = True
  IE.Navigate URL
  Do While IE.Busy Or IE.ReadyState <> 4
    DoEvents
  Loop
  Set myValue = IE.Document.getElementsByClassName("post_glass post_micro_glass")(0)
  Range("A1").Value = myValue
  Set IE = Nothing
End Sub

Solution

  • I can't find that class name in the html for that url. You can use the css selector I show below which can be scraped by xmlhttp and thus avoiding opening a browser

    Option Explicit    
    Public Sub GetDate()
        Dim html As HTMLDocument
        Set html = New HTMLDocument                  '<  VBE > Tools > References > Microsoft Scripting Runtime
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", "https://www.crunchbase.com/organization/verastem#section-overview", False
            .send
            html.body.innerHTML = .responseText
        End With
        ActiveSheet.Range("A1") = html.querySelectorAll(".field-type-date.ng-star-inserted").item(1).innerText
    End Sub
    

    If you don't want to use compound classes then you can also use

    ActiveSheet.Range("A1") = html.querySelectorAll("#section-ipo-stock-price .field-type-date").item(1).innerText
    

    You can see the relevant html here:

    enter image description here

    Note the element has multiple (compound) classes

    <span class="component--field-formatter field-type-date ng-star-inserted" title="Jan 27, 2012">Jan 27, 2012</span>
    

    There are 3 classes component--field-formatter ; field-type-date and ng-star-inserted. I use two of these in combination in the first solution I give. Multiple classes is popular now-a-days due to the versatility it gives in page styling e.g. it allows overriding styles easily. You can read about css specificity* to understand this better.

    More classes may mean the code is a little less robust as the ordering of classes may be changed and a class, or more, may be removed. This was raised by @SIM in a comment on an answer to another web-scraping question. Thus, I offer one solution with two of the classes used, and another solution with only one of the classes used.


    Whilst you do get the same date for this page with simply:

    ActiveSheet.Range("A1") = html.querySelector("#section-ipo-stock-price .field-type-date").innerText
    

    I wouldn't want to assume that would always hold true as it grabs the date from the line where it says "Their stock opened".

    * https://developer.mozilla.org/en-US/docs/Web/CSS/Specificity

    References:

    1. querySelectorAll
    2. css selectors