Search code examples
vbaweb-scrapingmshtmlselectors-apinodelist

How to hold a reference to the items matched by `querySelectorAll`, in a variable, that allows you to access its methods?


Intro:

Some of you may have noticed that something has broken in relation to the querySelectorAll method of MSHTML.HTMLDocument from MSHTML.Dll (via a Microsoft HTML Document Library reference). This, I believe, has happened in the last month. It may not affect all users and I will update this Q&A as I get more info on which versions etc are affected. Please feel free to comment below with your set-up and whether working or not for both late-bound and early-bound (as per code in answer)


Accessing DispStaticNodeList methods:

Traditionally, at least in my experience, it has been the norm to hold a reference to the DispStaticNodeList, which is what querySelectorAll returns, in a generic late-bound Object type:

E.g.

Dim nodeList1 As Object

Set nodeList1 = html.querySelectorAll("a")

where html is an instance of MSHTML.HTMLDocument.

As you can see from the Locals window, you get the expected nodeList shown:

enter image description here

You could then access the list of the document's elements, that match the specified group of selectors, with .item(index), and get the number of items matched with .Length. E.g.

Debug.Print nodeList1.item(0).innerText
Debug.Print nodeList1.Length

What happens now?

Attempts to access the methods, via late bound Object, and its underlying interfaces, lead to either an Object required, when using the .item() method call, or Null when querying the .Length(). E.g.

nodeList1.item(0).innertext  ' => Run-time error '424': Object required
Debug.Print nodeList1.Length ' => Null 

This happens when you hold a reference through assigning to a variable.


What you can do:

You can use With and work off html, avoiding the Object class

With html.querySelectorAll("a")
    For i = 0 To .Length - 1
       Debug.Print .Item(i).innerText
    Next
End With

So, I think the problem is very much about the Object data type and its underlying interfaces. And possibly, something about this has broken in relation to MSHTML, and most likely, the now no longer supported, Internet Explorer, which sits in the background:

However, this is not desirable, as you parse, and re-parse, the same HTML, during the loop, losing much of the efficiency that its gained by choosing css selectors over traditional methods e.g. getElementsByClassName. Those traditional methods remain intact.


Why do some of us care?

Modern browsers (and even IE8 onwards) support faster node matching through use of css selectors. It seems reasonable to assume that this carried over into the DOM parsers with MSHTML.HTMLDocument. So, you have faster matching, combined with more expressive and concise syntax (none of those long chained method calls e.g. getElementsByClassName("abc")(0).getElementsByTagName("def")(0).....), the ability to return more desired nodes, without repeated calls (in the prior example you will only get def as children of the first element with class abc, rather than all children, with tag def, of all elements with class abc, which you would get with querySelectorAll(".abc def"). And, you lose the flexibility to specify much more complex and specific patterns for node matching e.g. querySelectorAll(".abc > def + #ghi). For those interested, you can read more about those selectors on MSDN.


Question:

So, how does one avoid re-parsing, and hold the reference to the returned list of matched nodes? I have found nothing on the internet, despite quite a bit of searching, that documents this recent change in behaviour. It is also a very recent change and that likely only affects a small user base.

I hope the above satisfies the need to demonstrate research into the problem.


My set-up:

OS Name Microsoft Windows 10 Pro
Version 10.0.19042 Build 19042
System Type x64-based PC
Microsoft® Excel® 2019 MSO (16.0.13929.20206) 32-bit (Microsoft Office Professional Plus)
Version 2104 Build 13929.20373
mshtml.dll info as per image

Not affected (TBD):

  1. Office Professional plus 2013. Win 7, 32 bit, MSHTML.dll 11.0.9600.19597

Solution

  • Do not despair VBA web-scrapers (I know there are a few!) We can still have the luxury of css selectors and the benefits, though admittedly somewhat limited in VBA, that they bring.

    To the rescue:

    MSHTML, gratias IE, offers a number of scripting object interfaces . One of which is the IHTMLDOMChildrenCollection interface, which inherits from IDispatch, and which:

    provides methods to access items in the collection.

    This includes the .Length property and access to items via .item(index).

    Dim nodeList2 As MSHTML.IHTMLDOMChildrenCollection
    
    Set nodeList2 = html.querySelectorAll("a")
    Debug.Print nodeList2.Length                 ' => n 
    Debug.Print nodeList2.Item(0).innerText
    

    This is supported on clients Windows XP +, and servers from Windows 2000 Server onwards.


    VBA:

    Public Sub ReviewingNodeListMethods()
        '' References (VBE > Tools > References):
              ''Microsoft HTML object Library
              ''Microsoft XML library (v.6 for me)
    
        Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument   'XMLHTTP60 is for Excel 2016. Change according to your version e.g. XMLHTTP for 2013)
        
        Set http = New MSXML2.XMLHTTP60: Set html = New MSHTML.HTMLDocument
        
        With http
            .Open "GET", "http://books.toscrape.com/", False
            .send
            html.body.innerHTML = .responseText
        End With
    
        Dim nodeList1 As Object, nodeList2 As MSHTML.IHTMLDOMChildrenCollection
        
        Set nodeList1 = html.querySelectorAll("a")
        Set nodeList2 = html.querySelectorAll("a")
      
        Debug.Print nodeList1.Length                 ' => Null
        Debug.Print nodeList2.Length                 ' => 94
        
        Debug.Print nodeList2.Item(0).innerText
        
        '    Dim i As Long
        '
        '    With html.querySelectorAll("a")
        '        For i = 0 To .Length - 1
        '           Debug.Print .Item(i).innerText
        '        Next
        '    End With
        
        '' ================Warning: This will crash Excel -============================
    
        '    Dim node As MSHTML.IHTMLDOMNode
        '
        '    For Each node In nodeList2
        '        Debug.Print node.innerText
        '    Next
        '' ================Warning: This will crash Excel -============================
    
    End Sub
    

    N.B. There is still the underlying problem of the collection enumeration method; it causes Excel to crash if you attempt a For Each e.g.

    Dim node As MSHTML.IHTMLDOMNode
    
    For Each node In nodeList2
        Debug.Print node.innerText
    Next
    

    Updating your old Questions/Answers:

    1. You can use this SEDE query to identify potential candidates for revision. Enter your userid and the search term "querySelectorAll"
    2. Or simply use the following in the search bar: querySelectorAll user:<userid> is:answer ; querySelectorAll user:<userid> is:question