Intro:
Some of you may have noticed that something has broken in relation to the querySelectorAll
method of MSHTML.HTMLDocument
from MSHTML.Dll
(via a Microsoft HTML Document Library
reference). This, I believe, has happened in the last month. It may not affect all users and I will update this Q&A as I get more info on which versions etc are affected. Please feel free to comment below with your set-up and whether working or not for both late-bound and early-bound (as per code in answer)
Accessing DispStaticNodeList
methods:
Traditionally, at least in my experience, it has been the norm to hold a reference to the DispStaticNodeList
, which is what querySelectorAll
returns, in a generic late-bound Object
type:
E.g.
Dim nodeList1 As Object
Set nodeList1 = html.querySelectorAll("a")
where html
is an instance of MSHTML.HTMLDocument
.
As you can see from the Locals window, you get the expected nodeList
shown:
You could then access the list of the document's elements, that match the specified group of selectors, with .item(index)
, and get the number of items matched with .Length
. E.g.
Debug.Print nodeList1.item(0).innerText
Debug.Print nodeList1.Length
What happens now?
Attempts to access the methods, via late bound Object
, and its underlying interfaces, lead to either an Object required
, when using the .item()
method call, or Null
when querying the .Length()
. E.g.
nodeList1.item(0).innertext ' => Run-time error '424': Object required
Debug.Print nodeList1.Length ' => Null
This happens when you hold a reference through assigning to a variable.
What you can do:
You can use With
and work off html
, avoiding the Object
class
With html.querySelectorAll("a")
For i = 0 To .Length - 1
Debug.Print .Item(i).innerText
Next
End With
So, I think the problem is very much about the Object
data type and its underlying interfaces. And possibly, something about this has broken in relation to MSHTML, and most likely, the now no longer supported, Internet Explorer, which sits in the background:
However, this is not desirable, as you parse, and re-parse, the same HTML, during the loop, losing much of the efficiency that its gained by choosing css selectors over traditional methods e.g. getElementsByClassName
. Those traditional methods remain intact.
Why do some of us care?
Modern browsers (and even IE8 onwards) support faster node matching through use of css selectors. It seems reasonable to assume that this carried over into the DOM parsers with MSHTML.HTMLDocument
. So, you have faster matching, combined with more expressive and concise syntax (none of those long chained method calls e.g. getElementsByClassName("abc")(0).getElementsByTagName("def")(0).....
), the ability to return more desired nodes, without repeated calls (in the prior example you will only get def
as children of the first element with class abc
, rather than all children, with tag def
, of all elements with class abc
, which you would get with querySelectorAll(".abc def")
. And, you lose the flexibility to specify much more complex and specific patterns for node matching e.g. querySelectorAll(".abc > def + #ghi)
. For those interested, you can read more about those selectors on MSDN.
Question:
So, how does one avoid re-parsing, and hold the reference to the returned list of matched nodes? I have found nothing on the internet, despite quite a bit of searching, that documents this recent change in behaviour. It is also a very recent change and that likely only affects a small user base.
I hope the above satisfies the need to demonstrate research into the problem.
My set-up:
OS Name Microsoft Windows 10 Pro
Version 10.0.19042 Build 19042
System Type x64-based PC
Microsoft® Excel® 2019 MSO (16.0.13929.20206) 32-bit (Microsoft Office Professional Plus)
Version 2104 Build 13929.20373
mshtml.dll info as per image
Not affected (TBD):
Do not despair VBA web-scrapers (I know there are a few!) We can still have the luxury of css selectors and the benefits, though admittedly somewhat limited in VBA, that they bring.
To the rescue:
MSHTML
, gratias IE, offers a number of scripting object interfaces . One of which is the IHTMLDOMChildrenCollection
interface, which inherits from IDispatch
, and which:
provides methods to access items in the collection.
This includes the .Length
property and access to items via .item(index)
.
Dim nodeList2 As MSHTML.IHTMLDOMChildrenCollection
Set nodeList2 = html.querySelectorAll("a")
Debug.Print nodeList2.Length ' => n
Debug.Print nodeList2.Item(0).innerText
This is supported on clients Windows XP +, and servers from Windows 2000 Server onwards.
VBA:
Public Sub ReviewingNodeListMethods()
'' References (VBE > Tools > References):
''Microsoft HTML object Library
''Microsoft XML library (v.6 for me)
Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument 'XMLHTTP60 is for Excel 2016. Change according to your version e.g. XMLHTTP for 2013)
Set http = New MSXML2.XMLHTTP60: Set html = New MSHTML.HTMLDocument
With http
.Open "GET", "http://books.toscrape.com/", False
.send
html.body.innerHTML = .responseText
End With
Dim nodeList1 As Object, nodeList2 As MSHTML.IHTMLDOMChildrenCollection
Set nodeList1 = html.querySelectorAll("a")
Set nodeList2 = html.querySelectorAll("a")
Debug.Print nodeList1.Length ' => Null
Debug.Print nodeList2.Length ' => 94
Debug.Print nodeList2.Item(0).innerText
' Dim i As Long
'
' With html.querySelectorAll("a")
' For i = 0 To .Length - 1
' Debug.Print .Item(i).innerText
' Next
' End With
'' ================Warning: This will crash Excel -============================
' Dim node As MSHTML.IHTMLDOMNode
'
' For Each node In nodeList2
' Debug.Print node.innerText
' Next
'' ================Warning: This will crash Excel -============================
End Sub
N.B. There is still the underlying problem of the collection enumeration method; it causes Excel to crash if you attempt a For Each
e.g.
Dim node As MSHTML.IHTMLDOMNode
For Each node In nodeList2
Debug.Print node.innerText
Next
Updating your old Questions/Answers:
querySelectorAll user:<userid> is:answer
; querySelectorAll user:<userid> is:question