I'm trying to parse financial information directly from the SEC and have a question about parsing XML documents using VBA in excel. I understand that it's possible to work through the document referencing child nodes and item numbers, but the document is huge and it would take forever to read through and identify each item I need.
I saw online that it's common to use XPATH to more efficiently query XML documents. I've tried many approaches but I've had no success so far. I believe my issue is understanding which namespace the elements are in and how to properly reference the elements under a specified namespace.
Below is a portion of my code trying to reference an arbitrary element,
Sub SecData()
Dim xml_obj As MSXML2.XMLHTTP60
Dim xDoc As New MSXML2.DOMDocument60
Dim xml_url As String
Dim nodes As Variant
Set xml_obj = New MSXML2.XMLHTTP60
xml_url = "https://www.sec.gov/Archives/edgar/data/320193/000032019321000010/aapl-20201226_htm.xml"
xml_obj.Open bstrMethod:="GET", bstrURL:=xml_url
xml_obj.send
xDoc.LoadXML (xml_obj.responseText)
xDoc.SetProperty "SelectionLanguage", "XPath"
xDoc.SetProperty "SelectionNamespaces", "xmlns:link='http://www.xbrl.org/2003/linkbase'"
nodes = xDoc.SelectNodes("//RevenueFromContractWithCustomerExcludingAssessedTax")
The XML document is too large to include in the question so I'll leave a link below,
https://www.sec.gov/Archives/edgar/data/320193/000032019321000010/aapl-20201226_htm.xml
Any help would be greatly appreciated!
Thanks
XML: do child nodes inherit parent's namespace prefix? covers namespace inheritance.
If a namespace has no prefix such as this one:
xmlns="http://www.xbrl.org/2003/instance"
then it's inherited by anything below it.
If there's a prefix (here "xbrldi")
xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
then it's only inherited when explicitly used in an element such as:
<xbrldi:explicitMember dimension="us-gaap:StatementClassOfStockAxis">
The element in your query has "us-gaap" as a namespace alias, so you need to add it to the namespaces collection, and include the alias in your xpath:
us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax
For example:
Dim xDoc As New MSXML2.DOMDocument60
Dim nodes As Variant
xDoc.resolveExternals = False 'don't try to resolve external resources
xDoc.validateOnParse = True
xDoc.Load "C:\Temp\aapl-20201226_htm.xml" 'using local copy for testing...
Debug.Print xDoc.parseError.reason 'in case of problems
xDoc.SetProperty "SelectionLanguage", "XPath"
'add namespaces: the first one is the default namespace with a "dummy" prefix of "xxx"
xDoc.SetProperty "SelectionNamespaces", _
"xmlns:xxx='http://www.xbrl.org/2003/instance' " & _
"xmlns:link='http://www.xbrl.org/2003/linkbase' " & _
"xmlns:us-gaap='http://fasb.org/us-gaap/2020-01-31'"
'+ other namespaces as needed...
'element with no prefix: using the "dummy" `xxx` prefix we added for the default namespace
Set nodes = xDoc.SelectNodes("//xxx:context")
Debug.Print nodes.Length ' 207
'these elements belong to a specific namespace so use that prefix...
Set nodes = xDoc.SelectNodes("//us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax")
Debug.Print nodes.Length ' 28