Search code examples
xmlvbaweb-scrapingxml-parsingfinance

Parsing XML document using VBA (MSXML V6.0)


I'm trying to parse financial information directly from the SEC and have a question about parsing XML documents using VBA in excel. I understand that it's possible to work through the document referencing child nodes and item numbers, but the document is huge and it would take forever to read through and identify each item I need.

I saw online that it's common to use XPATH to more efficiently query XML documents. I've tried many approaches but I've had no success so far. I believe my issue is understanding which namespace the elements are in and how to properly reference the elements under a specified namespace.

Below is a portion of my code trying to reference an arbitrary element,

Sub SecData()

Dim xml_obj As MSXML2.XMLHTTP60
Dim xDoc As New MSXML2.DOMDocument60
Dim xml_url As String
Dim nodes As Variant

Set xml_obj = New MSXML2.XMLHTTP60

xml_url = "https://www.sec.gov/Archives/edgar/data/320193/000032019321000010/aapl-20201226_htm.xml"
xml_obj.Open bstrMethod:="GET", bstrURL:=xml_url
xml_obj.send

xDoc.LoadXML (xml_obj.responseText)

xDoc.SetProperty "SelectionLanguage", "XPath"
xDoc.SetProperty "SelectionNamespaces", "xmlns:link='http://www.xbrl.org/2003/linkbase'"

nodes = xDoc.SelectNodes("//RevenueFromContractWithCustomerExcludingAssessedTax")

The XML document is too large to include in the question so I'll leave a link below,

https://www.sec.gov/Archives/edgar/data/320193/000032019321000010/aapl-20201226_htm.xml

Any help would be greatly appreciated!

Thanks


Solution

  • XML: do child nodes inherit parent's namespace prefix? covers namespace inheritance.

    If a namespace has no prefix such as this one:

     xmlns="http://www.xbrl.org/2003/instance"
    

    then it's inherited by anything below it.

    If there's a prefix (here "xbrldi")

     xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
    

    then it's only inherited when explicitly used in an element such as:

    <xbrldi:explicitMember dimension="us-gaap:StatementClassOfStockAxis">
    

    The element in your query has "us-gaap" as a namespace alias, so you need to add it to the namespaces collection, and include the alias in your xpath:

    us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax
    

    For example:

    Dim xDoc As New MSXML2.DOMDocument60
    Dim nodes As Variant
    
    xDoc.resolveExternals = False  'don't try to resolve external resources
    xDoc.validateOnParse = True
    
    xDoc.Load "C:\Temp\aapl-20201226_htm.xml"  'using local copy for testing...
    
    Debug.Print xDoc.parseError.reason 'in case of problems
    
    xDoc.SetProperty "SelectionLanguage", "XPath"
    
    'add namespaces: the first one is the default namespace with a "dummy" prefix of "xxx"
    xDoc.SetProperty "SelectionNamespaces", _
                    "xmlns:xxx='http://www.xbrl.org/2003/instance' " & _
                    "xmlns:link='http://www.xbrl.org/2003/linkbase' " & _
                    "xmlns:us-gaap='http://fasb.org/us-gaap/2020-01-31'"
                    '+ other namespaces as needed...
    
    'element with no prefix: using the "dummy" `xxx` prefix we added for the default namespace
    Set nodes = xDoc.SelectNodes("//xxx:context")
    Debug.Print nodes.Length  ' 207
    
    'these elements belong to a specific namespace so use that prefix...
    Set nodes = xDoc.SelectNodes("//us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax")
    Debug.Print nodes.Length  ' 28