
Why does //NODE not find any elements, but //*[name() = 'NODE'] does?


I am trying to parse the XML (or maybe HTML?) output of the San Francisco transit Operators API (free API key required):

https://511.org/open-data/transit

I pasted the full XML string into this Gist, since it is long and I haven't minimized the example: https://gist.github.com/MichaelChirico/7a3a5bb95d577d8d83ebea37c44320d0

I'm using R's xml2 package to process this, which uses libxml2 as a backend:

https://github.com/r-lib/xml2

For some reason, I can't find Operator nodes in the normal way:

library(xml2)
s = ' <xml string here> '
xml = read_xml(s)
xml_find_all(xml, "//Operator")
# {xml_nodeset (0)}

However, name() finds Operator as the correct node name:

# Using '*' because some intermediate nodes have the same issue,
#   basically anything nested beyond a `siri:` node.
xml_find_chr(xml, 'name(*/*/*/*/*/*)')
# [1] "Operator"

And this convoluted approach works:

xml_find_all(xml, '//*[name() = "Operator"]') |> head()
# {xml_nodeset (6)}
#  [1] <Operator id="5E" version="any">\n  <Extensions>\n    <Monitored>false</Monitored>\n    <OtherM ...
#  [2] <Operator id="5F" version="any">\n  <Extensions>\n    <Monitored>false</Monitored>\n    <OtherM ...
#  [3] <Operator id="5O" version="any">\n  <Extensions>\n    <Monitored>false</Monitored>\n    <OtherM ...
#  [4] <Operator id="5S" version="any">\n  <Extensions>\n    <Monitored>false</Monitored>\n    <OtherM ...
#  [5] <Operator id="AC" version="any">\n  <Extensions>\n    <Monitored>true</Monitored>\n    <OtherMo ...
#  [6] <Operator id="CE" version="any">\n  <Extensions>\n    <Monitored>false</Monitored>\n    <OtherM ...

Is this a bug, or am I doing something wrong?


Solution

  • The XML in question has multiple namespaces.

    <?xml version="1.0" encoding="iso-8859-1"?>
    <siri:Siri xsi:schemaLocation="http://www.siri.org.uk/siri  http://www.kizoom.com/standards/netex/schema/0.99.1/xsd/NeTEx_siri.xsd"
        xmlns:siri="http://www.siri.org.uk/siri"
        xmlns="http://www.netex.org.uk/netex"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:gml="http://www.opengis.net/gml" version="1.0">
        <siri:ServiceDelivery>
            <siri:ResponseTimestamp>2023-06-21T23:27:30-07:00</siri:ResponseTimestamp>
            <DataObjectDelivery>
                <siri:ResponseTimestamp>2023-06-21T23:27:30-07:00</siri:ResponseTimestamp>
                <dataObjects>
                    <ResourceFrame id="SF" version="any">
                        <organisations>
                            <Operator id="5E" version="any">
    

    Two of them are relevant to the <Operator> XML element:

    • xmlns:siri="http://www.siri.org.uk/siri"
    • xmlns="http://www.netex.org.uk/netex", i.e. the default namespace.

    So, the fully qualified XPath expression would be as follows:

    /siri:Siri/siri:ServiceDelivery/ns1:DataObjectDelivery/ns1:dataObjects/ns1:ResourceFrame/ns1:organisations/ns1:Operator
    

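    Here ns1 is an alias for the default namespace. In XPath 1.0, an unprefixed name test such as Operator matches only elements that are in no namespace, so //Operator finds nothing. The name() function, by contrast, returns the element name as written in the document ("Operator"), which is why the //*[name() = "Operator"] comparison succeeds.

    So this is not a bug: you need to add namespace handling to your code and use namespace-qualified XPath expressions.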
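A minimal sketch of that namespace handling in xml2, assuming a trimmed-down stand-in for the API response (the sample string and the variable names below are illustrative, not taken from the real feed). xml_ns() lists the namespaces in the document and assigns the default namespace the prefix d1; xml_ns_rename() can map that prefix to ns1 so the path above can be used as written.

```r
library(xml2)

# Hypothetical, shortened sample with the same namespace declarations
# and nesting as the real response.
s <- '<siri:Siri xmlns:siri="http://www.siri.org.uk/siri"
           xmlns="http://www.netex.org.uk/netex" version="1.0">
  <siri:ServiceDelivery>
    <DataObjectDelivery>
      <dataObjects>
        <ResourceFrame id="SF" version="any">
          <organisations>
            <Operator id="5E" version="any"/>
            <Operator id="AC" version="any"/>
          </organisations>
        </ResourceFrame>
      </dataObjects>
    </DataObjectDelivery>
  </siri:ServiceDelivery>
</siri:Siri>'
doc <- read_xml(s)

# An unprefixed name test only matches elements in no namespace.
unprefixed <- xml_find_all(doc, "//Operator")

# xml_ns() names the default namespace "d1"; use that prefix in XPath.
ns <- xml_ns(doc)
ops_d1 <- xml_find_all(doc, "//d1:Operator", ns)

# Optionally rename d1 to ns1 so the fully qualified path works verbatim.
ns1 <- xml_ns_rename(ns, d1 = "ns1")
ops_ns1 <- xml_find_all(
  doc,
  "/siri:Siri/siri:ServiceDelivery/ns1:DataObjectDelivery/ns1:dataObjects/ns1:ResourceFrame/ns1:organisations/ns1:Operator",
  ns1
)
ids <- xml_attr(ops_ns1, "id")
```

If the namespaces carry no meaning for your processing, xml2 also provides xml_ns_strip(), which removes the default namespace declarations from the document so that unprefixed expressions such as //Operator match; whether that is appropriate depends on how the result is used afterwards.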