Perl libXML search the default namespace using findnodes


Given an XML file with multiple namespaces defined, what is the simplest way to search the DOM for elements just in the default namespace using an XPath query?

As the title suggests this is using Perl and libXML.

Furthermore, is it possible to do this without hardcoding the namespace? (If I use XPathContext to register the namespace, can I query the file for its default namespace?)

What I'm trying to achieve:
I'm searching many xlsx spreadsheet documents of different ages for certain formulas and processing these. I was hoping to just use a simple findnodes('//f') to gather all formulas in each sheet. All of the sheets have multiple namespaces defined, but most elements don't carry a namespace prefix. For example:

<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
<sheetData>
    <row r="1">
        <c r="A1">
            <f>SUM(1+2)</f>
            <v>3</v>
        </c>
        <c r="A2">
            <f>SUM(4+5)</f>
            <v>9</v>
        </c>
...
<controls>
    <mc:AlternateContent xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
        <mc:Choice Requires="x14">
            <control shapeId="1" r:id="rId4" name="blah">
...

As I mentioned above, I only care about the formulas, i.e., in the example above, "SUM(1+2)" and "SUM(4+5)".

How can I extract just this data out?
The solution doesn't have to be pretty but it does have to always work (I'm not sure if the namespaces change much.)

I could just pipe everything through grep/sed, but was hoping properly parsing it wouldn't be too hard...


Solution

  • You can ignore the namespaces completely with local-name():

    ...->findnodes('//*[local-name()="f"]')
    

    Note that, in general, ignoring namespaces is not the best idea, because local-name() matches an f element in any namespace. For example, if the syntax of the formulas depended on the version and you needed to normalize them, you would search for formulas in each namespace separately and run a different conversion for each namespace.
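
If you would rather match only elements in the default namespace without hardcoding its URI, one option is to read the URI from the root element and register it under a prefix with XML::LibXML::XPathContext. A plain '//f' finds nothing because, in XPath 1.0, an unprefixed name only matches elements that are in no namespace. The following is a minimal sketch, not a definitive implementation. It assumes the root element itself is in the default namespace (as worksheet is in the question's sample). The prefix "main", the trimmed-down XML, and the mc:f element are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Trimmed-down stand-in for a worksheet part. The <mc:f> element is
# hypothetical; it is here only to show what a namespace-aware query skips.
my $xml = <<'XML';
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
           xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
  <sheetData>
    <row r="1">
      <c r="A1"><f>SUM(1+2)</f><v>3</v></c>
      <c r="A2"><f>SUM(4+5)</f><v>9</v></c>
    </row>
  </sheetData>
  <mc:AlternateContent>
    <mc:f>not a formula</mc:f>
  </mc:AlternateContent>
</worksheet>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Read the namespace URI from the root element instead of hardcoding it.
# Assumption: the root element is unprefixed, so this is the default namespace.
my $default_ns = $doc->documentElement->namespaceURI;

# Bind that URI to a prefix of our choosing; "main" is an arbitrary name
# that only has to match the prefix used in the XPath expression.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(main => $default_ns);

# Only <f> elements in the default namespace are selected.
my @formulas = map { $_->textContent } $xpc->findnodes('//main:f');

# For comparison: local-name() also picks up the <mc:f> element.
my @any_ns = $doc->findnodes('//*[local-name()="f"]');

print "$_\n" for @formulas;
```

With this sample, '//main:f' selects the two formulas in the spreadsheet namespace, while the local-name() query also returns the mc:f element. The prefix in the XPath expression does not need to match anything in the file, since only the URI is compared.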