
Access deep nested node from document.xml using nokogiri


I am using Nokogiri to read the document.xml file from a .docx archive.

Here is a sample of it:

<w:document>
    <w:body>
        <w:p w:rsidR="00454EDC" w:rsidRDefault="00454EDC" w:rsidP="00454EDC">
                <w:drawing>
                    <wp:inline distT="0" distB="0" distL="0" distR="0">
                        <wp:extent cx="1926590" cy="1088571"/>
                        <wp:effectExtent l="0" t="0" r="0" b="0"/>
                        <wp:docPr id="1" name="Picture 1"/>
                        <wp:cNvGraphicFramePr>
                            <a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/>
                        </wp:cNvGraphicFramePr>
                        <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
                            <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
                                <pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
                                    <pic:nvPicPr>
                                        <pic:cNvPr id="0" name="Picture 1"/>
                                        <pic:cNvPicPr>
                                            <a:picLocks noChangeAspect="1" noChangeArrowheads="1"/>
                                        </pic:cNvPicPr>
                                    </pic:nvPicPr>
                                    <pic:blipFill>
                                        <a:blip r:embed="rId5" cstate="print">
                                            <a:extLst>
                                                <a:ext uri="{28A0092B-C50C-407E-A947-70E740481C1C}">
                                                    <a14:useLocalDpi xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main" val="0"/>
                                                </a:ext>
                                            </a:extLst>
                                        </a:blip>
                                        <a:srcRect/>
                                        <a:stretch>
                                            <a:fillRect/>
                                        </a:stretch>
                                    </pic:blipFill>
                                    <pic:spPr bwMode="auto">
                                        <a:xfrm>
                                            <a:off x="0" y="0"/>
                                            <a:ext cx="1951299" cy="1102532"/>
                                        </a:xfrm>
                                        <a:prstGeom prst="rect">
                                            <a:avLst/>
                                        </a:prstGeom>
                                        <a:noFill/>
                                        <a:ln>
                                            <a:noFill/>
                                        </a:ln>
                                    </pic:spPr>
                                </pic:pic>
                            </a:graphicData>
                        </a:graphic>
                    </wp:inline>
                </w:drawing>
            </w:p>
    </w:body>
</w:document>

Now I want to access all <w:drawing> tags and, from each of them, access the <a:blip> tag and extract the value of its r:embed attribute.

In this case, as you can see, it is rId5.

I am able to access the <w:drawing> tags with xml.xpath('//w:drawing'), but when I call xml.xpath('//w:drawing').xpath('//a:blip'), it throws an error:

Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //a:blip

What am I doing wrong? Can anyone point me in the right direction?


Solution

  • The error is telling you that in your XPath query, //a:blip, Nokogiri doesn’t know which namespace the prefix a refers to. You need to specify the namespaces that you are targeting in your query, not just the prefix. The fact that the prefix a is defined in the document doesn’t really matter; it is the actual namespace URI that is important. You can even use completely different prefixes in the query than those used in the document, as long as the namespace URIs match.

    You may be wondering why the query //w:drawing works. You don’t include the full XML, but I suspect that the w prefix is defined on the root node (something like xmlns:w="http://some.uri.here"). If you don’t specify any namespaces, Nokogiri will automatically register any defined in the root node so they will be available in your query. The namespace corresponding to the a prefix isn’t defined on the root, so it is unavailable, and so you get the error you see.

    To specify namespaces in Nokogiri you pass a hash, mapping each prefix (as used in the query) to its namespace URI, to the xpath method (or whichever query method you’re using). Since you are providing your own namespace mappings, you also need to include any you use from the root node, because Nokogiri doesn’t add them automatically in this case.

    In your case, the code would look something like this:

    namespaces = {
      'w' => 'http://some.uri', # whatever the URI is for this namespace
      'a' => 'http://schemas.openxmlformats.org/drawingml/2006/main'
    }
    
    # You can combine this into a single query.
    # Note the double slash in front of `a:blip`: in the sample it is
    # a nested descendant of `w:drawing`, not a direct child, so a
    # single slash would match nothing.
    xml.xpath('//w:drawing//a:blip', namespaces)
    

    Have a look at the Nokogiri tutorial section on namespaces for more info.