XPath expression to get node based on attribute value


I have the following input xml file:

<rootnode>
 <section id="1" status="fail">
  <outer status="fail">
   <inner status="fail"/>
   <inner status="pass"/>
  </outer>
  <outer status="pass">
   <inner status="pass"/>
  </outer>
  <outer status="pass"/>
  <outer status="fail"/>
 </section>
 <section id="2" status="fail">
  <outer status="fail">
   <inner status="pass"/>
   <inner status="fail"/>
   <inner status="inc"/>
  </outer>
 </section>
</rootnode>

I want to filter out all non-fail status nodes so that the result looks like this:

<rootnode>
 <section id="1" status="fail">
  <outer status="fail">
   <inner status="fail"/>
  </outer>
  <outer status="fail"/>
 </section>
 <section id="2" status="fail">
  <outer status="fail">
   <inner status="fail"/>
  </outer>
 </section>
</rootnode>

The <rootnode> does not necessarily have to be included in the result. I have tried to use xmllint with an XPath expression. I can extract specific nodes with

xmllint --xpath "//inner" input.xml
xmllint --xpath "//@status" input.xml

but these either return the nodes without regard to the value of status, or return only the attribute without the surrounding nodes.

Is there a way to do this with an XPath expression? If not, a simple solution that incorporates other bash tools is fine, too.


Solution

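  • Like @svasa said in a comment, you should use XSLT. XPath on its own can only select nodes; it cannot remove descendants from the nodes it returns. You can easily process the XSLT in bash with xsltproc, xmlstarlet (using the tr command), Saxon (Java on the command line), etc.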

    Here's an example using xsltproc:

    $ xsltproc so.xsl so.xml
    <?xml version="1.0"?>
    <rootnode>
      <section id="1" status="fail">
        <outer status="fail">
          <inner status="fail"/>
        </outer>
        <outer status="fail"/>
      </section>
      <section id="2" status="fail">
        <outer status="fail">
          <inner status="fail"/>
        </outer>
      </section>
    </rootnode>
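As background on why XSLT is suggested rather than a bare XPath expression, here is a minimal sketch of what xmllint can and cannot do with an attribute predicate. It assumes xmllint is installed; the file name demo.xml and the variable names are only illustrative, and the input is a reduced version of the sample.

```shell
# Write a reduced version of the sample input (demo.xml is a made-up name).
cat > demo.xml <<'EOF'
<rootnode>
 <section id="1" status="fail">
  <outer status="fail">
   <inner status="fail"/>
   <inner status="pass"/>
  </outer>
  <outer status="pass"/>
 </section>
</rootnode>
EOF

# The predicate [@status='fail'] selects by attribute value, but the
# matching <inner> elements come back without their ancestors.
inner_fails=$(xmllint --xpath "//inner[@status='fail']" demo.xml)
printf '%s\n' "$inner_fails"

# Selecting a failing ancestor returns its whole subtree, so the
# passing descendants are still part of the result.
section_fails=$(xmllint --xpath "//section[@status='fail']" demo.xml)
printf '%s\n' "$section_fails"
```

Either way the result is a selection of existing nodes, not a pruned copy of the tree, which is the part the stylesheet takes care of.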
    

    XML Input (so.xml)

    <rootnode>
        <section id="1" status="fail">
            <outer status="fail">
                <inner status="fail"/>
                <inner status="pass"/>
            </outer>
            <outer status="pass">
                <inner status="pass"/>
            </outer>
            <outer status="pass"/>
            <outer status="fail"/>
        </section>
        <section id="2" status="fail">
            <outer status="fail">
                <inner status="pass"/>
                <inner status="fail"/>
                <inner status="inc"/>
            </outer>
        </section>
    </rootnode>
    

    XSLT 1.0 (so.xsl)

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output indent="yes"/>
      <xsl:strip-space elements="*"/>
    
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
    
      <xsl:template match="*[@status[not(normalize-space()='fail')]]"/>
    
    </xsl:stylesheet>
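If a stylesheet feels heavy for a one-off, the same pruning can probably be done by deleting the unwanted elements directly. The sketch below is a different technique from the XSLT above: it assumes xmlstarlet is installed and uses its `ed -d` (delete) subcommand. Unlike the stylesheet, it compares the attribute value exactly, without normalize-space().

```shell
# Recreate the sample input as so.xml.
cat > so.xml <<'EOF'
<rootnode>
 <section id="1" status="fail">
  <outer status="fail">
   <inner status="fail"/>
   <inner status="pass"/>
  </outer>
  <outer status="pass">
   <inner status="pass"/>
  </outer>
  <outer status="pass"/>
  <outer status="fail"/>
 </section>
 <section id="2" status="fail">
  <outer status="fail">
   <inner status="pass"/>
   <inner status="fail"/>
   <inner status="inc"/>
  </outer>
 </section>
</rootnode>
EOF

# Delete every element whose status attribute is something other than "fail".
# Elements without a status attribute (such as <rootnode>) do not match the
# predicate and are left alone.
pruned=$(xmlstarlet ed -d "//*[@status!='fail']" so.xml)
printf '%s\n' "$pruned"
```

With the sample input this should leave only the status="fail" elements, although the whitespace that surrounded the deleted elements stays in the output. Treat it as a sketch rather than a drop-in replacement for the stylesheet.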
    

    I have a small follow-up question, if you don't mind. When the input.xml file does not contain any status="fail" nodes, the output is just two lines: <?xml version="1.0"?> and <rootnode/>. Is it possible to suppress the output entirely in this case? It is not really a problem, since I know how to work around it in bash. I am just interested in whether there is a clean solution via XSLT.

    What you could do is omit the XML declaration (omit-xml-declaration="yes" in xsl:output) and check whether there are any elements with status="fail". I'd use a key (xsl:key) for this:

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output indent="yes" omit-xml-declaration="yes">
        <!--If you do need the XML declaration when there are elements
        with status="fail", it might be best to keep it and post-process
        the files that contain only the declaration.-->
      </xsl:output>
      <xsl:strip-space elements="*"/>
    
      <!--Key of all elements with status="fail".-->  
      <xsl:key name="fails" match="*[@status='fail']" use="@status"/>
    
      <xsl:template match="/*[not(key('fails','fail'))]">
        <!--If there aren't any elements with status="fail", don't process
        anything else.-->
      </xsl:template>
    
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
    
      <xsl:template match="*[@status[not(normalize-space()='fail')]]"/>
    
    </xsl:stylesheet>
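To try the suppression from bash, something like the following should work. It is only a sketch: it assumes xsltproc is installed, the file names quiet.xsl and allpass.xml are made up for the example, and the stylesheet is a condensed copy of the one above with the comments removed.

```shell
# Condensed copy of the second stylesheet (quiet.xsl is a made-up name).
cat > quiet.xsl <<'EOF'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:key name="fails" match="*[@status='fail']" use="@status"/>
  <xsl:template match="/*[not(key('fails','fail'))]"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="*[@status[not(normalize-space()='fail')]]"/>
</xsl:stylesheet>
EOF

# An input without any status="fail" elements.
cat > allpass.xml <<'EOF'
<rootnode>
 <section id="1" status="pass">
  <outer status="pass"/>
 </section>
</rootnode>
EOF

# Strip whitespace so that a stray newline does not count as output.
result=$(xsltproc quiet.xsl allpass.xml | tr -d '[:space:]')

if [ -n "$result" ]; then
  echo "failures found"
else
  echo "no failures, nothing to report"
fi
```

Checking for empty output this way keeps the decision in the stylesheet, so the bash side does not need its own XPath test.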