
How to select matches separately from XML by XPath on macOS


I want to get all text contents from an XML file matching some selector.

I chose XPath because I already have xmllint installed on my Mac, although it is older than version 20909, which apparently has the behaviour I want by default.

$ xmllint --version
xmllint: using libxml version 20904
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude ICU ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib 

Here is my XML:

<?xml version="1.0" encoding="utf-8"?>
<xml>
  <foo bar="baz">Lorem</foo>
  <foo bar="baz">Ipsum</foo>
  <foo bar="baz">Dolor</foo>
  <foo bar="qux">Sit</foo>
  <foo bar="baz">Amet</foo>
</xml>

I want to get the text content of each foo element that has a certain attribute value:

$ xmllint --xpath '//foo[@bar="baz"]/text()' my.xml
LoremIpsumDolorAmet

The output is not newline-delimited, nor does it seem to be NUL-delimited:

$ xmllint --xpath '//foo[@bar="baz"]//text()' my.xml | od -A n -t x1
           4c  6f  72  65  6d  49  70  73  75  6d  44  6f  6c  6f  72  41
           6d  65  74  

How can I present the output such that matches are separated from each other by a newline, using macOS?


Solution

  • It can be done with xmllint --shell as follows. The shell session parses the file once and keeps it in memory, so this works well as long as the XML file is not too big.

    cnt=$(xmllint --xpath 'count(//foo[@bar="baz"])' test.xml)
    (for i in $(seq 1 $cnt); do echo "cat //foo[@bar='baz'][$i]/text()"; done) | xmllint --shell test.xml | grep -Ev '\/ [<>]( cat)?| -------'
    

    Result:

    Lorem
    Ipsum
    Dolor
    Amet
    

    Without the grep at the end, it produces:

    / > cat //foo[@bar='baz'][1]/text()
     -------
    Lorem
    / > cat //foo[@bar='baz'][2]/text()
     -------
    Ipsum
    / > cat //foo[@bar='baz'][3]/text()
     -------
    Dolor
    / > cat //foo[@bar='baz'][4]/text()
     -------
    Amet
    / >
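
The filter can be checked in isolation by replaying a captured transcript through it, without running xmllint. The sketch below is an illustration rather than part of the answer: strip_shell_noise and sample_transcript are hypothetical helper names, the pattern is an anchored variant of the one above (an assumption, not the answer's exact regex), and the sample input is a two-match excerpt of the transcript.

```shell
#!/bin/sh
# Hypothetical helper: drop the shell prompt lines ("/ > ...") and the
# " -------" separators, keeping only the matched text lines.
strip_shell_noise() {
    grep -Ev '^/ [<>]|^ -------'
}

# Hypothetical helper: a two-match excerpt of the unfiltered transcript.
sample_transcript() {
    printf '%s\n' \
        "/ > cat //foo[@bar='baz'][1]/text()" \
        ' -------' \
        'Lorem' \
        "/ > cat //foo[@bar='baz'][2]/text()" \
        ' -------' \
        'Ipsum' \
        '/ >'
}

# Prints "Lorem" and "Ipsum", one per line.
sample_transcript | strip_shell_noise
```

Anchoring with ^ means that a text value which merely contains " -------" somewhere in the middle of a line is kept. A text line that itself starts with "/ >" would still be dropped.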
    

    A different version, using cd followed by cat (cnt is hard-coded to 4 here; it can be computed with count() as above):

    cnt=4; (for i in $(seq 1 $cnt); do echo "cd //foo[@bar='baz'][$i]/text()"; echo "cat"; done) | xmllint --shell test.xml | grep -Ev ' > (cat|cd)?'
    

    Without the grep, the output is:

    / > cd //foo[@bar='baz'][1]/text()
    text > cat
    Lorem
    text > cd //foo[@bar='baz'][2]/text()
    text > cat
    Ipsum
    text > cd //foo[@bar='baz'][3]/text()
    text > cat
    Dolor
    text > cd //foo[@bar='baz'][4]/text()
    text > cat
    Amet
    text >
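
The approach can also be wrapped in a small reusable function. The following is a minimal sketch, not part of the original answer. gen_cat_commands and xpath_each are hypothetical names. It assumes that the shell's cat command accepts a parenthesized expression such as (//foo[@bar='baz'])[1], which indexes matches in document order across the whole file rather than by position under each parent element.

```shell
#!/bin/sh
# Hypothetical helper: print one xmllint shell "cat" command per match index.
# $1 = XPath expression selecting the elements, $2 = number of matches.
gen_cat_commands() {
    xp=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        # Parentheses make [i] index the whole node-set in document order.
        printf 'cat (%s)[%d]/text()\n' "$xp" "$i"
        i=$((i + 1))
    done
}

# Hypothetical wrapper: count the matches, feed the generated commands to
# xmllint --shell, then strip the prompt and separator lines.
# $1 = XPath expression, $2 = XML file.
xpath_each() {
    xp=$1
    file=$2
    n=$(xmllint --xpath "count($xp)" "$file") || return
    gen_cat_commands "$xp" "$n" | xmllint --shell "$file" |
        grep -Ev '^/ [<>]|^ -------'
}

# Example call (requires xmllint and the sample file):
#   xpath_each "//foo[@bar='baz']" test.xml
```

Keeping the command generation separate from the xmllint call makes it possible to inspect the generated commands on their own before running them against a file.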