bash, xpath, xmllint

How to select the text behind an element?


I have the following xmllint example selecting an element:

$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]' -
<b>Messages:</b>

The number of messages I am interested in comes right after the bold element. It shows up when I use the parent axis:

$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/parent::*' -
<p><b>Starting:</b> <i>Thu Jan  1 23:17:09 CET 2015</i><br><b>Ending:</b> <i>Sat Jan 31 14:51:07 CET 2015</i><br><b>Messages:</b> 28</p>

I thought that the following-sibling axis might give me exactly this number, but it does not do so:

$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::*' -
XPath set is empty

Solution

  • The text node you are after is indeed a following sibling, but it is a text node, not an element node. An expression like

    following-sibling::*
    

    only looks for following siblings that are elements. To match text nodes, use text():

    $ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
    xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::text()' -
    

    The commands above do not work on my computer (bash on Mac OS X), but I trust they work for you. If I first save the result from curl and then use

    $ xmllint example.html --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::text()'
    

    The result is _28, where the underscore stands for a leading whitespace character (it is not a literal underscore). To remove the leading whitespace, use

    $ xmllint example.html --html --xpath 'normalize-space(/html/body/p/b[contains(., "Messages:")]/following-sibling::text())'
    

    And no, using regex is not really an option.
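
To try the expressions above without fetching the page, here is a minimal sketch that feeds xmllint a small inline sample. The markup in `html` is an assumption, modelled on the `<p>` element shown in the question; the variable names `html`, `xpath`, `raw` and `count` are only illustrative.

```shell
#!/bin/sh
# Sample markup modelled on the <p> element from the question (an assumption,
# not the real page).
html='<html><body><p><b>Starting:</b> <i>Thu Jan  1 23:17:09 CET 2015</i><br><b>Ending:</b> <i>Sat Jan 31 14:51:07 CET 2015</i><br><b>Messages:</b> 28</p></body></html>'

# Select the text node(s) that follow the "Messages:" <b> element.
xpath='/html/body/p/b[contains(., "Messages:")]/following-sibling::text()'

# Raw text node: still carries the whitespace that follows </b>.
raw=$(printf '%s\n' "$html" | xmllint --html --xpath "$xpath" - 2>/dev/null)

# normalize-space() strips leading and trailing whitespace from the string value.
count=$(printf '%s\n' "$html" | xmllint --html --xpath "normalize-space($xpath)" - 2>/dev/null)

printf 'raw=[%s] count=[%s]\n' "$raw" "$count"
```

The trailing `-` tells xmllint to read the document from standard input, as in the commands from the question. With this sample, `count` should hold just the number, while `raw` keeps the leading space.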