I have the following xmllint example, which selects an element:
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]' -
<b>Messages:</b>
The number of messages I am interested in comes right after the bold element. It shows up when I use the parent axis:
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/parent::*' -
<p><b>Starting:</b> <i>Thu Jan 1 23:17:09 CET 2015</i><br><b>Ending:</b> <i>Sat Jan 31 14:51:07 CET 2015</i><br><b>Messages:</b> 28</p>
I thought the following-sibling axis would give me exactly this number, but it does not:
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::*' -
XPath set is empty
The text you are after is indeed a following sibling, but it is a text node, not an element node. An expression like following-sibling::* only looks for following siblings that are elements. To match text nodes, use text():
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::text()' -
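To see the difference between the node tests without fetching the page, here is a minimal sketch. It assumes a local file named siblings.html (a hypothetical name) holding a cut-down copy of the paragraph from the question, and it wraps each expression in count() so that xmllint prints a plain number for each query.

```shell
# Hypothetical sample file, cut down from the <p> element shown in the question.
cat > siblings.html <<'EOF'
<html><body><p><b>Starting:</b> <i>Thu Jan 1 23:17:09 CET 2015</i><br><b>Messages:</b> 28</p></body></html>
EOF

# Path to the bold "Messages:" element, reused by the three queries below.
b='/html/body/p/b[contains(., "Messages:")]'

# count() returns a number, so xmllint prints one figure per query.
n_elements=$(xmllint --html --xpath "count($b/following-sibling::*)" siblings.html)
n_text=$(xmllint --html --xpath "count($b/following-sibling::text())" siblings.html)
n_any=$(xmllint --html --xpath "count($b/following-sibling::node())" siblings.html)

echo "elements: $n_elements, text nodes: $n_text, any node: $n_any"
```

With this sample, the element test should match nothing, while text() and node() should each match the single text node that follows the bold element.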
The piped curl commands above do not work on my computer (bash on Mac OS X), but I trust they work for you. If I first save the output of curl to example.html and then run
$ xmllint example.html --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::text()'
the result is _28. That is not really an underscore; I used it to mark the leading whitespace character. To remove the leading whitespace, use
$ xmllint example.html --html --xpath 'normalize-space(/html/body/p/b[contains(., "Messages:")]/following-sibling::text())'
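If the number is needed in a script, the output of the normalize-space() query can be captured in a shell variable. The following is only a sketch under a few assumptions: sample.html is a hypothetical local file containing the paragraph from the question, and the variable names are made up for illustration.

```shell
# Hypothetical sample file containing the <p> element shown in the question.
cat > sample.html <<'EOF'
<html><body>
<p><b>Starting:</b> <i>Thu Jan 1 23:17:09 CET 2015</i><br><b>Ending:</b> <i>Sat Jan 31 14:51:07 CET 2015</i><br><b>Messages:</b> 28</p>
</body></html>
EOF

# normalize-space() converts the text node to a string and trims the whitespace around it.
xpath='normalize-space(/html/body/p/b[contains(., "Messages:")]/following-sibling::text())'

# Command substitution drops the trailing newline that xmllint prints.
count=$(xmllint --html --xpath "$xpath" sample.html)

echo "messages: $count"
```

For the sample file this should print "messages: 28", with no leading space in the captured value.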
And no, using regex is not really an option.