I have used XPath extensively in the past. Currently I am facing a problem which I am unable to solve.
Constraints
HTML-Markup
<span class="container">
Peter: Lorem Impsum
<i class="divider" role="img" aria-label="|"></i>
Paul Smith: Foo Bar BAZ
<i class="divider" role="img" aria-label="|"></i>
Mary: One Two Three
</span>
Challenge
I want to extract the three coherent strings:
Peter: Lorem Impsum
Paul Smith: Foo Bar BAZ
Mary: One Two Three
XPath
The following XPath queries are the best I've come up with after HOURS of research:
XPath-query 1
//span[contains(@class, "container")]
=> Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three
XPath-query 2
//span[contains(@class, "container")]//text()
=> Peter: Lorem Impsum Paul Smith: Foo Bar BAZ Mary: One Two Three
Problem
Although it is possible to post-process the resulting string using (PHP) string functions afterwards, I am not able to split it into the correct three chunks: I need an XPath-query which enables me to distinguish the text-nodes correctly.
Is it possible to integrate some "artificial separators" between the text-nodes?
You're expecting too much from XPath 1.0. XPath 1.0 itself can help you here to select the text nodes.
Then, you'll have to complete your processing outside of XPath (as Mads suggests in the comments).
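As a rough illustration of finishing the job outside XPath, here is a minimal sketch in Python using only the standard library's `xml.etree.ElementTree`. This is an assumption made for the sake of a self-contained example: the question uses PHP, where the same idea would apply to the node list returned by the XPath query. ElementTree has no `text()` step, so `itertext()` stands in for `//text()`; the names `MARKUP` and `container_chunks` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Markup from the question; it happens to be well-formed XML, so ElementTree can parse it.
MARKUP = """<span class="container">
Peter: Lorem Impsum
<i class="divider" role="img" aria-label="|"></i>
Paul Smith: Foo Bar BAZ
<i class="divider" role="img" aria-label="|"></i>
Mary: One Two Three
</span>"""


def container_chunks(root):
    """Return the trimmed, non-empty text nodes found inside container spans."""
    chunks = []
    # iter("span") also visits the root element itself.
    for span in root.iter("span"):
        # Substring test, mirroring contains(@class, "container").
        if "container" in span.get("class", ""):
            # itertext() yields each text node separately, in document order.
            for text in span.itertext():
                text = text.strip()
                if text:
                    chunks.append(text)
    return chunks


chunks = container_chunks(ET.fromstring(MARKUP))
print(chunks)
```

The key point is that the text nodes arrive as separate items, so there is nothing to split: trimming whitespace and dropping empty entries is all the post-processing this markup needs.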
To understand the limits you're running up against, consider your first XPath,
//span[contains(@class, "container")]
selects a nodeset of span elements. The environment in which XPath 1.0 is operating is showing you (some variation of) the string value of the single such node in your document:
Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three
But be clear: Your XPath is selecting a nodeset of span elements, not strings here.
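To make the "string value" idea concrete: the string value of an element is the concatenation of all its descendant text nodes in document order, which is why the three chunks run together. A small sketch, assuming Python's standard `xml.etree.ElementTree` as a stand-in for whatever environment actually evaluates the XPath (`MARKUP` and `string_value` are illustrative names):

```python
import xml.etree.ElementTree as ET

# Markup from the question, parsed as XML.
MARKUP = """<span class="container">
Peter: Lorem Impsum
<i class="divider" role="img" aria-label="|"></i>
Paul Smith: Foo Bar BAZ
<i class="divider" role="img" aria-label="|"></i>
Mary: One Two Three
</span>"""

span = ET.fromstring(MARKUP)

# Concatenate every descendant text node in document order.
# The empty <i> elements contribute no text, so nothing marks the boundaries.
string_value = "".join(span.itertext())

# repr() makes the line breaks that surround each chunk visible.
print(repr(string_value))
```

Once the nodes have been flattened into a single string like this, the divider positions are gone, so any splitting has to happen before that step.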
Your second XPath,
//span[contains(@class, "container")]//text()
selects a nodeset of text() nodes. The environment in which XPath 1.0 is operating is showing the string value of each selected text() node.
If you could use XPath 2.0, you could directly, within XPath, select a sequence of strings,
//span[contains(@class, "container")]/text()/string()
or you could join them,
string-join(//span[contains(@class, "container")]/text(), "|")
and directly get
Peter: Lorem Impsum
|
Paul Smith: Foo Bar BAZ
|
Mary: One Two Three
or
string-join(//span[contains(@class, "container")]/text()/normalize-space(), "|")
to get
Peter: Lorem Impsum|Paul Smith: Foo Bar BAZ|Mary: One Two Three
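If only XPath 1.0 is available, the effect of `string-join` and `normalize-space()` can be approximated in the host language. The following is a sketch under assumptions, not the XPath 2.0 implementation: it uses Python's standard `xml.etree.ElementTree`, where the direct child text nodes of an element are its `.text` plus the `.tail` of each child, and the helper `normalize_space` is a hypothetical approximation of the XPath function.

```python
import xml.etree.ElementTree as ET

# Markup from the question, parsed as XML.
MARKUP = """<span class="container">
Peter: Lorem Impsum
<i class="divider" role="img" aria-label="|"></i>
Paul Smith: Foo Bar BAZ
<i class="divider" role="img" aria-label="|"></i>
Mary: One Two Three
</span>"""


def normalize_space(s):
    """Approximate XPath normalize-space(): trim and collapse whitespace runs.

    str.split() treats more characters as whitespace than XPath does,
    which makes no difference for this markup.
    """
    return " ".join(s.split())


span = ET.fromstring(MARKUP)

# Direct child text nodes, mirroring /text() rather than //text():
# the element's leading text plus the tail text after each child element.
candidates = [span.text] + [child.tail for child in span]
direct_text = [t for t in candidates if t]

raw_joined = "|".join(direct_text)
clean_joined = "|".join(normalize_space(t) for t in direct_text)

print(repr(raw_joined))
print(clean_joined)
```

The joined string is handy for display, but if the goal is three separate values, keeping `direct_text` as a list avoids having to split on the separator again.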