
Improve XPath-query to distinguish text-nodes correctly


I have used XPath extensively in the past. Currently I am facing a problem that I am unable to solve.

Constraints

  • pure XPath 1.0
  • no aux-functions (e.g. no "concat()")

HTML-Markup

<span class="container">
    Peter: Lorem Impsum
    <i class="divider" role="img" aria-label="|"></i>
    Paul Smith: Foo Bar BAZ
    <i class="divider" role="img" aria-label="|"></i>
    Mary: One Two Three
</span>

Challenge

I want to extract the three coherent strings:

  • Peter: Lorem Impsum
  • Paul Smith: Foo Bar BAZ
  • Mary: One Two Three

XPath

The following XPath queries are the best I've come up with after HOURS of research:

XPath-query 1

//span[contains(@class, "container")]

=> Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three

XPath-query 2

//span[contains(@class, "container")]//text()

=> Peter: Lorem Impsum Paul Smith: Foo Bar BAZ Mary: One Two Three

Problem

Although it is possible to post-process the resulting string using (PHP) string functions afterwards, I am not able to split it into the correct three chunks: I need an XPath-query which enables me to distinguish the text-nodes correctly.

Is it possible to integrate some "artificial separators" between the text-nodes?


Solution

  • You're expecting too much from XPath 1.0. By itself, XPath 1.0 can only help you here by selecting

    1. a string, or
    2. a set of text nodes

    Then, you'll have to complete your processing outside of XPath (as Mads suggests in the comments).

    To understand the limits you're hitting, consider your first XPath,

    //span[contains(@class, "container")]
    

    selects a nodeset of span elements. The environment in which XPath 1.0 is operating is showing you (some variation of) the string value of the single such node in your document:

    Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three
    

    But be clear: Your XPath is selecting a nodeset of span elements, not strings here.

    Your second XPath,

    //span[contains(@class, "container")]//text()
    

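    selects a nodeset of text() nodes. The environment in which XPath 1.0 is operating is showing the string value of each selected text() node. Because these are separate nodes, the calling code can iterate over them one by one instead of splitting a single concatenated string.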
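
As a rough illustration of finishing the job outside of XPath, here is a minimal sketch in Python using only the standard library. It is not the asker's PHP setup, and several things are assumptions: the markup is parsed as well-formed XML, a hypothetical `<div>` wrapper is added so the span is a descendant, and the helper name `container_text_chunks` is made up for this example. ElementTree implements only a subset of XPath (no `contains()`, no `text()`), so an exact attribute match and `itertext()` stand in for the original query.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a hypothetical <div> so the
# span is a descendant that findall() can reach.
MARKUP = """<div><span class="container">
    Peter: Lorem Impsum
    <i class="divider" role="img" aria-label="|"></i>
    Paul Smith: Foo Bar BAZ
    <i class="divider" role="img" aria-label="|"></i>
    Mary: One Two Three
</span></div>"""


def container_text_chunks(markup):
    """Return the whitespace-normalized text chunks under span.container."""
    root = ET.fromstring(markup)
    chunks = []
    # ElementTree supports only an XPath subset: an exact attribute match
    # stands in for contains(@class, "container").
    for span in root.findall(".//span[@class='container']"):
        # itertext() yields the text pieces between child elements one at a
        # time, in document order, roughly the role //text() plays above.
        for text in span.itertext():
            cleaned = " ".join(text.split())
            if cleaned:  # skip whitespace-only pieces
                chunks.append(cleaned)
    return chunks


print(container_text_chunks(MARKUP))
```

The point of the sketch is only that the separation between the three strings already exists in the nodeset; the host language keeps it by handling each node individually rather than reading the string value of the whole span.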

    If you could use XPath 2.0, you could directly, within XPath, select a sequence of strings,

    //span[contains(@class, "container")]/text()/string()
    

    or you could join them,

    string-join(//span[contains(@class, "container")]/text(), "|")
    

    and directly get

    Peter: Lorem Impsum
    |
    Paul Smith: Foo Bar BAZ
    |
    Mary: One Two Three
    

    or

    string-join(//span[contains(@class, "container")]/text()/normalize-space(), "|")
    

    to get

    Peter: Lorem Impsum|Paul Smith: Foo Bar BAZ|Mary: One Two Three
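
If you are limited to XPath 1.0, the effect of `string-join(.../normalize-space(), "|")` can be approximated in the host language once the text nodes have been selected. The following is a sketch under stated assumptions: `RAW_TEXT_NODES` is a hand-written list standing in for the string values of the three text nodes (including their surrounding whitespace), and `normalize_space` and `string_join` are illustrative Python helpers named after the XPath functions, not part of any library.

```python
import re

# Assumed string values of the three text nodes, whitespace included.
RAW_TEXT_NODES = [
    "\n    Peter: Lorem Impsum\n    ",
    "\n    Paul Smith: Foo Bar BAZ\n    ",
    "\n    Mary: One Two Three\n",
]


def normalize_space(text):
    """Mimic XPath normalize-space(): collapse whitespace runs, then trim."""
    # XPath treats space, tab, carriage return, and line feed as whitespace.
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    return collapsed.strip(" ")


def string_join(items, separator):
    """Mimic XPath 2.0 string-join() for a list of strings."""
    return separator.join(items)


joined = string_join([normalize_space(t) for t in RAW_TEXT_NODES], "|")
print(joined)
```

This also shows where the "artificial separator" from the question comes from: it is added when the individual node values are joined, not by the XPath 1.0 expression itself.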