XPath 2 - select elements between

I am aware that this question has been asked before, for example here: XPath select all elements between two specific elements

But there, and in a few other Google hits, hard-coded values are used to select specific data.

What I would like to do is get a list of the text elements for each parent:

<doc>
    <divider />
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <divider />
    <p>text</p>
    <p>text</p>
    <divider />
    <p>text</p>
    <divider />
</doc>

To get the first group of text elements you can do:

/*/p[count(preceding-sibling::divider)=1]

But what I want as output is something like this:

[['<doc>'], ['<p>text</p>', '<p>text</p>', '<p>text</p>', '<p>text</p>', '<p>text</p>'], ['<p>text</p>', '<p>text</p>'], ['<p>text</p>']]

Now you have a list of every text element for divider 1, divider 2, divider x...

That is what you get from this Python code:

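matches = []
tmp = []
# Read the input line by line; the with block closes the file afterwards.
with open("inputfile", "r") as data:
    for line in data:
        current_line = line.strip()
        if "divider" in current_line:
            # A divider closes the current group (empty groups are skipped).
            if tmp:
                matches.append(tmp)
                tmp = []
        else:
            tmp.append(current_line)

print(matches)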

Yes, there's a 'doc' at the beginning; it's just an example, not perfect. With this code you could also save the parent in the same list, but in the test data that is always divider, so I did not do it.

What's the XPath magic for this?

Solution

In XPath 3.1 you can use e.g.

array {
let $dividers := //divider
return
  for-each-pair($dividers, tail($dividers), function($d1, $d2) {
    array { root($d1)//*[. >> $d1 and . << $d2] }
  })
}

to return an array of arrays.
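
For readers without an XPath 3.1 engine at hand, the pairing idea behind for-each-pair($dividers, tail($dividers), ...) can be sketched with only Python's standard library. This is a minimal sketch and not an equivalent of the XPath: it assumes the dividers and the elements between them are siblings under one parent, whereas the XPath compares document order across the whole tree. The helper name groups_between_dividers and the inline sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the question (inlined so the sketch is self-contained).
XML = """<doc>
    <divider />
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <divider />
    <p>text</p>
    <p>text</p>
    <divider />
    <p>text</p>
    <divider />
</doc>"""


def groups_between_dividers(parent, divider_tag="divider"):
    """Return one list of child elements per pair of adjacent dividers.

    Hypothetical helper: only direct children of `parent` are considered.
    """
    children = list(parent)
    # Positions of the divider elements among the children, in document order.
    positions = [i for i, child in enumerate(children) if child.tag == divider_tag]
    # zip(positions, positions[1:]) pairs each divider with its successor,
    # mirroring for-each-pair($dividers, tail($dividers), ...).
    return [children[start + 1:end] for start, end in zip(positions, positions[1:])]


doc = ET.fromstring(XML)
groups = groups_between_dividers(doc)

# Serialize each element; strip() drops the whitespace tail that tostring() includes.
as_text = [
    [ET.tostring(el, encoding="unicode").strip() for el in group]
    for group in groups
]
print(as_text)
```

For the sample document this gives three groups holding 5, 2 and 1 p elements, one group per pair of adjacent dividers.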

Online fiddle using the SaxonC HE Python package.

For Python, ElementPath also supports XPath 3.1.

Example Python code using SaxonC HE:

from saxonche import PySaxonProcessor

xpath = '''array {
  let $dividers := //divider
  return
    for-each-pair($dividers, tail($dividers), function($d1, $d2) {
      array { root($d1)//*[. >> $d1 and . << $d2] }
    })
}'''

with PySaxonProcessor(license=False) as saxon:
    xpath_processor = saxon.new_xpath_processor()

    xpath_processor.set_context(file_name='sample1.xml')

    xdm_result = xpath_processor.evaluate(xpath)

    print(xdm_result)

Example code using ElementPath:

import xml.etree.ElementTree as ET  # needed for ET.parse below

from elementpath import select
from elementpath.xpath3 import XPath3Parser

xpath = '''array {
  let $dividers := //divider
  return
    for-each-pair($dividers, tail($dividers), function($d1, $d2) {
      array { root($d1)//*[. >> $d1 and . << $d2] }
    })
}'''

root = ET.parse('sample1.xml')

result = select(root, xpath, parser=XPath3Parser)

print(result)

Actually, since ElementPath already returns the sequence result of select as a Python list, it may be better not to construct the outer array in XPath and instead return a sequence of arrays. That makes it easier to unwrap the XPath result into a nested Python list of element nodes:

xpath = '''let $dividers := //divider
  return
    for-each-pair($dividers, tail($dividers), function($d1, $d2) {
      array { root($d1)//*[. >> $d1 and . << $d2] }
    })
'''
result = select(root, xpath, parser=XPath3Parser)

result_list = [array.items() for array in result]

print(result_list)