I am aware that this question has been asked before, for example here: XPath select all elements between two specific elements
But there, and in a few other Google hits, they use hard-coded values to select specific data.
What I would like to do is get a list of the text elements grouped under each parent divider:
<doc>
<divider />
<p>text</p>
<p>text</p>
<p>text</p>
<p>text</p>
<p>text</p>
<divider />
<p>text</p>
<p>text</p>
<divider />
<p>text</p>
<divider />
</doc>
To get the first group of text elements you can use:
/*/p[count(preceding-sibling::divider)=1]
But what I want as output is something like this:
[['<doc>'], ['<p>text</p>', '<p>text</p>', '<p>text</p>', '<p>text</p>', '<p>text</p>'], ['<p>text</p>', '<p>text</p>'], ['<p>text</p>']]
That is, one list of text elements per divider (divider 1, divider 2, ..., divider x),
which is what this Python code produces:
matches = []
tmp = []
with open("inputfile", 'r') as data:
    for line in data:
        currentLine = line.strip()
        if 'divider' in currentLine:
            # a divider closes the current group; skip empty groups
            if len(tmp) > 0:
                matches.append(tmp)
            tmp = []
        else:
            tmp.append(currentLine)
print(matches)
Yes, there's a 'doc' at the beginning; it's just an example, not perfect. With this code you could also save the parent in the same list, but in the test data that's always divider, so I did not do it.
What's the XPath magic for this?
In XPath 3.1 you can use e.g.
array {
  let $dividers := //divider
  return
    for-each-pair($dividers, tail($dividers), function($d1, $d2) {
      array { root($d1)//*[. >> $d1 and . << $d2] }
    })
}
to return an array of arrays.
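To make the pairing logic easier to follow, here is a rough plain-Python sketch of what that expression computes, using only the standard library's ElementTree rather than an XPath 3.1 engine. The function name group_between_dividers and the inlined SAMPLE string are assumptions made for illustration; slicing a document-order list stands in for the >> and << node comparison operators.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, inlined so the sketch is self-contained.
SAMPLE = """<doc>
<divider />
<p>text</p>
<p>text</p>
<p>text</p>
<p>text</p>
<p>text</p>
<divider />
<p>text</p>
<p>text</p>
<divider />
<p>text</p>
<divider />
</doc>"""


def group_between_dividers(root, divider_tag="divider"):
    """Return one list of elements per pair of consecutive dividers."""
    # All elements in document order, comparable to root($d1)//*.
    nodes = list(root.iter())
    positions = [i for i, el in enumerate(nodes) if el.tag == divider_tag]
    # zip(positions, positions[1:]) pairs each divider with the next one,
    # comparable to for-each-pair($dividers, tail($dividers), ...).
    return [nodes[start + 1:end] for start, end in zip(positions, positions[1:])]


groups = group_between_dividers(ET.fromstring(SAMPLE))
print([[ET.tostring(el, encoding="unicode").strip() for el in g] for g in groups])
```

For the sample document this yields three groups holding 5, 2 and 1 p elements. Like the XPath version, it collects every element between two dividers in document order, not only direct siblings.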
There is an online fiddle using the SaxonC HE Python package.
For Python, ElementPath also supports XPath 3.1.
Example Python code using SaxonC HE:
from saxonche import PySaxonProcessor

xpath = '''array {
  let $dividers := //divider
  return
    for-each-pair($dividers, tail($dividers), function($d1, $d2) {
      array { root($d1)//*[. >> $d1 and . << $d2] }
    })
}'''

with PySaxonProcessor(license=False) as saxon:
    xpath_processor = saxon.new_xpath_processor()
    # use the parsed input file as the context item for the expression
    xpath_processor.set_context(file_name='sample1.xml')
    xdm_result = xpath_processor.evaluate(xpath)
    print(xdm_result)
Example code using ElementPath:
import xml.etree.ElementTree as ET

from elementpath import select
from elementpath.xpath3 import XPath3Parser

xpath = '''array {
  let $dividers := //divider
  return
    for-each-pair($dividers, tail($dividers), function($d1, $d2) {
      array { root($d1)//*[. >> $d1 and . << $d2] }
    })
}'''

root = ET.parse('sample1.xml')
result = select(root, xpath, parser=XPath3Parser)
print(result)
Actually, ElementPath's select already returns the sequence result as a Python list, so it might be better not to construct an additional outer array in XPath and instead return just a sequence of arrays. That way it is easier to unwrap the XPath result into a nested Python list of element nodes:
xpath = '''let $dividers := //divider
return
  for-each-pair($dividers, tail($dividers), function($d1, $d2) {
    array { root($d1)//*[. >> $d1 and . << $d2] }
  })
'''

result = select(root, xpath, parser=XPath3Parser)
# each item of the sequence is an XPath array; items() unwraps it into a Python list
result_list = [array.items() for array in result]
print(result_list)
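If the goal is the list of markup strings shown in the question rather than element objects, the nested list can be serialized in a further step. The following is a minimal sketch that assumes the items in the nested list are ElementTree elements; the helper name serialize_groups and the small stand-in document are hypothetical and only there to keep the example self-contained.

```python
import copy
import xml.etree.ElementTree as ET


def serialize_groups(groups):
    """Turn a nested list of elements into a nested list of markup strings."""
    serialized = []
    for group in groups:
        strings = []
        for element in group:
            # Work on a shallow copy without tail text so the whitespace
            # following the element is not included in the output.
            clone = copy.copy(element)
            clone.tail = None
            strings.append(ET.tostring(clone, encoding="unicode"))
        serialized.append(strings)
    return serialized


# Stand-in for a grouped result: two hand-built groups of <p> elements.
doc = ET.fromstring("<doc><divider/><p>a</p>\n<p>b</p>\n<divider/><p>c</p>\n<divider/></doc>")
paragraphs = doc.findall("p")
example_groups = [paragraphs[:2], paragraphs[2:]]
strings = serialize_groups(example_groups)
print(strings)
```

Copying each element before clearing its tail leaves the original tree untouched, which matters if the parsed document is used again afterwards.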