Search code examples
xmlxpath

Xpath 2 - select elements between


i am aware that this questions has been asked for example here: XPath select all elements between two specific elements

but there and in a few other google hits they use hard coded values to select specific data.

what i need would like todo is get a list of text with each parent:

<doc>
    <divider />
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <p>text</p>
    <divider />
    <p>text</p>
    <p>text</p>
    <divider />
    <p>text</p>
    <divider />
</doc>

to get the first text elements you can do:

/*/p[count(preceding-sibling::divider)=1]

but what i want as ouput is something like this:

[['<doc>'], ['<p>text</p>', '<p>text</p>', '<p>text</p>', '<p>text</p>', '<p>text</p>'], ['<p>text</p>', '<p>text</p>'], ['<p>text</p>']]

now you got a list of every text element for divider 1, divider 2, divider x...

which you get from this python code:

data = open("inputfile", 'r')

matches = []
tmp = []
for line in data.readlines():
    currentLine = line.strip()
    if 'divider' in currentLine:
        if len(tmp) > 0:
            matches.append(tmp)
            tmp = []
    else:
        tmp.append(currentLine)


print(matches)

yes, theres a 'doc' at the beginning, its just an example, not perfect. so with this code you can also save the parent in the same list, in the testdate thats always divider so i did not do it.

whats the xpath magic for this?


Solution

  • In XPath 3.1 you can use e.g.

    array {
    let $dividers := //divider
    return
      for-each-pair($dividers, tail($dividers), function($d1, $d2) {
        array { root($d1)//*[. >> $d1 and . << $d2] }
      })
    }
    

    to return an array of arrays.

    Online fiddle using SaxonCHE Python package.

    For Python ElementPath also supports XPath 3.1.

    Example Python code using SaxonC HE:

    from saxonche import PySaxonProcessor
    
    xpath = '''array {
      let $dividers := //divider
      return
        for-each-pair($dividers, tail($dividers), function($d1, $d2) {
          array { root($d1)//*[. >> $d1 and . << $d2] }
        })
    }'''
    
    with PySaxonProcessor(license=False) as saxon:
        xpath_processor = saxon.new_xpath_processor()
    
        xpath_processor.set_context(file_name='sample1.xml')
    
        xdm_result = xpath_processor.evaluate(xpath)
    
        print(xdm_result)
    

    Example code using ElementPath:

    from elementpath import select
    from elementpath.xpath3 import XPath3Parser
    
    xpath = '''array {
      let $dividers := //divider
      return
        for-each-pair($dividers, tail($dividers), function($d1, $d2) {
          array { root($d1)//*[. >> $d1 and . << $d2] }
        })
    }'''
    
    root = ET.parse('sample1.xml')
    
    result = select(root, xpath, parser=XPath3Parser)
    
    print(result)
    

    Actually, for ElementPath, as it already returns the sequence result of select as a Python list, it might be better not to construct an additional array in XPath but just return a sequence of arrays, as that way it is easier to unwrap the XPath result into a nested Python list of element nodes:

    xpath = '''let $dividers := //divider
      return
        for-each-pair($dividers, tail($dividers), function($d1, $d2) {
          array { root($d1)//*[. >> $d1 and . << $d2] }
        })
    '''
    result = select(root, xpath, parser=XPath3Parser)
    
    result_list = [array.items() for array in result]
    
    print(result_list)