python xml xpath lxml tei

xpath for selecting xml elements between two milestones/empty elements


In the following XML file, I have encoded the structure of a text as div elements. The layout of the book containing the text (two columns per page) is encoded with empty pb (page beginning) and cb (column beginning) elements.

XML/TEI input:

<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
    <fileDesc>
        <titleStmt>
            <title type="main" xml:lang="en">Testfile</title>
        </titleStmt>
        <publicationStmt>
            <p>Test</p>
        </publicationStmt>
        <sourceDesc>
            <p>Testfile</p></sourceDesc>
    </fileDesc>
</teiHeader>
    
    
    <text>
        <body>
            <pb n="1r"/><fw type="header">Some header</fw>
            <cb n="a"/>
            <lb/><div n="1"><p>Line 1.1
                <lb/>Line 1.2
                <lb/>Line 1.3
                <lb/>Line 1.4
            </p></div>
            <cb n="b"/>
            <lb/><div n="2"><p>Line 2.1
                <lb/>Line 2.2
                <lb/>Line 2.3
                <lb/>Line 2.4
                <pb n="1v"/><fw type="header">Some header</fw>
                <cb n="a"/>
                <lb/>Line 1.1
                <lb/>Line 1.2
                <lb/>Line 1.3
                <lb/>Line 1.4
            </p></div>
            <cb n="b"/>
            <lb/><div n="2"><p>Line 1.1
                <lb/>Line 1.2
                <lb/>Line 1.3
                <lb/>Line 1.4
            </p></div>
        </body>
    </text>
</TEI>

What I want

I want to iterate through the tree using lxml.etree and XPath and select all the lb elements of one column, for instance all lb elements between <pb n="1r"/><fw type="header">Some header</fw><cb n="a"/> and the first <cb n="b"/> element after it.

What I have tried

I used the following XPath expression for that:

//lb[preceding::pb[@n="1r"] and following::cb[@n="b"]]

However, it selects not only the expected elements but also every other lb element that is followed by a <cb n="b"/> element.

I have also tried to limit to the first occurrence of <cb n="b"/>, but it did not change the result:

//lb[preceding::pb[@n="1r"] and following::cb[@n="b"][1]]

I have already looked at similar questions, such as "XPath select all elements between two specific elements", but the suggested answers did not work when selecting the right pb by its @n attribute.

Can someone point me in the right direction on how to select only the lb elements of a given column?

Edit: with lxml.etree, the element names in the XPath expression need the tei namespace prefix for the accepted answer to work:

root.xpath('.//tei:lb[preceding::tei:pb[@n="1r"] and count(preceding::tei:cb[@n="b"]) = 0]', namespaces={'tei': 'http://www.tei-c.org/ns/1.0'})
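
For anyone who wants to try this, here is a minimal runnable sketch of the namespaced query. The inline sample is a shortened stand-in for the file above (header and fw elements omitted, which is my simplification), it queries page 1r, the first page of the sample, and the names SAMPLE, NS, unprefixed and first_column are only illustrative.

```python
from lxml import etree

# Shortened stand-in for the TEI file above (assumption: header and fw omitted).
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<pb n="1r"/><cb n="a"/>
<lb/><div n="1"><p>Line 1.1<lb/>Line 1.2<lb/>Line 1.3<lb/>Line 1.4</p></div>
<cb n="b"/>
<lb/><div n="2"><p>Line 2.1<lb/>Line 2.2<lb/>Line 2.3<lb/>Line 2.4</p></div>
</body></text></TEI>"""

# Map an arbitrary prefix to the TEI namespace URI declared on the root element.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = etree.fromstring(SAMPLE)

# Unprefixed names match nothing, because every element is in the TEI namespace.
unprefixed = root.xpath(".//lb")

# With the prefix mapped, the expression from the accepted answer matches column a.
first_column = root.xpath(
    './/tei:lb[preceding::tei:pb[@n="1r"] and count(preceding::tei:cb[@n="b"]) = 0]',
    namespaces=NS,
)

print(len(unprefixed), len(first_column))  # 0 4
```

The prefix itself is arbitrary; only the URI it maps to has to match the xmlns declared in the document.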

Solution

  • Try the following XPath expression:

    //lb[preceding::pb[@n="1r"] and count(preceding::cb[@n='b']) = 0]
    

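    The predicate count(preceding::cb[@n='b']) = 0 keeps only lb elements that have no <cb n="b"/> element before them in document order, so every lb that comes after a <cb n="b"/> is excluded. Because it counts every earlier <cb n="b"/> in the document, it isolates only column a of the first page.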
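
A possible generalisation, offered as a sketch rather than as part of the accepted answer: instead of counting earlier milestones, test the nearest preceding pb and cb of each lb. On the reverse axis preceding::, the positional predicate [1] picks the closest earlier node, so preceding::cb[1][@n="a"] reads as "the most recent cb is column a". The helper name lbs_in_column, the XPath variables $page and $column, and the shortened sample are assumptions of mine; passing XPath variables as keyword arguments is standard lxml behaviour.

```python
from lxml import etree

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Shortened stand-in for the question's file (assumption: header and fw omitted).
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<pb n="1r"/><cb n="a"/>
<lb/><div n="1"><p>Line 1.1<lb/>Line 1.2</p></div>
<cb n="b"/>
<lb/><div n="2"><p>Line 2.1<lb/>Line 2.2
<pb n="1v"/><cb n="a"/>
<lb/>Line 1.1<lb/>Line 1.2<lb/>Line 1.3</p></div>
<cb n="b"/>
<lb/><div n="3"><p>Line 1.1</p></div>
</body></text></TEI>"""


def lbs_in_column(root, page, column):
    """Return the lb elements whose nearest preceding pb and cb match.

    Hypothetical helper: `page` and `column` are compared with the @n
    attributes of the closest earlier pb and cb milestones.
    """
    expr = (
        ".//tei:lb[preceding::tei:pb[1][@n=$page]"
        " and preceding::tei:cb[1][@n=$column]]"
    )
    # lxml binds keyword arguments to the XPath variables $page and $column.
    return root.xpath(expr, namespaces=NS, page=page, column=column)


root = etree.fromstring(SAMPLE)
counts = {
    (page, col): len(lbs_in_column(root, page, col))
    for page in ("1r", "1v")
    for col in ("a", "b")
}
print(counts)
# {('1r', 'a'): 2, ('1r', 'b'): 2, ('1v', 'a'): 3, ('1v', 'b'): 1}
```

This also works when a page break sits inside a p, as with <pb n="1v"/> here, because the preceding axis follows document order regardless of nesting. It assumes pb and cb stay empty milestones, so they are never ancestors of an lb.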