In the following XML file, I have encoded the structure of a text as div elements, as well as the layout (two columns) of the book containing the text, using empty pb (page beginning) and cb (column beginning) elements.
XML/TEI input:
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt>
<title type="main" xml:lang="en">Testfile</title>
</titleStmt>
<publicationStmt>
<p>Test</p>
</publicationStmt>
<sourceDesc>
<p>Testfile</p></sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<pb n="1r"/><fw type="header">Some header</fw>
<cb n="a"/>
<lb/><div n="1"><p>Line 1.1
<lb/>Line 1.2
<lb/>Line 1.3
<lb/>Line 1.4
</p></div>
<cb n="b"/>
<lb/><div n="2"><p>Line 2.1
<lb/>Line 2.2
<lb/>Line 2.3
<lb/>Line 2.4
<pb n="1v"/><fw type="header">Some header</fw>
<cb n="a"/>
<lb/>Line 1.1
<lb/>Line 1.2
<lb/>Line 1.3
<lb/>Line 1.4
</p></div>
<cb n="b"/>
<lb/><div n="2"><p>Line 1.1
<lb/>Line 1.2
<lb/>Line 1.3
<lb/>Line 1.4
</p></div>
</body>
</text>
</TEI>
What I want
Now I want to iterate through the tree using lxml.etree and XPath to select all the lb elements of a column, for instance all lb elements between
<pb n="1r"/><fw type="header">Some header</fw><cb n="a"/>
and the first <cb n="b"/> element thereafter.
What I have tried
I used the following XPath expression for that:
//lb[preceding::pb[@n="1r"] and following::cb[@n="b"]]
However, it selects not only the expected elements, but also every other lb element that is followed by a <cb n="b"/> element.
I have also tried to limit the match to the first occurrence of <cb n="b"/>, but it did not change the result:
//lb[preceding::pb[@n="1r"] and following::cb[@n="b"][1]]
I have already looked at similar questions such as "XPath select all elements between two specific elements", but the suggested answers did not work when selecting the right pb by its @n attribute.
Can someone point me in the right direction on how to select only the lb elements of a given column?
Edit:
Note: in etree, the tei namespace has to be added to the XPath expression for the accepted answer to work:
root.xpath('.//tei:lb[preceding::tei:pb[@n="2r"] and count(preceding::tei:cb[@n="b"]) = 0]', namespaces = {'tei':'http://www.tei-c.org/ns/1.0'})
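To illustrate the namespace point, here is a minimal sketch, assuming lxml is installed. SAMPLE is a cut-down stand-in for the file above (header dropped, fewer lines), and the names SAMPLE, TEI_NS and EXPR are made up for this example. It uses n="1r" because that is the page number present in the sample.

```python
from lxml import etree

# Prefix-to-URI mapping; the prefix "tei" is arbitrary, only the URI matters.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Cut-down stand-in for the TEI file (assumption: same layout, fewer lines).
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<pb n="1r"/><fw type="header">Some header</fw>
<cb n="a"/>
<lb/><div n="1"><p>Line 1.1
<lb/>Line 1.2
</p></div>
<cb n="b"/>
<lb/><div n="2"><p>Line 2.1
<lb/>Line 2.2
<pb n="1v"/><fw type="header">Some header</fw>
<cb n="a"/>
<lb/>Line 1.1
</p></div>
<cb n="b"/>
<lb/><div n="2"><p>Line 1.1
</p></div>
</body></text></TEI>"""

root = etree.fromstring(SAMPLE)

# Every element name in the expression carries the tei: prefix.
EXPR = ('.//tei:lb[preceding::tei:pb[@n="1r"] '
        'and count(preceding::tei:cb[@n="b"]) = 0]')
with_prefix = root.xpath(EXPR, namespaces=TEI_NS)

# The same expression without prefixes looks for elements in no namespace
# and therefore matches nothing in a TEI document.
without_prefix = root.xpath(
    './/lb[preceding::pb[@n="1r"] and count(preceding::cb[@n="b"]) = 0]')

print(len(with_prefix), len(without_prefix))
```

The unprefixed query does not raise an error; it simply returns an empty list, which makes a missing namespace mapping easy to overlook.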
Could you try the following XPath expression:
//lb[preceding::pb[@n="1r"] and count(preceding::cb[@n='b']) = 0]
The predicate count(preceding::cb[@n='b']) = 0 should exclude lb elements that come after a <cb n="b"/> element.
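One caveat with the count(...) = 0 predicate: it only isolates column a of the first page, because every lb on a later page has some <cb n="b"/> before it. A possible variant, sketched here under the assumption that each lb belongs to the nearest preceding pb and cb, tests those nearest markers instead. The helper column_lbs and the cut-down SAMPLE are hypothetical names for this example; $page and $column are XPath variables that lxml fills from keyword arguments.

```python
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Cut-down stand-in for the TEI file (assumption: same layout, fewer lines).
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<pb n="1r"/><fw type="header">Some header</fw>
<cb n="a"/>
<lb/><div n="1"><p>Line 1.1
<lb/>Line 1.2
</p></div>
<cb n="b"/>
<lb/><div n="2"><p>Line 2.1
<lb/>Line 2.2
<pb n="1v"/><fw type="header">Some header</fw>
<cb n="a"/>
<lb/>Line 1.1
</p></div>
<cb n="b"/>
<lb/><div n="2"><p>Line 1.1
</p></div>
</body></text></TEI>"""

root = etree.fromstring(SAMPLE)


def column_lbs(tree, page, column):
    """Return the tei:lb elements of one column on one page.

    preceding::x[1] is the nearest preceding x in document order, so an lb
    matches only if the last pb before it is `page` and the last cb before
    it is `column`.
    """
    expr = ('.//tei:lb[preceding::tei:pb[1][@n=$page] '
            'and preceding::tei:cb[1][@n=$column]]')
    return tree.xpath(expr, namespaces=TEI_NS, page=page, column=column)


# The count(...) = 0 form finds nothing on the second page of the sample.
count_form_1v = root.xpath(
    './/tei:lb[preceding::tei:pb[@n="1v"] '
    'and count(preceding::tei:cb[@n="b"]) = 0]',
    namespaces=TEI_NS)

for page in ("1r", "1v"):
    for column in ("a", "b"):
        print(page, column, len(column_lbs(root, page, column)))
```

Passing the page and column as XPath variables avoids building the expression with string formatting, so the same query can be reused for every column.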