xpath, xquery, basex

XQuery: look for node with descendants in a certain order


I have an XML file that represents the syntax trees of all the sentences in a book:

<book>
    <sentence>
        <w class="pronoun" role="subject">
            I
        </w>
        <wg type="verb phrase">
            <w class="verb" role="verb">
                like
            </w>
            <wg type="noun phrase" role="object">
                <w class="adj">
                    green
                </w>
                <w class="noun">
                    eggs
                </w>
            </wg>
        </wg>
    </sentence>
    <sentence>
        ...
    </sentence>
    ...
</book>

This example is fake, but the point is that the actual words (the <w> elements) are nested in unpredictable ways based on syntactic relationships.

What I'm trying to do is find <sentence> nodes with <w> descendants matching particular criteria in a certain order. For example, I may be looking for a sentence with a w[@class='pronoun'] descendant followed by a w[@class='verb'] descendant.

It's easy to find sentences that just contain both descendants, without caring about ordering:

//sentence[descendant::w[criteria1] and descendant::w[criteria2]]

I did manage to write a query that does what I want. It looks for a <w> matching the first criteria that has a following <w> matching the second criteria and sharing the same closest <sentence> ancestor:

for $sentence in //sentence
where $sentence[descendant::w[criteria1 and 
    following::w[(ancestor::sentence[1] = $sentence) and criteria2]]]
return ...

...but unfortunately it's very slow, and I'm not sure why.

Is there a non-slow way to search for a node that contains descendants matching criteria in a certain order? I'm using XQuery 3.1 with BaseX. If I can't find a reasonable way to do this with XQuery, plan B is to do post-processing with Python.


Solution

  • The following axis is indeed expensive: it spans every node in the document that comes after the context node in document order, excluding its descendants. Each candidate <w> therefore scans far beyond its own sentence.

    The node comparison operators (<<, >>, is) may help you here. The example below checks whether a sentence contains at least one verb that precedes a noun in document order:

    for $sentence in //sentence
    let $words1 := $sentence//w[@class = 'verb']
    let $words2 := $sentence//w[@class = 'noun']
    where some $w1 in $words1 satisfies 
          some $w2 in $words2 satisfies $w1 << $w2
    return $sentence
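
    For a two-step pattern, the nested quantifiers can probably be avoided: some verb precedes some noun exactly when the first verb precedes the last noun. Here is a minimal, self-contained sketch of that idea. The inline data is modelled on the question; the second sentence and the id attributes are invented for illustration and are not part of the original file.

    ```xquery
    (: Sample data: s1 has a verb before a noun, s2 has them the other way round. :)
    let $book :=
      <book>
        <sentence id="s1">
          <w class="pronoun" role="subject">I</w>
          <wg type="verb phrase">
            <w class="verb" role="verb">like</w>
            <wg type="noun phrase" role="object">
              <w class="adj">green</w>
              <w class="noun">eggs</w>
            </wg>
          </wg>
        </sentence>
        <sentence id="s2">
          <wg type="noun phrase">
            <w class="noun">eggs</w>
          </wg>
          <w class="verb">vanish</w>
        </sentence>
      </book>
    for $sentence in $book/sentence
    (: Earliest verb and latest noun within this sentence only. :)
    let $first-verb := ($sentence//w[@class = 'verb'])[1]
    let $last-noun  := ($sentence//w[@class = 'noun'])[last()]
    (: If either side is empty, << yields the empty sequence, which where treats as false. :)
    where $first-verb << $last-noun
    return string($sentence/@id)
    ```

    With this data the query should return only s1. The shortcut covers two criteria; for three or more words in sequence, the quantified form above generalizes more directly. Note also that the = in the question's query compares the string values of the two sentences rather than their identity; is is the operator that tests whether two expressions refer to the same node.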