Using BaseX 9.7.3, I have a sorted list of names that has been produced using a tumbling window
clause.
A snippet of the data looks like this:
<data>
<group>
<key id="0c7b0bca-0349-489c-b45f-2612f3134a76">ovid</key>
<key id="f77ab9c2-0be3-4348-809d-ab245e630f81">ovid 43 b c-17 or 18 a d</key>
</group>
<group>
<key id="39b9d6c2-85a5-4c72-a83e-2a52e548fc3b">ovid 43 bc</key>
<key id="acf5b3c0-8fd4-4e0c-950b-a40683bab431">ovid 43 bc-17 ad</key>
<key id="cc57be53-9ca8-4b5e-97cf-1aeca798cded">ovid 43 bc-17 ad or 18 a</key>
<key id="8395e750-1e52-4152-9d37-8c8f4e389fd3">ovid 43 bc-17 ad or 18 ad</key>
</group>
<group>
<key id="0be07fc6-d9bf-4d56-8352-1885b4dd6574">ovid 43 bc-17 or 18</key>
<key id="e3aafc69-56b0-4632-a96c-26ca448c6c2d">ovid 43 bc-17 or 18 ad</key>
</group>
<group>
<key id="f9615365-4a32-442b-9e20-9c5abb0e6fa0">ovide</key>
<key id="c7b45a8d-79a3-4e79-b32b-8d918f67a7b0">ovide 0043 av j-c-0017</key>
</group>
</data>
I would like to further group the data so that, in this example, a group would begin with "ovid" and end with "ovid 43 bc-17 or 18 ad."
Desired output:
<data>
<group>
<key id="0c7b0bca-0349-489c-b45f-2612f3134a76">ovid</key>
<key id="f77ab9c2-0be3-4348-809d-ab245e630f81">ovid 43 b c-17 or 18 a d</key>
<key id="39b9d6c2-85a5-4c72-a83e-2a52e548fc3b">ovid 43 bc</key>
<key id="acf5b3c0-8fd4-4e0c-950b-a40683bab431">ovid 43 bc-17 ad</key>
<key id="cc57be53-9ca8-4b5e-97cf-1aeca798cded">ovid 43 bc-17 ad or 18 a</key>
<key id="8395e750-1e52-4152-9d37-8c8f4e389fd3">ovid 43 bc-17 ad or 18 ad</key>
<key id="0be07fc6-d9bf-4d56-8352-1885b4dd6574">ovid 43 bc-17 or 18</key>
<key id="e3aafc69-56b0-4632-a96c-26ca448c6c2d">ovid 43 bc-17 or 18 ad</key>
</group>
<group>
<key id="f9615365-4a32-442b-9e20-9c5abb0e6fa0">ovide</key>
<key id="c7b45a8d-79a3-4e79-b32b-8d918f67a7b0">ovide 0043 av j-c-0017</key>
</group>
</data>
I have the following query, but it simply reproduces the input document:
<data>{
  for tumbling window $entry in /*/group/key
    start $s at $sp previous $sprev next $snext when starts-with($snext, $s)
    end $e at $ep next $enext when not(starts-with($enext, $e))
  return
    <group>{
      for $k in $entry
      return (
        <key id="{$k/@id}">{data($k)}</key>
      )
    }</group>
}</data>
Is it possible to compare the start item of the first group ("ovid") to subsequent entries that start with that token? I want to exclude "ovide," even though it starts with "ovid."
With extended (Java-like) regular expressions, as supported in Saxon, I think
for tumbling window $w in /data/group/key
  start $s when true()
  end next $n when not(matches($n, '^' || $s || '\b', ';j'))
return
  <group>{$w}</group>
gives the two groups you want.
I have now also checked that the ';j' flag works with BaseX 9.7.2.
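If you would rather not depend on the ';j' flag, or if a name could contain characters that are special in a regular expression (the regex above inserts $s into the pattern unescaped), a plain string comparison might be enough. This is only a sketch under one assumption that is mine, not something stated in the question: a key belongs to the current group when it is either identical to the group's first key or continues it with a space.

```xquery
(: Sketch: same windowing idea without regular expressions.
   Assumption (not from the question): within a group, the first key
   is either repeated exactly or followed by a single space. :)
<data>{
  for tumbling window $w in /data/group/key
    (: every item that is not inside a window opens a new one :)
    start $s when true()
    (: close the window when the NEXT key no longer continues the
       window's FIRST key; "ovide" fails both tests for "ovid" :)
    end next $n when not($n = $s or starts-with($n, $s || ' '))
  return
    <group>{$w}</group>
}</data>
```

As in the regex version, the end condition compares the next item with the window's start item $s rather than with the current end item, so the whole "ovid ..." run stays in one group and "ovide" starts a new one. If other separators (a comma or a hyphen, say) can follow the first key, the starts-with test would need to be extended accordingly.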