I'm facing an interesting problem, where I need to go through multiple big XML files (each is hundreds of MBs) and output specific data from each of the elements, and do it as fast as possible. Examples:
Points.xml:
<points>
<point>
<identifier>bb25c66c-27d0-447f-aaad-bd8290b332fd</identifier>
<name>A</name>
</point>
<point>
<identifier>f187cc74-2709-4464-995c-b3bdcae46b39</identifier>
<name>B</name>
</point>
</points>
Routes.xml:
<routes>
<route>
<pointLink xlink:href="urn:uuid:bb25c66c-27d0-447f-aaad-bd8290b332fd"/>
<name>1</name>
</route>
<route>
<pointLink xlink:href="urn:uuid:f187cc74-2709-4464-995c-b3bdcae46b39"/>
<name>2</name>
</route>
</routes>
There are tens of thousands of point/route elements in different documents and they are much more complicated than this, but for this example, this excerpt will suffice.
The output needs to be something like this:
1 - A
2 - B
I cannot modify the documents themselves, this is what I have to work with, and now the question is - how to link the elements based on the identifier as efficiently as possible? I'm talking about the way to do this, maybe a different way to think about this, not the actual code, since it will be quite different in the end anyway.
I have tried looping over the routes and then finding the points using FLWOR:
for $route in doc('routes.xml')/routes/route
return concat(
$route/name/text(),
' - ',
doc('points.xml')/points/point[./identifier/text() = substring-after($route/pointLink/@xlink:href, 'urn:uuid:')]/name/text()
)
That didn't work out really well (took almost an hour to complete). Similar story with this approach too:
for $route in doc('routes.xml')/routes/route,
$point in doc('points.xml')/points/point[./identifier/text() = substring-after($route/pointLink/@xlink:href, 'urn:uuid:')]
return concat(
$route/name/text(),
' - ',
$point/name/text()
)
In the end, I will need to use more sub elements from point/route in the output so I think I have to iterate over them with for and then concat the output, but maybe I am wrong, which is why I am asking here.
Is there something I am overlooking, or is there simply no faster way to do this?
The problem was indeed indexing, as Martin Honnen said in comment. Simply creating an attribute index (CREATE INDEX attribute) helped to reduce the query time from ~45 minutes to less than a second. Incredible.