Let's say we have a book in XML format. This book consists of many assets, and these can reference each other via a ref-asset tag with a path attribute [path mask: {id}|{version} of the target asset].
Important: Assets are single files and there is no merged, complete file.
Example XML (merged here only for a better overview):
<book>
  <!-- file a.xml -->
  <asset id="1" version="1.0">
    <name>Prolog</name>
  </asset>
  <!-- file b.xml -->
  <asset id="2" version="2">
    <name>Table of content</name>
    <list>
      <item><ref-asset path="1|1.0">Prolog</ref-asset></item>
      <item><ref-asset path="2|2.0">Table of content</ref-asset></item>
      <item><ref-asset path="3|1.1">FooBar</ref-asset></item>
    </list>
  </asset>
  <!-- file c.xml -->
  <asset id="3" version="1.1">
    <name>FooBar</name>
  </asset>
</book>
The goal is to check, for each ref-asset, whether the linked target is in the book.
First attempt, function collection() + function document():
Find all the single asset files on the filesystem via collection(), load them into the process via document(), and search for matching hits.
Second attempt, a merged, complete file:
Merge all single assets into one book and match via xsl:key or similar techniques.
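For the merged variant, a minimal sketch of the xsl:key lookup could look like the following. It assumes the merged file has the shape of the example above; the key name asset-by-path and the plain-text report format are made up for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Index every asset in the merged book by "{id}|{version}" -->
  <xsl:key name="asset-by-path" match="asset"
           use="concat(@id, '|', @version)"/>

  <!-- Report each ref-asset whose target is not in the same document -->
  <xsl:template match="/">
    <xsl:for-each select="//ref-asset[not(key('asset-by-path', @path))]">
      <xsl:value-of select="concat('Missing target: ', @path, '&#10;')"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Note that the key values are compared as strings, so with the example data the reference 2|2.0 would be reported as missing: asset 2 declares version="2", which gives the key 2|2.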
Is collection() capable of loading thousands of documents and still performing well with a subsequent document() to process the asset? Is xsl:key the right way to search efficiently?
Further hints are highly appreciated. No specific stylesheet is needed; I will write it on my own as soon as I know which way to go.
EDIT: collection() already returns a sequence of document nodes, so document() might be unnecessary.
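A rough sketch of the first attempt without document(), working directly on the nodes returned by collection(). The parameter name assets-uri, its default value and the template name main are assumptions; in particular, the ?select=*.xml directory pattern is Saxon-specific, and other processors resolve collection URIs differently.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical parameter: a collection URI the processor can resolve
       (the ?select= pattern is Saxon-specific) -->
  <xsl:param name="assets-uri" select="'assets?select=*.xml'"/>

  <!-- Named entry point, e.g. invoked with -it:main in Saxon -->
  <xsl:template name="main">
    <!-- Each item in this sequence is already a document node -->
    <xsl:variable name="docs" select="collection($assets-uri)"/>
    <!-- All "{id}|{version}" strings of the assets that actually exist -->
    <xsl:variable name="paths"
                  select="$docs/asset/concat(@id, '|', @version)"/>
    <!-- Report every reference whose path matches none of them -->
    <xsl:for-each select="$docs//ref-asset[not(@path = $paths)]">
      <xsl:value-of select="concat(document-uri(root(.)),
                                   ': missing ', @path, '&#10;')"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

This version has no explicit index: each ref-asset is compared against the whole list of paths, so how well it scales depends on what the processor's optimizer makes of the comparison.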
Questions about performance are always product-dependent, so it would be easier to answer if the question were Saxon-specific.
I have often used the collection() function in Saxon to process thousands of input documents, and yes, it is quite capable of doing this. In Saxon-EE, collection() is multi-threaded so you can be parsing multiple documents in parallel on a multi-core machine.
Indexing is a bit tricky because the key() function can only search one document. We studied a very similar problem during the performance workshop at the Oxford XML Summer School a couple of weeks ago, and solved the problem (getting a ten-fold speed-up) by using the new XSLT 3.0 feature of maps. Something like this:
<!-- Build the index once: "{id}|{version}" maps to the asset element -->
<xsl:variable name="index" as="map(xs:string, element(asset))">
  <xsl:map>
    <xsl:for-each select="collection('....')/asset">
      <xsl:map-entry key="@id || '|' || @version"
                     select="."/>
    </xsl:for-each>
  </xsl:map>
</xsl:variable>

<!-- Look up the target of each reference by its path attribute -->
<xsl:template match="ref-asset">
  <xsl:variable name="asset" select="$index(@path)"/>
  ....
</xsl:template>
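One way the fragment above might be completed into a stand-alone checker is sketched below. It is only an illustration under assumptions: the assets-uri parameter and its Saxon-style ?select=*.xml value are hypothetical, the report format is made up, and it needs an XSLT 3.0 processor.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    exclude-result-prefixes="xs map">

  <xsl:output method="text"/>

  <!-- Hypothetical parameter: collection URI of the asset files -->
  <xsl:param name="assets-uri" as="xs:string"
             select="'assets?select=*.xml'"/>

  <!-- Load the asset documents once and reuse them below -->
  <xsl:variable name="docs" select="collection($assets-uri)"/>

  <!-- Same index as in the fragment: "{id}|{version}" maps to the asset -->
  <xsl:variable name="index" as="map(xs:string, element(asset))">
    <xsl:map>
      <xsl:for-each select="$docs/asset">
        <xsl:map-entry key="@id || '|' || @version" select="."/>
      </xsl:for-each>
    </xsl:map>
  </xsl:variable>

  <!-- Default XSLT 3.0 entry point, e.g. invoked with -it in Saxon -->
  <xsl:template name="xsl:initial-template">
    <!-- List every reference whose path has no entry in the index -->
    <xsl:for-each
        select="$docs//ref-asset[not(map:contains($index, string(@path)))]">
      <xsl:text>Missing target: </xsl:text>
      <xsl:value-of select="@path"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

As far as I know, xsl:map raises a dynamic error if two entries have the same key, so two files declaring the same id and version would make the index construction fail rather than silently overwrite one asset.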