
Check/Resolve cross-references in separate xml files


Starting point

Let's say we have a book in XML format. The book consists of many assets, and these can reference each other through a ref-asset element with a path attribute. The path has the form {id}|{version} of the target asset.

Important: Assets are single files and there is no merged, complete file.

Example XML (merged here for readability)

<book>
    <!-- file a.xml -->
    <asset id="1" version="1.0">
        <name>Prolog</name>
    </asset>
    <!-- file b.xml -->
    <asset id="2" version="2">
        <name>Table of content</name>
        <list>
            <item><ref-asset path="1|1.0">Prolog</ref-asset></item>
            <item><ref-asset path="2|2.0">Table of content</ref-asset></item>
            <item><ref-asset path="3|1.1">FooBar</ref-asset></item>
        </list>
    </asset>
    <!-- file c.xml -->
    <asset id="3" version="1.1">
        <name>FooBar</name>
    </asset>
</book>

Request

  • Check for every ref-asset whether the linked target is in the book.
  • Create a report of the results [exists, does not exist, asset exists but with the wrong version, ...]
  • [In addition: replace the reference with the content of the target.]

Settings

  • Saxon 9.6.x EE XSLT 2.0
  • Java
  • From 100 up to several thousand single documents (combined file size: high hundreds of MB)

How to solve

First attempt: collection() + document()

Find all single asset files on the file system via collection(), load them into the process via document(), and search them for matching targets.
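
A minimal XSLT 2.0 sketch of this first attempt, under a few assumptions: the directory in assets-uri is hypothetical, the ?select=*.xml query part is a Saxon-specific extension to collection URIs, and the report element names (report, ref, status) are made up for illustration. Because collection() already returns document nodes, no document() call appears here.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical location; "?select=*.xml" is a Saxon extension to collection URIs -->
  <xsl:param name="assets-uri" select="'file:///path/to/assets?select=*.xml'"/>

  <!-- One asset element per file; collection() returns the document nodes -->
  <xsl:variable name="assets" select="collection($assets-uri)/asset"/>

  <xsl:template name="main">
    <report>
      <xsl:for-each select="$assets//ref-asset">
        <!-- Split the path "{id}|{version}" into its two parts -->
        <xsl:variable name="id" select="substring-before(@path, '|')"/>
        <xsl:variable name="version" select="substring-after(@path, '|')"/>
        <!-- Naive scan over all loaded assets for every reference -->
        <xsl:variable name="same-id" select="$assets[@id = $id]"/>
        <ref path="{@path}" source="{ancestor::asset/@id}">
          <xsl:attribute name="status">
            <xsl:choose>
              <xsl:when test="$same-id[@version = $version]">exists</xsl:when>
              <xsl:when test="$same-id">wrong-version</xsl:when>
              <xsl:otherwise>missing</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
        </ref>
      </xsl:for-each>
    </report>
  </xsl:template>

</xsl:stylesheet>
```

The stylesheet has no source document, so it would be started at the named template main (with Saxon on the command line, the -it:main option). Versions are compared as strings, so the reference 2|2.0 in the example above does not match version="2" and would be reported as wrong-version. The lookup is a plain scan per reference, so its cost grows with the number of assets times the number of references unless the processor optimizes it.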

Second attempt: merged, complete file

Merge all single assets into one book and match via xsl:key or similar techniques.
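
A sketch of this second attempt in XSLT 2.0, assuming the merge happens in memory rather than on disk: all assets are copied into a temporary tree, and xsl:key indexes that single tree. The key names, the assets-uri value and the report vocabulary are illustrative assumptions, not part of the original setup.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical location; "?select=*.xml" is a Saxon extension to collection URIs -->
  <xsl:param name="assets-uri" select="'file:///path/to/assets?select=*.xml'"/>

  <!-- Temporary tree: a single in-memory book holding a copy of every asset -->
  <xsl:variable name="book">
    <book>
      <xsl:copy-of select="collection($assets-uri)/asset"/>
    </book>
  </xsl:variable>

  <!-- Exact match on "{id}|{version}", plus a fallback index on the id alone -->
  <xsl:key name="asset-by-path" match="asset" use="concat(@id, '|', @version)"/>
  <xsl:key name="asset-by-id" match="asset" use="@id"/>

  <xsl:template name="main">
    <report>
      <xsl:apply-templates select="$book//ref-asset"/>
    </report>
  </xsl:template>

  <xsl:template match="ref-asset">
    <ref path="{@path}">
      <xsl:attribute name="status">
        <xsl:choose>
          <!-- The third argument restricts the key lookup to the merged tree -->
          <xsl:when test="key('asset-by-path', @path, $book)">exists</xsl:when>
          <xsl:when test="key('asset-by-id', substring-before(@path, '|'), $book)">wrong-version</xsl:when>
          <xsl:otherwise>missing</xsl:otherwise>
        </xsl:choose>
      </xsl:attribute>
    </ref>
  </xsl:template>

</xsl:stylesheet>
```

The trade-off in this sketch is memory: xsl:copy-of duplicates every asset into the temporary tree, which may matter with several hundred MB of input. In exchange, each reference is resolved through a key lookup instead of a scan.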


Question(s)

  • Is collection() capable of loading thousands of documents while still performing well when followed by document() to process each asset?
  • How can documents loaded at run time be "indexed" [still via xsl:key?] so that they can be searched efficiently?

Further hints are highly appreciated. No specific stylesheet is needed; I will write it myself as soon as I know which way to go.


Edit: collection() already returns a sequence of document nodes, so document() might be unnecessary.


Solution

  • Questions about performance are always product-dependent, so it would be easier to answer if the question were Saxon-specific.

    I have often used the collection() function in Saxon to process thousands of input documents, and yes, it is quite capable of doing this. In Saxon-EE, collection() is multi-threaded so you can be parsing multiple documents in parallel on a multi-core machine.

    Indexing is a bit tricky because the key() function can only search one document. We studied a very similar problem during the performance workshop at the Oxford XML Summer School a couple of weeks ago, and solved the problem (getting a ten-fold speed-up) by using the new XSLT 3.0 feature of maps. Something like this:

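    <!-- Requires XSLT 3.0 (xsl:map, xsl:map-entry, the || operator);
         the xs prefix must be bound to the XML Schema namespace -->
    <!-- Build the index once: key is "{id}|{version}", value is the asset element -->
    <xsl:variable name="index" as="map(xs:string, element(asset))">
      <xsl:map>
        <xsl:for-each select="collection('....')/asset">
          <xsl:map-entry key="@id || '|' || @version"
                         select="."/>
        </xsl:for-each>
      </xsl:map>
    </xsl:variable>
    
    <!-- Resolve each reference by calling the map with the path as key -->
    <xsl:template match="ref-asset">
      <xsl:variable name="asset" select="$index(@path)"/>
      ....
    </xsl:template>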
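
One possible way to extend the map approach to the requested report is sketched below. It is an illustration under assumptions, not the workshop solution: the second map (versions-by-id), the assets-uri value and the report vocabulary are invented here, and map:contains is taken from the standard XPath map function namespace.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    exclude-result-prefixes="xs map">

  <!-- Hypothetical location; "?select=*.xml" is a Saxon extension to collection URIs -->
  <xsl:param name="assets-uri" as="xs:string"
             select="'file:///path/to/assets?select=*.xml'"/>

  <xsl:variable name="assets" as="element(asset)*"
                select="collection($assets-uri)/asset"/>

  <!-- Exact index: "{id}|{version}" to the asset element -->
  <xsl:variable name="index" as="map(xs:string, element(asset))">
    <xsl:map>
      <xsl:for-each select="$assets">
        <xsl:map-entry key="@id || '|' || @version" select="."/>
      </xsl:for-each>
    </xsl:map>
  </xsl:variable>

  <!-- Fallback index: id to all versions found; grouping avoids duplicate keys -->
  <xsl:variable name="versions-by-id" as="map(xs:string, xs:string*)">
    <xsl:map>
      <xsl:for-each-group select="$assets" group-by="string(@id)">
        <xsl:map-entry key="current-grouping-key()"
                       select="current-group()/string(@version)"/>
      </xsl:for-each-group>
    </xsl:map>
  </xsl:variable>

  <xsl:template name="main">
    <report>
      <xsl:apply-templates select="$assets//ref-asset"/>
    </report>
  </xsl:template>

  <xsl:template match="ref-asset">
    <xsl:variable name="id" select="substring-before(@path, '|')"/>
    <ref path="{@path}"
         status="{ if (map:contains($index, string(@path))) then 'exists'
                   else if (map:contains($versions-by-id, $id)) then 'wrong-version'
                   else 'missing' }"/>
  </xsl:template>

</xsl:stylesheet>
```

In this sketch, xsl:map raises a dynamic error if two files declare the same id and version, which doubles as a duplicate check. For the optional replacement step, the template could copy $index(@path)/node() into the output instead of emitting a ref element.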