XSLT streaming with xsl:iterate correct way


I wanted to process a 161 MB database, but Saxon 9 HE ran out of memory on Java at 300 MB of RAM and on .NET at 1700 MB, so I need to use streaming. I am trying the XMLSpy demo for this, but I still don't understand the child/parent logic of the XPath expressions. I am on Windows XP SP3 32-bit with 4 GB of RAM.

    <xsl:iterate select="db_entry">
        <xsl:apply-templates select="db_entry"/>
    </xsl:iterate>

What is the correct way to stream this with xsl:iterate, or is xsl:for-each sufficient? There are nearly 60,000 entries in this database. In other words, how do I write this correctly, given that selecting db_entry within a db_entry does not work?

EDIT:

<xsl:template match="databank_export">
<xsl:iterate select="db_entry">
    <xsl:apply-templates select="public_data"/>
    <xsl:text> |</xsl:text>
    <xsl:apply-templates select="text_data"/>
    <xsl:text> |</xsl:text>
    <xsl:apply-templates select="research_data"/>
    <xsl:text>&#10;</xsl:text>
</xsl:iterate>
</xsl:template>

I replaced the db_entry xsl:template with xsl:iterate, but XMLSpy still can't load the big file, so it appears that streaming doesn't work. Am I doing it right, or is this a limitation of the program or of the demo?

2nd EDIT: Here is my entire XSLT code:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:math="http://www.w3.org/2005/xpath-functions/math" xmlns:array="http://www.w3.org/2005/xpath-functions/array" xmlns:map="http://www.w3.org/2005/xpath-functions/map" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:err="http://www.w3.org/2005/xqt-errors" exclude-result-prefixes="array fn map math xhtml xs err" version="3.0">
    <xsl:output method="text" encoding="UTF-8" indent="yes"/>
    <xsl:mode streamable="yes"/>
    <!--
    <xsl:template match="databank_export">
    -->
    <xsl:template match="/">
        <xsl:apply-templates select="databank_export/copy-of(db_entry)" mode="entry"/>
    </xsl:template>
    <xsl:template match="db_entry" mode="entry">
        <xsl:value-of select="public_data, text_data, research_data" separator=" |"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
    <xsl:template match="public_data">
        <xsl:value-of select="sflname"/>
        <xsl:text>; </xsl:text>
        <xsl:apply-templates select="bdata"/>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="gender"/>
        <xsl:text>; PHOTO : |</xsl:text>
        <xsl:value-of select="name, gender, rating, datatype/@sdatatype, datatype/@sdatasource, bdata/sbdate, bdata/sbdate/@ccalendar" separator=" - "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="bdata/sbtime, bdata/sbtime/@sbtime_ampm, bdata/sbtime/@ctimetype, bdata/sbtime/@stimetype, bdata/sbtime/@stmerid, bdata/sbtime/@ctzauto, bdata/sbtime/@jd_ut, bdata/sbtime/@sznabbr, bdata/sbtime/@time_unknown, bdata/sbtime/@itimeaac, bdata/sbtime/@stimeaac" separator=" "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="bdata/place, bdata/country, bdata/country/@sctr" separator=", "/>
        <xsl:text>, </xsl:text>
        <xsl:value-of select="bdata/place/@slati, bdata/place/@slong" separator=" "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="scollector, seditor, biographer" separator=" "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="screationdate, slasteditdate" separator=" "/>
    </xsl:template>
    <xsl:template match="bdata">
        <xsl:value-of select="sbdate/@iday, sbdate/@imonth, sbdate/@iyear" separator="."/>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="sbtime"/>
        <xsl:text>; </xsl:text>
        <xsl:analyze-string select="sbtime/@stmerid" regex="([hm]{{1}})([0-9]{{1,2}})([ew]{{1}})([0-9]{{0,2}})">
            <xsl:matching-substring>
                <xsl:choose>
                    <xsl:when test="regex-group(3) = 'e'">
                        <xsl:text>+</xsl:text>
                    </xsl:when>
                    <xsl:when test="regex-group(3) = 'w'">
                        <xsl:text>-</xsl:text>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:text>+</xsl:text>
                    </xsl:otherwise>
                </xsl:choose>
                <xsl:choose>
                    <xsl:when test="regex-group(1) = 'h'">
                        <xsl:number value="regex-group(2)" format="01"/>
                    </xsl:when>
                    <xsl:when test="regex-group(1) = 'm'">
                        <xsl:text>00:</xsl:text>
                        <xsl:number value="regex-group(2)" format="01"/>
                        <xsl:text>:</xsl:text>
                        <xsl:number value="regex-group(4)" format="01"/>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:text>+1</xsl:text>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:matching-substring>
        </xsl:analyze-string>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="place, country" separator=","/>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="place/@slati, place/@slong" separator="; "/>
    </xsl:template>
    <xsl:template match="text_data">
        <xsl:value-of select="shortbiography, wikipedia_link, db_link, sourcenotes" separator="|"/>
    </xsl:template>
    <xsl:template match="research_data">
        <xsl:apply-templates select="categories"/>
        <xsl:text>|</xsl:text>
        <xsl:apply-templates select="relationships"/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="events/@count"/>
        <xsl:text>|</xsl:text>
        <xsl:apply-templates select="events"/>
    </xsl:template>
    <xsl:template match="categories">
        <xsl:iterate select="category">
            <xsl:value-of select="@cat_id, @db_id, @catnotes" separator=" "/>
            <xsl:text> - </xsl:text>
            <xsl:value-of select="text()"/>
            <xsl:text> |</xsl:text>
        </xsl:iterate>
    </xsl:template>
    <xsl:template match="relationships">
        <xsl:iterate select="relationship">
            <xsl:value-of select="@rel_id, @rel_db_id, @db_id, @relcat" separator=" "/>
            <xsl:text> - </xsl:text>
            <xsl:value-of select="@relnotes, text()" separator=" - "/>
            <xsl:text> |</xsl:text>
        </xsl:iterate>
    </xsl:template>
    <xsl:template match="events">
        <xsl:iterate select="event">
            <xsl:value-of select="@sevcode, @evn_id, @db_id, @evnotes" separator=" "/>
            <xsl:text> |</xsl:text>
            <xsl:apply-templates select="event_data"/>
            <xsl:text> |</xsl:text>
        </xsl:iterate>
    </xsl:template>
    <xsl:template match="event">
        <xsl:apply-templates select="event_data"/>
    </xsl:template>
    <xsl:template match="event_data">
        <xsl:value-of select="sbdate, sbdate/@ccalendar, sbdate_dmy" separator=" "/>
    </xsl:template>

</xsl:stylesheet>

It works with a small sample file but not with the whole 161 MB file.

Best regards.


Solution

  • Martin has answered quite a lot of the questions, but let me add a few words.

    Your example code

    <xsl:iterate select="db_entry">
        <xsl:apply-templates select="db_entry"/>
    </xsl:iterate>
    

    seems to be a beginner's mistake: unless db_entry actually contains another db_entry element as a child, this should be

    <xsl:iterate select="db_entry">
        <xsl:apply-templates select="."/>
    </xsl:iterate>
    

    The difference between xsl:iterate and xsl:for-each is that with xsl:for-each, each item in the input sequence is processed independently of the others: there is no defined order of processing, and there is no way that the processing of one item can affect the way subsequent items are processed. With xsl:iterate, the items are processed in order, and (by using xsl:next-iteration) you can set variables/parameters when processing an item, which are available for use when processing the next item.

    This difference has nothing directly to do with streaming; however xsl:iterate was introduced because there were use cases (such as computing a running total on a bank account) that were very hard to make streamable without such a construct.
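
    To make the running-total case concrete, here is a minimal sketch. The account and transaction elements and the amount attribute are invented for illustration and are not part of your database. The parameter declared at the top of xsl:iterate carries the balance from one item to the next, and xsl:next-iteration supplies its new value:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:xs="http://www.w3.org/2001/XMLSchema"
                    exclude-result-prefixes="xs" version="3.0">
        <xsl:output method="text"/>
        <xsl:mode streamable="yes"/>
        <!-- Hypothetical input: <account><transaction amount="12.50"/>...</account> -->
        <xsl:template match="account">
            <xsl:iterate select="transaction">
                <!-- $balance holds the total carried over from the previous item -->
                <xsl:param name="balance" as="xs:decimal" select="0"/>
                <xsl:variable name="new-balance" as="xs:decimal"
                              select="$balance + xs:decimal(@amount)"/>
                <xsl:value-of select="$new-balance"/>
                <xsl:text>&#10;</xsl:text>
                <!-- Pass the updated total on to the next transaction -->
                <xsl:next-iteration>
                    <xsl:with-param name="balance" select="$new-balance"/>
                </xsl:next-iteration>
            </xsl:iterate>
        </xsl:template>
    </xsl:stylesheet>

    With xsl:for-each there would be no way to hand $new-balance to the next transaction, which is the whole point of the construct.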

    Your edited code:

    <xsl:iterate select="db_entry">
        <xsl:apply-templates select="public_data"/>
        <xsl:text> |</xsl:text>
        <xsl:apply-templates select="text_data"/>
        <xsl:text> |</xsl:text>
        <xsl:apply-templates select="research_data"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:iterate>
    

    could equally well be written using xsl:for-each, because the processing of an item doesn't depend in any way on the processing of previous items. Either way, however, it wouldn't satisfy the streaming rules, because you are making three "downward selections" within the iteration body, and you are only allowed one. The simplest workaround to this, as Martin has illustrated, is to make a copy of each db_entry (as a tree in memory) and then you can operate on this copy without any streaming constraints.
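
    A minimal sketch of that copy-based workaround, assuming each individual db_entry is small enough to hold in memory. The mode name in-memory is made up for this sketch; the point is that the templates processing the copy live in a mode that is not declared streamable, so they can navigate freely:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
        <xsl:output method="text" encoding="UTF-8"/>
        <!-- Only the unnamed mode is streamed -->
        <xsl:mode streamable="yes"/>
        <!-- Hypothetical mode name; not streamable, so no streaming constraints apply -->
        <xsl:mode name="in-memory"/>
        <xsl:template match="databank_export">
            <!-- copy-of(.) builds one db_entry at a time as an ordinary in-memory tree -->
            <xsl:for-each select="db_entry/copy-of(.)">
                <xsl:apply-templates select="public_data" mode="in-memory"/>
                <xsl:text> |</xsl:text>
                <xsl:apply-templates select="text_data" mode="in-memory"/>
                <xsl:text> |</xsl:text>
                <xsl:apply-templates select="research_data" mode="in-memory"/>
                <xsl:text>&#10;</xsl:text>
            </xsl:for-each>
        </xsl:template>
        <!-- The existing child templates would move into the same mode, for example: -->
        <xsl:template match="text_data" mode="in-memory">
            <xsl:value-of select="shortbiography, wikipedia_link, db_link, sourcenotes" separator="|"/>
        </xsl:template>
        <!-- public_data and research_data fall back to the built-in rules here,
             which simply output their text content -->
    </xsl:stylesheet>

    Memory use then depends on the size of the largest db_entry rather than on the size of the whole file.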

    Another workaround, if you know that the three child elements occur in the order you are processing them, is to replace:

        <xsl:apply-templates select="public_data"/>
        <xsl:text> |</xsl:text>
        <xsl:apply-templates select="text_data"/>
        <xsl:text> |</xsl:text>
        <xsl:apply-templates select="research_data"/>
        <xsl:text>&#10;</xsl:text>
    

    by

            <xsl:for-each select="*[
                self::public_data or self::text_data or self::research_data]">
              <xsl:if test="position() ne 1"> |</xsl:if>
              <xsl:apply-templates select="."/>
            </xsl:for-each>
            <xsl:text>&#10;</xsl:text>
    

    (Note the little trick of putting a vertical bar before every entry except the first, rather than putting it after every entry except the last. That's because when you're streaming, you don't know when you're about to reach the end. Little things like this become very important when you're trying to make your code streamable.)
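
    One caveat with this second workaround: apply-templates select="." hands the streamed node itself to the matching template, so that template must satisfy the streaming rules too. A template with several downward selections, such as your text_data template, would probably not. If each child subtree is small, one possible fix is to copy just that child and process the copy in a non-streaming mode. The following is a sketch under that assumption, and the mode name in-memory is made up:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
        <xsl:output method="text" encoding="UTF-8"/>
        <xsl:mode streamable="yes"/>
        <!-- Hypothetical non-streaming mode for the copied children -->
        <xsl:mode name="in-memory"/>
        <xsl:template match="db_entry">
            <xsl:for-each select="*[self::public_data or self::text_data or self::research_data]">
                <xsl:if test="position() ne 1"> |</xsl:if>
                <!-- copy-of(.) is the only operation that consumes the streamed child -->
                <xsl:apply-templates select="copy-of(.)" mode="in-memory"/>
            </xsl:for-each>
            <xsl:text>&#10;</xsl:text>
        </xsl:template>
        <!-- Free to select several children, because it works on the in-memory copy -->
        <xsl:template match="text_data" mode="in-memory">
            <xsl:value-of select="shortbiography, wikipedia_link, db_link, sourcenotes" separator="|"/>
        </xsl:template>
    </xsl:stylesheet>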

    As Martin says, Altova RaptorXML does not support streaming: you will need to use Saxon-EE for this.