Tags: xslt, xslt-2.0, xslt-grouping

XSLT to split text data into groups of multiple lines


I am trying to write an XSLT stylesheet that splits multi-line text data and produces XML in which the lines are grouped into chunks of a fixed number of lines.

For example, if my input XML is like this:

<?xml version="1.0" encoding="UTF-8"?>
<csv>
    <data>Id,Name,Address,Location,Extid,contact
          1,raagu1,hosakote1,bangalore1,123,contact1
          2,raagu2,hosakote2,bangalore2,123,contact2
          3,raagu3,hosakote3,bangalore3,123,contact3
          4,raag4,hosakote4,bangalore4,123,contact4
          5,raagu5,hosakote5,bangalore5,123,contact5
          6,raagu6,hosakote6,bangalore6,123,contact6
          7,raagu7,hosakote7,bangalore7,123,contact7
    </data>
</csv>

Here the text under the data element consists of a header line (Id,Name,Address,Location,Extid,contact) followed by data lines whose fields correspond to the header fields.

If the fixed group size is 4, i.e. groups of 4 data lines, then my output XML should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<csv>
    <data>
        Id,Name,Address,Location,Extid,contact
        1,raagu1,hosakote1,bangalore1,123,contact1
        2,raagu2,hosakote2,bangalore2,123,contact2
        3,raagu3,hosakote3,bangalore3,123,contact3
        4,raag4,hosakote4,bangalore4,123,contact4
    </data>
    <data>
        Id,Name,Address,Location,Extid,contact
        5,raagu5,hosakote5,bangalore5,123,contact5
        6,raagu6,hosakote6,bangalore6,123,contact6
        7,raagu7,hosakote7,bangalore7,123,contact7
    </data>
</csv>

To achieve this, I looked into XSLT and tried the following stylesheet:

 <xsl:stylesheet version = "2.0" xmlns:xsl = "http://www.w3.org/1999/XSL/Transform">

<xsl:output indent="yes" method="xml" encoding="UTF-8"/>

<xsl:template match = "/csv/data">

    <xsl:variable name="header" select="substring-before(.,'&#10;')"/>
    <xsl:variable name="data" select="substring-after(.,'&#10;')"/>

    <csv>

        <xsl:for-each select = "tokenize($data, '\n')">

            <xsl:variable name="count" select="position()"/>

            <data>
                <xsl:value-of select="$header"/>
                <xsl:text>&#10;</xsl:text>
                <xsl:sequence select = "."/>
            </data>

        </xsl:for-each>

    </csv>

</xsl:template>
</xsl:stylesheet>

With this, the output I got was

<?xml version="1.0" encoding="UTF-8"?>
<csv>
<data>
    Id,Name,Address,Location,Extid,contact
    1,raagu1,hosakote1,bangalore1,123,contact1
</data>
<data>
    Id,Name,Address,Location,Extid,contact
    2,raagu2,hosakote2,bangalore2,123,contact2
</data>
<data>
    Id,Name,Address,Location,Extid,contact
    3,raagu3,hosakote3,bangalore3,123,contact3
</data>
<data>
    Id,Name,Address,Location,Extid,contact
    4,raag4,hosakote4,bangalore4,123,contact4
</data>
<data>
    Id,Name,Address,Location,Extid,contact
    5,raagu5,hosakote5,bangalore5,123,contact5
</data>
<data>
    Id,Name,Address,Location,Extid,contact
    6,raagu6,hosakote6,bangalore6,123,contact6
</data>
<data>
    Id,Name,Address,Location,Extid,contact
    7,raagu7,hosakote7,bangalore7,123,contact7
</data>
</csv>

I could not get it right: it creates a separate group for every line. I think I am missing something to do with concatenation. Are there any functions in XSLT that can split the text into groups of multiple lines and create a single element for each group, with good performance? I am OK with XSLT 2.0 functions. The code should work even for 100,000+ data lines.

Thanks

Raagu


Solution

  • Do you really want a result format that still contains comma-separated and line-separated data? I would consider cleaning up the data and marking it up properly with XML.

    But as for the grouping, here is an example:

    <xsl:stylesheet version = "2.0" xmlns:xsl = "http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      exclude-result-prefixes="xs">
    
    <xsl:param name="chunk-size" select="4" as="xs:integer"/>
    
    <xsl:output indent="yes" method="xml" encoding="UTF-8"/>
    
    <xsl:template match = "/csv/data">
    
        <xsl:variable name="header" select="substring-before(.,'&#10;')"/>
        <xsl:variable name="data" select="substring-after(.,'&#10;')"/>
    
        <csv>
    
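            <!-- Key = (position - 1) idiv chunk size: with the default of 4, lines 1 to 4 get key 0, lines 5 to 8 get key 1, and so on.
                 The [normalize-space()] predicate drops blank lines, such as the whitespace before </data>. -->
            <xsl:for-each-group select = "tokenize($data, '\n')[normalize-space()]" group-adjacent="(position() - 1) idiv $chunk-size">
    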
                <data>
                    <xsl:value-of select="$header"/>
                    <xsl:text>&#10;</xsl:text>
                    <xsl:value-of select = "current-group()" separator="&#10;"/>
                </data>
    
            </xsl:for-each-group>
    
        </csv>
    
    </xsl:template>
    </xsl:stylesheet>
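    
    As for marking the data up properly, the following is only a minimal sketch of what that could look like, not a definitive implementation. It uses the same positional grouping but turns every data line into an element with one child per header field. The element names group and row are made up for illustration. The sketch assumes that every header field is a valid XML element name and that no field contains a comma or a quoted value.
    
    <xsl:stylesheet version = "2.0" xmlns:xsl = "http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      exclude-result-prefixes="xs">
    
    <xsl:param name="chunk-size" select="4" as="xs:integer"/>
    
    <xsl:output indent="yes" method="xml" encoding="UTF-8"/>
    
    <!-- Ignore the whitespace-only text nodes around the data element -->
    <xsl:strip-space elements="csv"/>
    
    <xsl:template match = "/csv/data">
    
        <!-- All non-blank lines, with leading and trailing whitespace trimmed; the first one is the header -->
        <xsl:variable name="lines" as="xs:string*"
            select="for $l in tokenize(., '\n')[normalize-space()] return replace($l, '^\s+|\s+$', '')"/>
    
        <!-- Assumption: each header field can be used as an element name as-is -->
        <xsl:variable name="names" as="xs:string*" select="tokenize($lines[1], ',')"/>
    
        <csv>
    
            <xsl:for-each-group select = "subsequence($lines, 2)" group-adjacent="(position() - 1) idiv $chunk-size">
    
                <!-- "group" and "row" are hypothetical names chosen for this sketch -->
                <group>
                    <xsl:for-each select = "current-group()">
                        <xsl:variable name="fields" as="xs:string*" select="tokenize(., ',')"/>
                        <row>
                            <!-- One child element per header field, filled with the field at the same position -->
                            <xsl:for-each select = "$names">
                                <xsl:variable name="i" as="xs:integer" select="position()"/>
                                <xsl:element name="{.}">
                                    <xsl:value-of select="$fields[$i]"/>
                                </xsl:element>
                            </xsl:for-each>
                        </row>
                    </xsl:for-each>
                </group>
    
            </xsl:for-each-group>
    
        </csv>
    
    </xsl:template>
    </xsl:stylesheet>
    
    With the sample input and the default chunk-size of 4, this should give two group elements, the first with four row children and the second with three, where each row has Id, Name, Address, Location, Extid and contact children. Because chunk-size is a global parameter, it can be overridden when the transformation is invoked rather than by editing the stylesheet.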