In XSLT 1.0, a common question in forums was how to convert flat HTML into hierarchical XML, which many times boiled down to nesting text in between <br />
tags in <p>
tags.
I have a similar problem, which I think I've partially solved using XSLT 2.0, but it's a new approach to me and I'd like to get a second opinion.
The XHTML source has <span class="pageStart"></span>
scattered throughout. They can appear in several different parent nodes. I want to wrap all the nodes between one page start marker and the next in an <page>
node. The solution I currently have is:
<xsl:template match="*[child::span[@class='pageStart']]">
<xsl:copy>
<xsl:copy-of select="@*" />
<xsl:for-each-group select="node()"
group-starting-with="span[@class='pageStart']">
<page>
<xsl:apply-templates select="current-group()"/>
</page>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
There's at least one flaw with this -- the parent node of the marker gets a <page>
as a child node when I don't want it. In other works, if there's a <div>
that has a child page marker anywhere in it, an <page>
node is created as an immediate child of <div>
in addition to the locations I expect.
I had hoped that I could simply make the template rule be <xsl:template match="span[@class='pageStart']">
but current-group() seems to be empty no matter what I try. The common sense approach I tried was <xsl:for-each-group select="node()" group-starting-with="span[@class='pageStart']">
.
Is there an easier way to solve this problem that I'm missing?
EDIT
Here's an example of the input:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head></head>
<body>
<span class="pageStart"/>
<p>...</p>
<div>...</div>
<img />
<p></p>
<span class="pageStart"/>
<div>...</div>
<span class="pageStart"/>
<p>...</p>
<div>
<span class="pageStart"/>
<p>...</p>
<p>...</p>
<span class="pageStart"/>
<div>...</div>
<img/>
</div>
</body>
</html>
I assume the last two nested pages make this problem more difficult, so I'd be perfectly happy getting this as the output, or something close:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head></head>
<body>
<page>
<span class="pageStart"/>
<p>...</p>
<div>...</div>
<img />
<p></p>
</page>
<page>
<span class="pageStart"/>
<div>...</div>
</page>
<page>
<span class="pageStart"/>
<p>...</p>
<div>
<page>
<span class="pageStart"/>
<p>...</p>
<p>...</p>
</page>
<page>
<span class="pageStart"/>
<div>...</div>
<img/>
</page>
</div>
</page>
</body>
</html>
This transformation:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[span/@class='pageStart']">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:for-each-group select="node()"
group-starting-with="span[@class='pageStart']">
<page>
<xsl:apply-templates select="current-group()"/>
</page>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
when applied on the provided XML document:
<html>
<head></head>
<body>
<span class="pageStart"/>
<p>...</p>
<div>...</div>
<img />
<p></p>
<span class="pageStart"/>
<div>...</div>
<span class="pageStart"/>
<p>...</p>
<div>
<span class="pageStart"/>
<p>...</p>
<p>...</p>
<span class="pageStart"/>
<div>...</div>
<img/>
</div>
</body>
</html>
produces the wanted, correct result:
<html>
<head/>
<body>
<page>
<span class="pageStart"/>
<p>...</p>
<div>...</div>
<img/>
<p/>
</page>
<page>
<span class="pageStart"/>
<div>...</div>
</page>
<page>
<span class="pageStart"/>
<p>...</p>
<div>
<page>
<span class="pageStart"/>
<p>...</p>
<p>...</p>
</page>
<page>
<span class="pageStart"/>
<div>...</div>
<img/>
</page>
</div>
</page>
</body>
</html>