I want to extract short lemmas from a text for some explanatory notes: if the text is too long, only its first and last words should be output. This works for the following input:
<?xml version="1.0" encoding="UTF-8"?>
<lemma>
<a><b>I</b> can what I can and <b><c>what</c></b> I can't I can</a>
</lemma>
When this XSLT is applied:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
version="2.0">
<xsl:output method="xml" encoding="utf-8" indent="yes"/>
<!-- Identity template : copy all text nodes, elements and attributes -->
<xsl:template match="@*|node()">
<xsl:copy copy-namespaces="no">
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="lemma">
<xsl:value-of select="."/>
<xsl:choose>
<xsl:when test="string-length(normalize-space(a)) > 20">
<xsl:value-of select="tokenize(a,' ')[1]"/>
<xsl:text> […] </xsl:text>
<xsl:value-of select="tokenize(a,' ')[last()]"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="a"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
it produces the desired output:
I can what I can and what I can't I can
I […] can
Unfortunately, whenever two child elements are immediately adjacent, the space between them is encoded as a child element named "space". The above solution doesn't work with:
<lemma>
<a><b>I</b><space/><b>can</b> what I can and what I can't I can</a>
</lemma>
I tried to have the space element processed first so that it becomes a single space character, but that doesn't work (and I know why); I just don't know how to do it better. I suppose it would work with two XSLT runs.
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
version="2.0">
<xsl:output method="xml" encoding="utf-8" indent="yes"/>
<!-- Identity template : copy all text nodes, elements and attributes -->
<xsl:template match="@*|node()">
<xsl:copy copy-namespaces="no">
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="space">
 
</xsl:template>
<xsl:template match="lemma">
<xsl:apply-templates select="space"/>
<xsl:value-of select="."/>
<xsl:choose>
<xsl:when test="string-length(normalize-space(a)) > 20">
<xsl:value-of select="tokenize(a,' ')[1]"/>
<xsl:text> […] </xsl:text>
<xsl:value-of select="tokenize(a,' ')[last()]"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="a"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
Output:
Ican what I can and what I can't I can
Ican […] can
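You could use xsl:apply-templates to process a and save the result in a variable. In your attempt, xsl:value-of and tokenize() read the string value of the original a element, so the template for space is never applied to it; working on the variable instead lets that template run first, within a single XSLT run.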
XML Input
<doc>
  <lemma>
    <a><b>I</b> can what I can and <b><c>what</c></b> I can't I can</a>
  </lemma>
  <lemma>
    <a><b>I</b><space/><b>can</b> what I can and what I can't I can</a>
  </lemma>
</doc>
XSLT 2.0
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity template: copy everything that has no more specific rule -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Turn each space element into a single space character -->
  <xsl:template match="space">
    <xsl:text> </xsl:text>
  </xsl:template>

  <xsl:template match="lemma">
    <!-- Process a first, so that space elements are already replaced -->
    <xsl:variable name="a">
      <xsl:apply-templates select="a"/>
    </xsl:variable>
    <xsl:variable name="norm" select="normalize-space($a)"/>
    <xsl:variable name="tokens" select="tokenize($norm,'\s')"/>
    <xsl:copy>
      <!-- Full text -->
      <result>
        <xsl:value-of select="$norm"/>
      </result>
      <!-- Shortened text: first and last word only if longer than 20 characters -->
      <result>
        <xsl:value-of select="
          if (string-length($norm) > 20) then
            concat($tokens[1],' […] ', $tokens[last()])
          else $norm"/>
      </result>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
XML Output
<doc>
  <lemma>
    <result>I can what I can and what I can't I can</result>
    <result>I […] can</result>
  </lemma>
  <lemma>
    <result>I can what I can and what I can't I can</result>
    <result>I […] can</result>
  </lemma>
</doc>
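If you would rather not build a temporary tree, the string can probably also be assembled directly in XPath 2.0. The following is only a sketch of that variant, not something the original setup requires: it assumes that every space element stands for exactly one blank and that only text nodes and space elements below a contribute to the lemma. It walks those nodes in document order, substitutes a blank for each space element, and joins the pieces with string-join().

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity template for everything outside lemma -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="lemma">
    <!-- Text nodes keep their value, each space element becomes ' '
         (assumption: one space element means one blank) -->
    <xsl:variable name="norm" select="normalize-space(string-join(
        (for $n in a//node()[self::text() or self::space]
         return if ($n/self::space) then ' ' else string($n)),
        ''))"/>
    <xsl:variable name="tokens" select="tokenize($norm, ' ')"/>
    <xsl:copy>
      <result>
        <xsl:value-of select="
          if (string-length($norm) > 20) then
            concat($tokens[1], ' […] ', $tokens[last()])
          else $norm"/>
      </result>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

With the XML input above, this should give the shortened form "I […] can" for both lemmas. The variable approach is easier to extend if more elements need special handling later, since each one just gets its own template.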