
How do I remove trailing emdashes in XSLT?


I am working on an XSLT 2.0 stylesheet that converts XML to text, and I am using the replace function. I want to drop an em dash (—) from the output if it appears at the very end of an element's content.

For example,

<abc> Hello World —</abc>

should be output as

Hello World

But if the em dash appears anywhere else, it should be retained, e.g.

<abc> Hello —World </abc>

should be output as

Hello —World

What I have tried:

<xsl:template match="text()">
<xsl:value-of select="replace(.,'—\s*&lt;','')"/>
</xsl:template>

but it didn't work.

So basically the '—\s*&lt;' pattern is not matching. I read it as an em dash, followed by any number of whitespace characters, followed by the opening < of a tag, but I think I am wrong somewhere.

Any inputs would be really helpful.


Solution

  • You did not provide enough information to diagnose the problem, but I think I can guess. Your problem is that you misunderstood how an XSLT processor "sees" an XML document.

    XML Trees

    When you provide a source XML file to your XSLT processor, it is first parsed by an XML parser (which is fairly independent of your XSLT processor). The parser fulfils a range of different tasks (for example, it normalizes line endings), but the most important one is that it constructs an abstract model of the source XML, a so-called tree. In XSLT 2.0, this tree model is the XQuery and XPath Data Model (XDM). So, by the time the XSLT processor gets to see the XML document, it is an abstract tree and no longer a sequence of characters with markup in it.

    This is relevant because the tree consists of nodes, and the < and > characters that delimit tags in the serialized file are not part of any node's content. The text node inside <abc> Hello World —</abc> holds just " Hello World —"; the "<" of the closing tag is not in it. That is why your regular expression, which expects a "<" after the dash, never matches.
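
    To make this concrete, here is a small sketch of a stylesheet that prints each text node's value in square brackets, together with the result of checking whether it contains a "<" character. It assumes an XSLT 2.0 processor and the sample input from the question; the bracketed output format is made up purely for illustration.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="2.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text" encoding="UTF-8"/>

      <!-- Show what a text node actually holds: only character data,
           never the markup that surrounded it in the source file. -->
      <xsl:template match="text()">
        <xsl:value-of
            select="concat('[', ., '] contains &lt;: ', contains(., '&lt;'))"/>
      </xsl:template>
    </xsl:stylesheet>
    ```

    For <abc> Hello World —</abc> this should report that the text node does not contain "<", which is why a pattern ending in &lt; has nothing to match against.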

    How, then, to find dashes at the end of a string?

    The template you mention matches text nodes:

    <xsl:template match="text()">
    

    To find dashes that are at the end of a string, use:

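    <xsl:value-of select="replace(.,'—\s*$','')"/>
    

    This replaces an em dash, followed by zero or more whitespace characters, followed by the end of the string ($), with an empty string. Note that not only the em dash is removed: any whitespace after it is removed as well. Whitespace in front of the dash is left alone; if you want that gone too, start the pattern with \s*. An em dash anywhere else in the text node is not followed by the end of the string, so it is retained.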
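
    Put together, a minimal complete stylesheet might look like the sketch below. It assumes an XSLT 2.0 processor and uses the em dash (U+2014) from the question, written as the character reference &#x2014; so that the stylesheet does not depend on the file's encoding. That spelling is my choice, not something the question requires.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="2.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text" encoding="UTF-8"/>

      <!-- Strip an em dash (U+2014) plus any whitespace after it,
           but only when it sits at the very end of the text node. -->
      <xsl:template match="text()">
        <xsl:value-of select="replace(., '&#x2014;\s*$', '')"/>
      </xsl:template>
    </xsl:stylesheet>
    ```

    With this template, the em dash in <abc> Hello —World </abc> is kept, because it is followed by "World " and not by the end of the string. The built-in template rules take care of walking down from the document node to the text nodes, so no other templates are needed for the sample input.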


    It might help to use an external service to test your regexes before using them in XSLT.