
XSLT: how to 'translate' multi-byte Unicode characters


I understand (after some pain...) that the translate function will not handle multi-byte Unicode. I am looking for a way around this so that I can remove all accents from characters. As a sample, I have the following transform and its output:

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:variable name="RSEP" select="'&#10;'"/>  <!-- LF -->
  <xsl:template match="/">
    <xsl:variable name="testwords" select="'à wɔ́rɔ, yɛrɛ, wùri'"/>
    <xsl:value-of select="$testwords"/>
    <xsl:value-of select="$RSEP"/>
    <xsl:value-of select="translate($testwords,
      'àáèéɛ̀ɛ́ɔɔ̀ɔ́ìíòóuùú',
      'aaeeɛɛɔɔɔiioouuu')"/>
    <xsl:value-of select="$RSEP"/>
    <xsl:value-of select="normalize-unicode($testwords)"/>
    <xsl:value-of select="$RSEP"/>
    <xsl:value-of select="replace(normalize-unicode($testwords, 'NFKD'), '\P{IsBasicLatin}', '')"/>
    <xsl:value-of select="$RSEP"/>
  </xsl:template>
</xsl:stylesheet>

Output with xslt3:

à wɔ́rɔ, yɛrɛ, wùri
a wɔɔrɔ, yɛrɛ, wri
à wɔ́rɔ, yɛrɛ, wùri
a wr, yr, wuri

I realize the translate function is not expected to work. But normalize-unicode does not seem to change the string at all, and a 'replace' call sourced from elsewhere only seems to handle the standard Western European accented characters, not the multi-byte ones.

I have a feeling this may require some kind of regex, but I am just not sure how to go about that. Any help here appreciated.

Thanks!


Solution

  • You're really confusing matters by talking about "multi-byte" Unicode characters. The number of bytes occupied by a character is determined by the encoding (for example, in UTF-8, codepoints in the range 0-127 occupy one byte), but XSLT operations don't depend on the encoding in any way: XSLT sees a string only as a sequence of Unicode codepoints.

    What you are actually talking about here are what Unicode calls combining and modifier characters. There's a great description of these here:

    What is the difference between "combining characters" and "modifier letters"?

    An ordinary character followed by one or more combining or modifier characters can be considered as some kind of composite character, and it is this composite character that you are referring to as a "multi-byte character".

    Now we get to Unicode normalization, because some of these "composite characters" have two possible representations: a "composed form" using a single codepoint, and a "decomposed form" comprising a base character followed by one or more modifiers. When you use the translate() function in XSLT, the result depends on which form the data takes, and you can force it into either form by using the normalize-unicode() function.
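
    To make the two forms visible, here is a small sketch (assuming an XSLT 2.0 processor) that prints the codepoints of the same text after normalizing to the composed form (NFC) and the decomposed form (NFD). The sample characters are taken from the question; the values in the comments are the codepoints I would expect.

    ```xml
    <?xml version="1.0"?>
    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text" encoding="UTF-8"/>
      <xsl:template match="/">
        <!-- Composed form: à is the single codepoint 224 (U+00E0) -->
        <xsl:value-of select="string-to-codepoints(normalize-unicode('à', 'NFC'))"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Decomposed form: 97 768, i.e. a followed by U+0300 COMBINING GRAVE ACCENT -->
        <xsl:value-of select="string-to-codepoints(normalize-unicode('à', 'NFD'))"/>
        <xsl:text>&#10;</xsl:text>
        <!-- ɔ́ has no precomposed codepoint, so even NFC leaves two: 596 769 -->
        <xsl:value-of select="string-to-codepoints(normalize-unicode('ɔ́', 'NFC'))"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:template>
    </xsl:stylesheet>
    ```

    This is also why calling normalize-unicode() with a single argument appeared to do nothing in the question: the default form is NFC, which composes wherever it can and leaves the remaining combining marks in place.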

    If you are trying to remove modifiers (such as diacritical marks) from the input then you can force the string into decomposed form, and then call replace() to remove codepoints in the relevant character category (or categories).
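
    A minimal sketch of that approach, assuming an XSLT 2.0 processor and assuming that the Mark category (`\p{M}`) is the one you want to strip. The variable names are only illustrative.

    ```xml
    <?xml version="1.0"?>
    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text" encoding="UTF-8"/>
      <xsl:template match="/">
        <xsl:variable name="testwords" select="'à wɔ́rɔ, yɛrɛ, wùri'"/>
        <!-- NFD splits each precomposed letter into base letter + combining mark(s) -->
        <xsl:variable name="decomposed" select="normalize-unicode($testwords, 'NFD')"/>
        <!-- \p{M} matches any codepoint in the Mark categories (Mn, Mc, Me) -->
        <xsl:value-of select="replace($decomposed, '\p{M}', '')"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:template>
    </xsl:stylesheet>
    ```

    For the sample string this should give `a wɔrɔ, yɛrɛ, wuri`: ɔ and ɛ survive because they are letters in their own right, not marks. The `\P{IsBasicLatin}` pattern in the question removed them because it deletes every codepoint outside U+0000 to U+007F, not just the diacritics.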