XSLT tokenize a string that's distributed across child elements

I have the feeling there's an obvious solution out there, but I can't think of it. Using XSLT 2.0 I want to tokenize a string that's distributed across child elements, so it's something like

    <font style="big">
        <text color="blue">wha</text>
    </font>
    <font style="small">
        <text color="red">t is o</text>
    </font>
    <font style="small">
        <text color="blue">n </text>
    </font>
    <font style="small">
        <text color="blue">his </text>
    </font>
    <font style="small">
        <text color="blue">mind.</text>
    </font>

I would like to tokenize the value of the string, i.e., split the string on blanks and punctuation marks, but still keep each segment in its tree structure. So what I want to get is:

    <token>
        <font style="big"><text color="blue">wha</text></font>
        <font style="small"><text color="red">t</text></font>
    </token>
    <token>
        <font style="small"><text color="red">is</text></font>
    </token>
    <token>
        <font style="small"><text color="red">o</text></font>
        <font style="small"><text color="blue">n</text></font>
    </token>
    <token>
        <font style="small"><text color="blue">his</text></font>
    </token>
    <token>
        <font style="small"><text color="blue">mind</text></font>
    </token>
    <token>
        <font style="small"><text color="blue">.</text></font>
    </token>

I.e., move every word and punctuation mark into a separate token element. With just a string, that's easy, and I could use either analyze-string or matches(), but I can't find an elegant and robust solution for this task.
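
For comparison, the plain-string case could be handled roughly as in the sketch below. It assumes a hypothetical input element <s> that holds the whole sentence as one text node, and a hypothetical <tokens> wrapper for the result; neither is part of the data above, and the punctuation set in the regex is only an illustration.

    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output indent="yes"/>
      <!-- Hypothetical input: <s>what is on his mind.</s> -->
      <xsl:template match="s">
        <tokens>
          <!-- A match is either one punctuation mark or a run of
               characters that are neither whitespace nor punctuation. -->
          <xsl:analyze-string select="." regex="[.,;?!]|[^\s.,;?!]+">
            <xsl:matching-substring>
              <token><xsl:value-of select="."/></token>
            </xsl:matching-substring>
            <!-- No xsl:non-matching-substring: whitespace is dropped. -->
          </xsl:analyze-string>
        </tokens>
      </xsl:template>
    </xsl:stylesheet>

This yields one token element per word and per punctuation mark, but it only works because the whole string sits in a single text node, which is exactly what the font/text structure above prevents.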

I'll be thrilled to hear your ideas, Ruprecht


  • This does half the job: it tokenises the strings, but it doesn't add your <token> markup because, if I understand it correctly, that requires a dictionary lookup to recognise words. It produces

       <font style="big">
          <text color="blue">wha</text>
       </font>
       <font style="small">
          <text color="red">t</text>
       </font>
       <font style="small">
          <text color="red">is</text>
       </font>
       <font style="small">
          <text color="red">o</text>
       </font>
       <font style="small">
          <text color="blue">n</text>
       </font>
       <font style="small">
          <text color="blue">his</text>
       </font>
       <font style="small">
          <text color="blue">mind.</text>
       </font>


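    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:strip-space elements="*"/>
    <xsl:output indent="yes"/>
    <xsl:template match="*">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates/>
      </xsl:copy>
    </xsl:template>
    <!-- Emit one font/text pair per whitespace-separated token. -->
    <xsl:template match="font">
     <xsl:variable name="fa" select="@*"/>
     <xsl:for-each select="text">
      <xsl:variable name="ta" select="@*"/>
      <xsl:for-each select="text()/tokenize(.,'\s+')[.]">
        <font>
          <xsl:copy-of select="$fa"/>
          <text>
            <xsl:copy-of select="$ta"/>
            <xsl:value-of select="."/>
          </text>
        </font>
      </xsl:for-each>
     </xsl:for-each>
    </xsl:template>
    </xsl:stylesheet>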

    OK, updated after the clarification in the comments; it now generates

          <token>
             <font style="big"><text color="blue">wha</text></font>
             <font style="small"><text color="red">t</text></font>
          </token>
          <token>
             <font style="small"><text color="red">is</text></font>
          </token>
          <token>
             <font style="small"><text color="red">o</text></font>
             <font style="small"><text color="blue">n</text></font>
          </token>
          <token>
             <font style="small"><text color="blue">his</text></font>
          </token>
          <token>
             <font style="small"><text color="blue">mind</text></font>
          </token>
          <token>
             <font style="small"><text color="blue">.</text></font>
          </token>


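    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:strip-space elements="*"/>
     <xsl:output indent="yes"/>
     <xsl:template match="*">
      <xsl:copy>
       <xsl:copy-of select="@*"/>
       <xsl:apply-templates/>
      </xsl:copy>
     </xsl:template>
     <!-- Parent of the font elements: collect a flat list with <tok/> boundary markers, then wrap each group in <token>. -->
     <xsl:template match="*[font]">
      <xsl:copy>
       <xsl:copy-of select="@*"/>
       <xsl:variable name="p1">
        <xsl:apply-templates/>
       </xsl:variable>
       <xsl:for-each-group  select="$p1/*" group-starting-with="tok">
        <token>
         <xsl:copy-of select="current-group() except self::tok"/>
        </token>
       </xsl:for-each-group>
      </xsl:copy>
     </xsl:template>
     <xsl:template match="font">
      <xsl:variable name="fa" select="@*"/>
      <xsl:for-each select="text">
       <xsl:variable name="ta" select="@*"/>
       <xsl:if test="position()=1 and matches(.,'^\s')"><tok/></xsl:if>
       <xsl:for-each select="text()/tokenize(.,'\s+')[.]">
        <xsl:if test="position()!=1"><tok/></xsl:if>
        <xsl:analyze-string regex="[.,;?]" select=".">
         <xsl:matching-substring>
          <tok/> <!-- punctuation starts a token of its own -->
          <font>
           <xsl:copy-of select="$fa"/>
           <text><xsl:copy-of select="$ta"/><xsl:value-of select="."/></text>
          </font>
         </xsl:matching-substring>
         <xsl:non-matching-substring>
          <font>
           <xsl:copy-of select="$fa"/>
           <text><xsl:copy-of select="$ta"/><xsl:value-of select="."/></text>
          </font>
         </xsl:non-matching-substring>
        </xsl:analyze-string>
       </xsl:for-each>
       <xsl:if test="position()=last() and matches(.,'\s$')"><tok/></xsl:if>
      </xsl:for-each>
     </xsl:template>
    </xsl:stylesheet>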