
XSLT remove child nodes and keep whitespaces with punctuation


I have an XML file in <mixed-citation> format, which includes some untagged content such as whitespace and punctuation:

<ref>
    <mixed-citation publication-type="book">
        <collab>Collab</collab>. <source>Source</source>. <publisher-loc>Location</publisher-loc>: <publisher-name>Name</publisher-name>; <month>Jul</month> <year>2020</year>. [comment].
        <uri xlink:href="https://www.google.com" xmlns:xlink="http://www.w3.org/1999/xlink">URL</uri>
    </mixed-citation>
</ref>

So far I have managed to build this semi-functional XSLT, which copies all node values, is meant to keep whitespace and punctuation, and also removes the two child nodes "month" and "uri":

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    
    <xsl:output method="html" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>
    <xsl:mode on-no-match="shallow-skip"/>
    
    <xsl:template match="ref">
        <html>
            <p>
                <xsl:apply-templates/>
            </p>
        </html>
    </xsl:template>
    
    <xsl:template match="ref/mixed-citation">
        <p>
            <xsl:apply-templates/>
        </p>
    </xsl:template>
    
    <xsl:template match="ref//text()">
        <xsl:value-of select='normalize-space()'/>
    </xsl:template>
    
    <xsl:template match="ref//month">
    </xsl:template>
    
    <xsl:template match="ref//uri">
    </xsl:template>
    
</xsl:stylesheet>

I would like to create a simple HTML output file that looks like this:

<html>
   <p>
      <p>Collab. Source. Location: Name; 2020. [comment].</p>
   </p>
</html>

But with the provided XSLT file I am getting the wrong output:

<html>
   <p>
      <p>Collab.Source.Location:Name;2020. [comment].</p>
   </p>
</html>

What am I doing wrong? Is there maybe an alternative approach to this without using identity transform?

UPDATE:

With the solution provided below by @zx485, the output is correct only if <month> and <uri> are both excluded. If I keep them in the output, the result is wrong:

<p>Collab. Source. Location: Name; Jul2020. [comment].URL</p>

It should be:

<p>Collab. Source. Location: Name; Jul 2020. [comment]. URL</p>

The transformation should simply process all the tags, no matter which children are excluded, and always leave the whitespace and punctuation that sit between the tags in place. It should only strip leading/trailing spaces inside tags if they accidentally appear, e.g. <month> Jul </month> should become <month>Jul</month>.

Also, the doubled output was my mistake, I fixed the output above.


Solution

  • You can condense your set of templates to the following:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
        
        <xsl:output method="html" encoding="UTF-8" indent="yes"/>
        <xsl:strip-space elements="*"/>
        <xsl:mode on-no-match="shallow-skip"/>
        
        <xsl:template match="ref">
            <html>
                <p>
                    <xsl:apply-templates/>
                </p>
            </html>
        </xsl:template>
        
        <xsl:template match="ref/mixed-citation">
            <p>
                <xsl:apply-templates/>
            </p>
        </xsl:template>
        
        <xsl:template match="mixed-citation/*/text() | mixed-citation/text()[last()]">
            <xsl:value-of select='normalize-space(.)'/>
        </xsl:template>
    
        <xsl:template match="mixed-citation/text()[position() != last()]">
            <xsl:value-of select='.'/>
        </xsl:template>    
    
        <xsl:template match="ref//(month|uri)" />
        
    </xsl:stylesheet>
    

    The above set of templates copies every text() child of mixed-citation unchanged, except the last() one, and omits all month and uri elements that are descendants of ref.

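    The mixed-citation/*/text() | mixed-citation/text()[last()] template rule applies normalize-space(), which removes leading and trailing whitespace (and collapses internal runs of whitespace), to the text inside the child elements of mixed-citation and to the last() text() child of mixed-citation itself.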
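
    To see which text() children the position predicates actually select, a small diagnostic stylesheet can help. This is only a sketch, not part of the solution: it keeps the same xsl:strip-space setting and prints each direct text() child of mixed-citation with its position, in square brackets.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

        <xsl:output method="text"/>
        <!-- Same setting as in the solution: whitespace-only text nodes are removed from the input. -->
        <xsl:strip-space elements="*"/>

        <xsl:template match="/">
            <!-- One line per direct text() child: its position, then its content in brackets. -->
            <xsl:for-each select="//mixed-citation/text()">
                <xsl:value-of select="concat(position(), ': [', ., ']&#10;')"/>
            </xsl:for-each>
        </xsl:template>

    </xsl:stylesheet>
    ```

    With the sample input this should list five nodes (". ", ". ", ": ", "; " and the one starting with ". [comment]."). The single space between <month> and <year> is not among them, because xsl:strip-space removes whitespace-only text nodes before any template runs.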

    The result is as desired:

    <!DOCTYPE HTML>
    <html>
       <p>
          <p>Collab. Source. Location: Name; 2020. [comment].</p>
       </p>
    </html>
    

    This solution does not double the output.
    If that's really what you wanted and not an error, you'd have to double the <p><xsl:apply-templates/></p> in the <xsl:template match="ref/mixed-citation"> template.
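
    Regarding the UPDATE in the question: the space between <month> and <year> is a whitespace-only text node, so xsl:strip-space removes it, and the trailing whitespace before <uri> is trimmed by the last() rule. One possible way around this, as a sketch under some assumptions: do not strip whitespace at all, collect the citation text in a variable, and normalize it once at the end. It assumes the citation should end up as plain text (no markup inside the inner <p>) and that collapsing every run of whitespace to a single space is acceptable. The variable name citation is made up for this example. The sketch avoids xsl:mode, which is an XSLT 3.0 declaration, and relies on the built-in template rules instead.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

        <xsl:output method="html" encoding="UTF-8" indent="yes"/>
        <!-- No xsl:strip-space here: the space between <month> and <year> has to survive. -->

        <xsl:template match="ref">
            <html>
                <p>
                    <!-- Select mixed-citation explicitly so the whitespace around it is not copied. -->
                    <xsl:apply-templates select="mixed-citation"/>
                </p>
            </html>
        </xsl:template>

        <xsl:template match="mixed-citation">
            <!-- Collect the text of the whole citation first ... -->
            <xsl:variable name="citation">
                <xsl:apply-templates/>
            </xsl:variable>
            <p>
                <!-- ... then trim both ends and collapse whitespace runs in one pass. -->
                <xsl:value-of select="normalize-space($citation)"/>
            </p>
        </xsl:template>

        <!-- Trim accidental padding inside child elements, e.g. <month> Jul </month>. -->
        <xsl:template match="mixed-citation/*/text()">
            <xsl:value-of select="normalize-space(.)"/>
        </xsl:template>

        <!-- Elements to drop. Remove a name from this pattern to keep that element's text. -->
        <xsl:template match="month | uri"/>

    </xsl:stylesheet>
    ```

    The direct text() children of mixed-citation (punctuation and spaces) are copied by the built-in text rule. With month and uri dropped, the inner paragraph should read "Collab. Source. Location: Name; 2020. [comment]."; with the last template removed, it should read "Collab. Source. Location: Name; Jul 2020. [comment]. URL".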