
XSLT transformation to deduplicate objects


I'm serializing Java objects to XML and later need to transform that XML via XSLT.

In this particular transformation, I need to delete a node (=Java field) called fieldToDelete, which can contain a bunch of irrelevant information. However, it can also contain nodes called gemeinde:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <fieldToDelete>
        <gemeinde id="10">
            <child1/>
        </gemeinde>
    </fieldToDelete>
    <fieldToDelete>
        <gemeinde id="20">
            <child2/>
        </gemeinde>
    </fieldToDelete>
    <gemeinde reference="10"/>
    <gemeinde reference="20"/>
    <gemeinde reference="10"/>
    <gemeinde reference="20"/>
</root>

I delete these nodes with

<xsl:template match="fieldToDelete"/>

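However, because this also deletes the full <gemeinde> definitions nested inside, the remaining references would point at nothing, so I need to restore the full object where it is first referenced.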

The result should look like this:

<root>
    <gemeinde id="10">
        <child1/>
    </gemeinde>
    <gemeinde id="20">
        <child2/>
    </gemeinde>
    <gemeinde reference="10"/>
    <gemeinde reference="20"/>
</root>

The fieldToDelete is gone, and for each id the first <gemeinde reference="1234"> is replaced with the full object, which originally resided inside the fieldToDelete.

My approach:

  1. Duplication: Replace all references to gemeinde with the original, full gemeinde-object.
  2. Delete the unwanted field
  3. De-duplicate: Replace all but the first occurrence of <gemeinde> with a reference.

This is my XSLT:

<!-- For all <gemeinde reference="<id>">, copy the referenced object instead -->
<xsl:template match="gemeinde[@reference]">
    <xsl:variable name="refId" select="@reference"/>
    <xsl:copy-of select="//gemeinde[@id = $refId]"/>
</xsl:template>

<!-- Delete obsolete field (which contains <gemeinde> sub-element) -->
<xsl:template match="fieldToDelete"/>

<!-- We might now have duplicate <gemeinde> definitions, replace all but the first one with a reference instead -->
<xsl:template match="gemeinde[@id]">
    <xsl:variable name="refId" select="@id"/>

    <xsl:if test="preceding::gemeinde[@id=$refId]">
        <xsl:copy>
            <!-- Rename 'id' attribute to 'reference' -->
            <xsl:attribute name="reference">
                <xsl:value-of select="@id"/>
            </xsl:attribute>
        </xsl:copy>
    </xsl:if>
</xsl:template>

However, this code (in particular the "preceding" clause) doesn't seem to work correctly. The full <gemeinde> nodes are not replaced with references.

Caveats:

  • The id/reference is important. There can be multiple gemeinde objects with different IDs, and they must not be mixed up.
  • It's possible that the <fieldToDelete> does not contain a <gemeinde id="1234">, but a <gemeinde reference="1234">. This can happen if a <gemeinde> node already appeared before the fieldToDelete in the XML. (The first occurrence of gemeinde always has an id; all other nodes only have a reference.)

Solution

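  • Your approach can only work if you chain two transformation steps, i.e. either run two separate stylesheets one after the other or use two modes within one stylesheet. Templates are matched against the nodes of the input tree, so the <gemeinde> elements emitted by xsl:copy-of are never processed by your gemeinde[@id] template, and the preceding axis only ever sees the original input, not the output built so far.

    I would suggest using keys instead: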

      <xsl:key name="gemeinde-by-id" match="gemeinde" use="@id"/>
      <xsl:key name="gemeinde-by-ref" match="gemeinde" use="@reference"/>
      
      <xsl:template match="fieldToDelete"/>
      
      <xsl:template match="gemeinde[@reference][. is key('gemeinde-by-ref', @reference)[1]]">
        <xsl:sequence select="key('gemeinde-by-id', @reference)"/>
      </xsl:template>
    
      <xsl:mode on-no-match="shallow-copy"/>
    

    That is XSLT 3 code. If you use an XSLT 1 processor, you need to use xsl:copy-of instead of xsl:sequence and rewrite . is key('gemeinde-by-ref', @reference)[1] as generate-id() = generate-id(key('gemeinde-by-ref', @reference)[1]).

    Furthermore, the xsl:mode declaration needs to be spelled out as an identity template:

    <xsl:template match="@* | node()">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:template>
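
Since the question mentions Java, here is a minimal sketch of running the XSLT 1 rewrite described above through the JDK's built-in JAXP processor, which as far as I know only implements XSLT 1.0. The class name GemeindeKeyTransform, the transform helper, and the XSLT and INPUT constants are made up for this illustration; the stylesheet itself is just the key-based templates from this answer with the two XSLT 1 substitutions applied.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, not part of the original question or answer.
class GemeindeKeyTransform {

    // XSLT 1.0 rewrite of the key-based stylesheet.
    static final String XSLT = String.join("\n",
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>",
        "  <xsl:key name='gemeinde-by-id' match='gemeinde' use='@id'/>",
        "  <xsl:key name='gemeinde-by-ref' match='gemeinde' use='@reference'/>",
        "  <!-- Identity template: replaces xsl:mode on-no-match='shallow-copy' -->",
        "  <xsl:template match='@* | node()'>",
        "    <xsl:copy><xsl:apply-templates select='@* | node()'/></xsl:copy>",
        "  </xsl:template>",
        "  <!-- Drop the unwanted field together with everything inside it -->",
        "  <xsl:template match='fieldToDelete'/>",
        "  <!-- First reference per id: emit the full object instead -->",
        "  <xsl:template match=\"gemeinde[@reference][generate-id() = "
            + "generate-id(key('gemeinde-by-ref', @reference)[1])]\">",
        "    <xsl:copy-of select=\"key('gemeinde-by-id', @reference)\"/>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    // Sample input from the question, condensed onto fewer lines.
    static final String INPUT = String.join("\n",
        "<root>",
        "  <fieldToDelete><gemeinde id='10'><child1/></gemeinde></fieldToDelete>",
        "  <fieldToDelete><gemeinde id='20'><child2/></gemeinde></fieldToDelete>",
        "  <gemeinde reference='10'/>",
        "  <gemeinde reference='20'/>",
        "  <gemeinde reference='10'/>",
        "  <gemeinde reference='20'/>",
        "</root>");

    // Applies a stylesheet to an XML string and returns the serialized result.
    // Checked JAXP exceptions are wrapped so callers need no throws clause.
    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XSLT, INPUT));
    }
}
```

Apart from whitespace, the printed document should have the same shape as the desired result in the question: the two full objects first, followed by the two remaining references.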
    

    Running the XSLT 3 version in an online fiddle, I get the result:

    <?xml version="1.0" encoding="UTF-8"?><root>
        
        
        <gemeinde id="10">
                <child1/>
            </gemeinde>
        <gemeinde id="20">
                <child2/>
            </gemeinde>
        <gemeinde reference="10"/>
        <gemeinde reference="20"/>
    </root>
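
For comparison, the chained two-step variant mentioned at the start of this answer might look roughly like this when driven from Java. It is only a sketch under assumptions: the class name GemeindeTwoPass, the constants HEAD, PASS_1, PASS_2 and INPUT, and the transform helper are invented for illustration, and both passes are written as XSLT 1.0 so that the JDK's built-in processor can run them. The first pass expands every reference and deletes the field; the second pass runs on the output of the first and turns every repeated full object back into a reference.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, not part of the original question or answer.
class GemeindeTwoPass {

    // Shared stylesheet prologue: key on @id plus the identity template.
    static final String HEAD =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:key name='gemeinde-by-id' match='gemeinde' use='@id'/>"
        + "<xsl:template match='@* | node()'>"
        + "<xsl:copy><xsl:apply-templates select='@* | node()'/></xsl:copy>"
        + "</xsl:template>";

    // Pass 1: delete the field and replace every reference with the full object.
    static final String PASS_1 = HEAD
        + "<xsl:template match='fieldToDelete'/>"
        + "<xsl:template match='gemeinde[@reference]'>"
        + "<xsl:copy-of select=\"key('gemeinde-by-id', @reference)\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Pass 2: every full object that is not the first one with its id
    // becomes a reference again.
    static final String PASS_2 = HEAD
        + "<xsl:template match=\"gemeinde[@id][generate-id() != "
        + "generate-id(key('gemeinde-by-id', @id)[1])]\">"
        + "<gemeinde reference='{@id}'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Sample input from the question, condensed onto fewer lines.
    static final String INPUT = String.join("\n",
        "<root>",
        "  <fieldToDelete><gemeinde id='10'><child1/></gemeinde></fieldToDelete>",
        "  <fieldToDelete><gemeinde id='20'><child2/></gemeinde></fieldToDelete>",
        "  <gemeinde reference='10'/>",
        "  <gemeinde reference='20'/>",
        "  <gemeinde reference='10'/>",
        "  <gemeinde reference='20'/>",
        "</root>");

    // Applies a stylesheet to an XML string and returns the serialized result.
    // Checked JAXP exceptions are wrapped so callers need no throws clause.
    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String expanded = transform(PASS_1, INPUT);   // all references expanded
        System.out.println(transform(PASS_2, expanded)); // duplicates collapsed
    }
}
```

The second pass works here only because it sees the output of the first pass as its own input document, which is exactly what a single-pass stylesheet cannot do. The single-pass key solution avoids the intermediate document entirely, so I would still prefer it.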