
Refactoring Re-nesting elements with XSLT or XQuery Scripts


I’m currently refactoring batches of XML documents, which involves restructuring the XML to conform to a newly revised DTD. Under the new DTD, many of the elements originally used have been repurposed, re-nested inside other elements, or deleted altogether. The example below is invalid when validated against the new DTD. To speed up the refactoring, I thought an XQuery script or an XSLT transformation might help. However, I have zero experience with either and am still rather new to XML. Could someone explain which language (XQuery, XSLT, or XPath) would be most relevant for restructuring these documents?

Invalid XML:

<PartsDoc foo="" baa="" bar="" revno="" docno="">
    <PartsDocInfo>
        <repairlvl level="shop" />
        <title id="123"> Foo Electrical Control Box </title>
    </PartsDocInfo>

    <Parts.Category>
        <figure id="123">
            <title id="123"> Control Box Panels </title>

            <subfig id="123">
                <graphic img="foo.jpg" />
            </subfig>
            <!-- everything above is valid, the below portion is not -->

            <parts.item>
                <callout id="123" config="123" label="1" />
                <mrs service="shop" sc="" mc="" rec="" />
                <nsn niin="00-123-4567"> 4444-00-123-5467</nsn>
                <cageno>12345</cageno>
                <partno>12345</partno>
                <name/>
                <desc id="123"> Bolt 1/2inch </desc>
                <qty>4</qty>
            </parts.item>
        </figure>
    </Parts.Category>
</PartsDoc>

Desired output:

<PartsDoc foo="" baa="" bar="" revno="" docno="">
    <PartsDocInfo>
        <repairlvl level="shop" />
        <title id="123"> Foo Electrical Control Box </title>
    </PartsDocInfo>
    <Parts.Category>
        <figure id="123">
            <title id="123"> Control Box Panels </title>
            <subfig id="123">
                <graphic img="foo.jpg" />
            </subfig>
            <parts.item>
                <callout id="123" config="123" label="1" />
                <qty>4</qty>
                <mrs service="shop" sc="" mc="" rec="" />
                <nsn>
                    <fsc>4444</fsc>
                    <niin>00-12-5467</niin>
                </nsn>
                <partno>12345</partno>
                <cageno>12345</cageno>
                <name/>
                <desc id="123"> Bolt 1/2inch </desc>
            </parts.item>
        </figure>
    </Parts.Category>
</PartsDoc>

Notes:
* <qty> has moved.
* <partno> has moved.
* <nsn> now includes child elements, with its contents divided between them.

Additionally, some instances include a <uoc> element nested in <desc> as a child:

<desc> 
    bolt 1/2inch
        <uoc>XYZ</uoc>
</desc>

Here <uoc> should actually come after <callout> and before <qty>.

Any help with an XSLT stylesheet or XQuery script would be greatly appreciated, along with a short explanation of why to choose one language over the other. I’m currently using the Oxygen XML Editor, version 17.


Solution

  • When substantial parts of the output are the same as the input, XSLT generally fits the bill better. The general principle is to write a stylesheet that contains a general rule to copy elements recursively, and then add rules for the elements where you want to do something different.

    In XSLT 3.0 the generic rule is:

    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      version="3.0">
      <xsl:mode on-no-match="shallow-copy"/>
    
      <!-- ... other code goes here ... -->
    </xsl:transform>
    

    While in earlier versions it is:

    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      version="2.0">
    
      <xsl:template match="*">
       <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates/>
       </xsl:copy>
      </xsl:template>
    
      <!-- ... other code goes here ... -->
    </xsl:transform>
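
    One thing to be aware of: the rule above matches only elements. Text is still copied by the built-in rules, but comments and processing instructions in the source are dropped. If those need to survive, the classic identity template (which also works in XSLT 1.0) is a common alternative. A minimal sketch, not tested against your real documents:

    ```xml
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      version="2.0">

      <!-- Identity template: copies every attribute and every node
           (elements, text, comments, processing instructions) unchanged. -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>

      <!-- More specific rules, such as one matching parts.item, go here
           and take precedence over the identity template. -->
    </xsl:transform>
    ```

    Either form works as the generic rule; the choice only matters if you care about keeping comments and processing instructions.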
    

    Your template rule to reorder parts.item can be written:

    <xsl:template match="parts.item">
      <parts.item>
        <xsl:copy-of select="callout"/>
        <xsl:copy-of select="qty"/>
        <xsl:copy-of select="mrs"/>
        <nsn>
          <fsc><xsl:value-of select="substring-before(nsn, '-')"/></fsc>
          <niin><xsl:value-of select="nsn/@niin"/></niin>
        </nsn>
        <xsl:copy-of select="partno"/>
        <xsl:copy-of select="cageno"/>
        <xsl:copy-of select="name"/>
        <xsl:copy-of select="desc"/>
      </parts.item>
    </xsl:template>
    

    Putting this together, the following XSLT 2.0 stylesheet:

    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        version="2.0">
    
        <xsl:strip-space elements="*"/>
        <xsl:output indent="yes"/>
    
        <xsl:template match="*">
            <xsl:copy>
                <xsl:copy-of select="@*"/>
                <xsl:apply-templates/>
            </xsl:copy>
        </xsl:template>
    
        <xsl:template match="parts.item">
            <parts.item>
                <xsl:copy-of select="callout"/>
                <xsl:copy-of select="qty"/>
                <xsl:copy-of select="mrs"/>
                <nsn>
                    <fsc><xsl:value-of select="substring-before(nsn, '-')"/></fsc>
                    <niin><xsl:value-of select="nsn/@niin"/></niin>
                </nsn>
                <xsl:copy-of select="partno"/>
                <xsl:copy-of select="cageno"/>
                <xsl:copy-of select="name"/>
                <xsl:copy-of select="desc"/>
            </parts.item>
        </xsl:template>
    </xsl:transform>
    

    applied to the following source document:

    <PartsDoc foo="" baa="" bar="" revno="" docno="" > 
        <PartsDocInfo>
            <repairlvl level="shop" /> 
            <title id="123"> Foo Electrical Control Box </title> 
        </PartsDocInfo> 
    
        <Parts.Category> 
    
            <figure id="123" >
            <title id="123"> Control Box Panels </title> 
    
             <subfig id="123">
                        <graphic img="foo.jpg" /> 
             </subfig>
                    <!-- everything above is valid, the below portion is not -->
    
                    <parts.item> 
                        <callout id="123"  config="123" label="1" /> 
                        <mrs service="shop" sc="" mc="" rec="" /> 
                        <nsn niin="00-123-4567"> 4444-00-123-5467</nsn> 
                        <cageno>12345</cageno>
                        <partno>12345</partno>
                        <name/>
                        <desc id="123" > Bolt 1/2inch </desc>
                        <qty>4</qty>
                    </parts.item>
            </figure>
        </Parts.Category>
    </PartsDoc>
    

    produces the following output:

    <?xml version="1.0" encoding="UTF-8"?>
    <PartsDoc foo="" baa="" bar="" revno="" docno="">
       <PartsDocInfo>
          <repairlvl level="shop"/>
          <title id="123"> Foo Electrical Control Box </title>
       </PartsDocInfo>
       <Parts.Category>
          <figure id="123">
             <title id="123"> Control Box Panels </title>
             <subfig id="123">
                <graphic img="foo.jpg"/>
             </subfig>
             <parts.item>
                <callout id="123" config="123" label="1"/>
                <qty>4</qty>
                <mrs service="shop" sc="" mc="" rec=""/>
                <nsn>
                   <fsc> 4444</fsc>
                   <niin>00-123-4567</niin>
                </nsn>
                <partno>12345</partno>
                <cageno>12345</cageno>
                <name/>
                <desc id="123"> Bolt 1/2inch </desc>
             </parts.item>
          </figure>
       </Parts.Category>
    </PartsDoc>
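
    Two details from the question are not covered by the stylesheet above: the <fsc> value keeps the leading space from the original <nsn> text, and a <uoc> nested in <desc> is not moved. A possible variant is sketched below. It assumes that <uoc> only ever appears as a direct child of <desc>, and that each parts.item has at most one <nsn>; adjust it if your data differs.

    ```xml
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        version="2.0">

        <xsl:strip-space elements="*"/>
        <xsl:output indent="yes"/>

        <!-- Generic rule: copy each element and its attributes, then recurse. -->
        <xsl:template match="*">
            <xsl:copy>
                <xsl:copy-of select="@*"/>
                <xsl:apply-templates/>
            </xsl:copy>
        </xsl:template>

        <xsl:template match="parts.item">
            <parts.item>
                <xsl:copy-of select="callout"/>
                <!-- Assumption: uoc, when present, is a child of desc;
                     emit it after callout and before qty. -->
                <xsl:copy-of select="desc/uoc"/>
                <xsl:copy-of select="qty"/>
                <xsl:copy-of select="mrs"/>
                <nsn>
                    <!-- normalize-space() trims the leading space from the nsn text. -->
                    <fsc><xsl:value-of select="normalize-space(substring-before(nsn, '-'))"/></fsc>
                    <niin><xsl:value-of select="nsn/@niin"/></niin>
                </nsn>
                <xsl:copy-of select="partno"/>
                <xsl:copy-of select="cageno"/>
                <xsl:copy-of select="name"/>
                <!-- Apply templates (rather than copy-of) so the rule
                     below can remove uoc from inside desc. -->
                <xsl:apply-templates select="desc"/>
            </parts.item>
        </xsl:template>

        <!-- Drop uoc from desc; it has already been emitted after callout. -->
        <xsl:template match="desc/uoc"/>
    </xsl:transform>
    ```

    With this variant the sample document yields <fsc>4444</fsc> without the leading space. The text inside <desc> is copied as it stands, so any whitespace that surrounded the removed <uoc> remains in the output.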