split xml into tables and add ids


I'd like to:

  1. transform the nested xml into two tables/rectangles with xslt. (I've written the tables as csv, but it's fine if they're non-nested xml.)
  2. arbitrarily assign ids at both levels (eg, patient & tumor). (I've used letters so they stand out in this example, but sequential digits are probably preferable.)
<?xml version="1.0"?>
<NaaccrData baseDictionaryUri="http://naaccr.org/naaccrxml/naaccr-dictionary-230.xml" recordType="A" timeGenerated="2024-01-02T03:04:05.006-07:00"  specificationVersion="1.6">
    <Item naaccrId="recordType">A</Item>
    <Item naaccrId="registryType">01</Item>
    <Item naaccrId="registryId">02</Item>
    <Patient>
        <Item naaccrNum="1" naaccrId="p1">101</Item>
        <Item naaccrNum="2" naaccrId="p2">102</Item>
        <Item naaccrNum="3" naaccrId="p3">103</Item>
        <Tumor>
            <Item naaccrNum="11" naaccrId="t1">111</Item>
            <Item naaccrNum="12" naaccrId="t2">112</Item>
        </Tumor>
        <Tumor>
            <Item naaccrNum="11" naaccrId="t1">121</Item>
            <!-- notice 122 is missing -->
        </Tumor>
    </Patient>
    <Patient>
        <Item naaccrNum="1" naaccrId="p1">201</Item>
        <Item naaccrNum="2" naaccrId="p2">202</Item>
        <Item naaccrNum="3" naaccrId="p3">203</Item>
        <Tumor>
            <Item naaccrNum="11" naaccrId="t1">211</Item>
            <Item naaccrNum="12" naaccrId="t2">212</Item>
        </Tumor>
    </Patient>
</NaaccrData>

Desired rectangle 1 & desired rectangle 2:

pt_id,p1,p2,p3
a,101,102,103
b,201,202,203

tumor_id,pt_id,t1,t2
x,a,111,112
y,a,121,
z,b,211,212

This xslt adequately handles the outer/patient level, but not the inner/tumor level. It also doesn't assign ids.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>

    <xsl:template match="/table">
        <Root><xsl:apply-templates /></Root>
    </xsl:template>

    <xsl:template match="Patient|Patient/Tumor">
        <pt>
            <xsl:for-each select="Item">
              <xsl:element name="{@naaccrId}"><xsl:value-of select = "."/></xsl:element>
            </xsl:for-each>
        </pt>
    </xsl:template>
</xsl:stylesheet>

I'm using the lxml Python package, but I'm flexible on packages & languages. The xml files are 100+MB.

from lxml import etree

...

# load input
dom = etree.parse(path_raw)
print(dom)
# load XSLT
transform = etree.XSLT(etree.parse(path_xslt))
print(transform)

ds = transform(dom)
print(ds)

How do I produce those two desired rectangles? Is xslt the best approach for this? Am I asking too much of xslt?

Edit: In response to @y.arazim's comments below, here is some desired xml. I'm flexible on this, because the ultimate goal is to upload two tables to a database. I think this output would be good for that, but I'm new to this world and am open to suggestions. JSON would be fine with me too.

<pts>
  <pt>
    <p1>101</p1>
    <p2>102</p2>
    <p3>103</p3>
    <tumors>
      <tumor>
        <t1>111</t1>
        <t2>112</t2>
      </tumor>
      <tumor>
        <t1>121</t1>
        <!-- 122 is missing -->
      </tumor>
    </tumors>
  </pt>
  <pt>
    <p1>201</p1>
    <p2>202</p2>
    <p3>203</p3>
    <tumors>
      <tumor>
        <t1>211</t1>
        <t2>212</t2>
      </tumor>
    </tumors>
  </pt>
</pts>

Solution

  • Maybe something like this could work for you:

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    
    <xsl:template match="/NaaccrData">
    
        <!-- patients -->
        <xsl:text>pt_id,p1,p2,p3&#10;</xsl:text>
        <xsl:for-each select="Patient">
            <xsl:value-of select="generate-id()"/>
            <xsl:text>,</xsl:text>
            <xsl:for-each select="Item">
                <xsl:value-of select="."/>
                <xsl:if test="position()!=last()">,</xsl:if>
            </xsl:for-each>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
        <xsl:text>&#10;</xsl:text>
        
        <!-- tumors -->
        <xsl:text>tumor_id,pt_id,t1,t2&#10;</xsl:text>
        <xsl:for-each select="Patient">
            <xsl:variable name="pt_id" select="generate-id()" />
            <xsl:for-each select="Tumor">
                <xsl:value-of select="generate-id()"/>
                <xsl:text>,</xsl:text>
                <xsl:value-of select="$pt_id"/>
                <xsl:text>,</xsl:text>
                <xsl:for-each select="Item">
                    <xsl:value-of select="."/>
                    <xsl:if test="position()!=last()">,</xsl:if>
                </xsl:for-each>
                <xsl:text>&#10;</xsl:text>
            </xsl:for-each>
        </xsl:for-each>
    </xsl:template>
    
    </xsl:stylesheet>
    

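    This assumes that every patient has the same items in the same order, and likewise for every tumor. Where an item is missing (as with the second tumor in your sample), the row comes out short or shifted rather than padded with an empty field. Note also that the ids returned by generate-id() are processor-dependent strings, not sequential numbers.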
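Since you said you are flexible on languages, here is a minimal Python sketch of the same two tables that does not rely on that assumption. It looks each value up by its naaccrId, so a missing item becomes an empty field, and it numbers patients and tumors sequentially. The names `PT_COLS`, `TUMOR_COLS`, `items` and `split_tables` are made up for illustration, the column lists are taken from your sample, and it assumes the elements carry no XML namespace (as in your sample; a real file with a default namespace would need qualified tag names).

```python
import csv
import xml.etree.ElementTree as ET

PT_COLS = ["p1", "p2", "p3"]   # assumed patient-level naaccrId values (from the sample)
TUMOR_COLS = ["t1", "t2"]      # assumed tumor-level naaccrId values (from the sample)


def items(elem):
    """Map naaccrId -> text for the Item elements directly under elem."""
    return {i.get("naaccrId"): (i.text or "") for i in elem.findall("Item")}


def split_tables(source, pt_out, tumor_out):
    """Stream `source` and write one CSV row per Patient and one per Tumor."""
    pt_csv = csv.writer(pt_out, lineterminator="\n")
    tumor_csv = csv.writer(tumor_out, lineterminator="\n")
    pt_csv.writerow(["pt_id"] + PT_COLS)
    tumor_csv.writerow(["tumor_id", "pt_id"] + TUMOR_COLS)

    pt_id = tumor_id = 0
    # An "end" event fires once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Patient":
            continue
        pt_id += 1
        row = items(elem)
        pt_csv.writerow([pt_id] + [row.get(c, "") for c in PT_COLS])
        for tumor in elem.findall("Tumor"):
            tumor_id += 1
            row = items(tumor)
            # pt_id doubles as the foreign key back to the patient table.
            tumor_csv.writerow([tumor_id, pt_id] + [row.get(c, "") for c in TUMOR_COLS])
        elem.clear()  # discard the finished Patient's children to limit memory use
```

`source` can be a file path or a binary file object, and the two outputs are any writable text files (opened with `newline=""` when writing real CSV files). Because the file is read with `iterparse` rather than loaded whole, this may suit 100+MB inputs better than parsing the full tree first, though I have not measured it.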


    Added part:

    The XML output you added would be easy to produce using:

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>
    
    <xsl:template match="/NaaccrData">
        <pts>
            <xsl:apply-templates select="Patient"/>
        </pts>
    </xsl:template>
        
    <xsl:template match="Patient">
        <pt>
            <xsl:apply-templates select="Item"/>
            <tumors>
                <xsl:apply-templates select="Tumor"/>
            </tumors>
        </pt>
    </xsl:template>
    
    <xsl:template match="Tumor">
        <tumor>
            <xsl:apply-templates select="Item"/>
        </tumor>
    </xsl:template> 
    
    <xsl:template match="Item">
        <xsl:element name="{@naaccrId}">
            <xsl:value-of select = "."/>
        </xsl:element>
    </xsl:template>
    
    </xsl:stylesheet>
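
You mentioned JSON would be fine too. As a rough sketch, the same nested shape can be built in Python with only the standard library. The function names `items`, `to_nested` and `to_json` are hypothetical, the code again assumes no XML namespace, and unlike a streaming approach it expects the whole document to be parsed into memory first.

```python
import json
import xml.etree.ElementTree as ET


def items(elem):
    """Map naaccrId -> text for the Item elements directly under elem."""
    return {i.get("naaccrId"): i.text for i in elem.findall("Item")}


def to_nested(root):
    """Return one dict per Patient, each holding a 'tumors' list of dicts."""
    pts = []
    for pt in root.findall("Patient"):
        record = items(pt)
        record["tumors"] = [items(t) for t in pt.findall("Tumor")]
        pts.append(record)
    return pts


def to_json(root):
    """Serialize the nested patient/tumor structure as indented JSON."""
    return json.dumps(to_nested(root), indent=2)
```

With `root = ET.parse(path_raw).getroot()`, calling `to_json(root)` gives a list of patients in which a missing item (such as t2 on the second tumor) is simply absent from that tumor's object.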