I'd like to:
<?xml version="1.0"?>
<NaaccrData baseDictionaryUri="http://naaccr.org/naaccrxml/naaccr-dictionary-230.xml" recordType="A" timeGenerated="2024-01-02T03:04:05.006-07:00" specificationVersion="1.6">
<Item naaccrId="recordType">A</Item>
<Item naaccrId="registryType">01</Item>
<Item naaccrId="registryId">02</Item>
<Patient>
<Item naaccrNum="1" naaccrId="p1">101</Item>
<Item naaccrNum="2" naaccrId="p2">102</Item>
<Item naaccrNum="3" naaccrId="p3">103</Item>
<Tumor>
<Item naaccrNum="11" naaccrId="t1">111</Item>
<Item naaccrNum="12" naaccrId="t2">112</Item>
</Tumor>
<Tumor>
<Item naaccrNum="11" naaccrId="t1">121</Item>
<!-- notice 122 is missing -->
</Tumor>
</Patient>
<Patient>
<Item naaccrNum="1" naaccrId="p1">201</Item>
<Item naaccrNum="2" naaccrId="p2">202</Item>
<Item naaccrNum="3" naaccrId="p3">203</Item>
<Tumor>
<Item naaccrNum="11" naaccrId="t1">211</Item>
<Item naaccrNum="12" naaccrId="t2">212</Item>
</Tumor>
</Patient>
</NaaccrData>
Desired rectangle 1 & desired rectangle 2:
pt_id,p1,p2,p3
a,101,102,103
b,201,202,203
tumor_id,pt_id,t1,t2
x,a,111,112
y,a,121,
z,b,211,212
This xslt adequately handles the outer/patient level, but not the inner/tumor level. It also doesn't assign ids.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes"/>
<xsl:template match="/table">
<Root><xsl:apply-templates /></Root>
</xsl:template>
<xsl:template match="Patient|Patient/Tumor">
<pt>
<xsl:for-each select="Item">
<xsl:element name="{@naaccrId}"><xsl:value-of select = "."/></xsl:element>
</xsl:for-each>
</pt>
</xsl:template>
</xsl:stylesheet>
I'm using the lxml Python package, but I'm flexible on packages & languages. The xml files are 100+MB.
from lxml import etree
...
# load input
dom = etree.parse(path_raw)
print(dom)
# load XSLT
transform = etree.XSLT(etree.parse(path_xslt))
print(transform)
ds = transform(dom)
print(ds)
How do I produce those two desired rectangles? Is xslt the best approach for this? Am I asking too much of xslt?
Edit: In response to @y.arazim's comments below, here is some desired xml. I'm flexible on this, because the ultimate goal is to upload two tables to a database. I think this output would be good for that, but I'm new to this world and am open to suggestions. Json would be fine with me too.
<pts>
<pt>
<p1>101</p1>
<p2>102</p2>
<p3>103</p3>
<tumors>
<tumor>
<t1>111</t1>
<t2>112</t2>
</tumor>
<tumor>
<t1>121</t1>
<!-- 122 is missing -->
</tumor>
</tumors>
</pt>
<pt>
<p1>201</p1>
<p2>202</p2>
<p3>203</p3>
<tumors>
<tumor>
<t1>211</t1>
<t2>212</t2>
</tumor>
</tumors>
</pt>
</pts>
Maybe something like this could work for you:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/NaaccrData">
<!-- patients -->
<xsl:text>pt_id,p1,p2,p3 </xsl:text>
<xsl:for-each select="Patient">
<xsl:value-of select="generate-id()"/>
<xsl:text>,</xsl:text>
<xsl:for-each select="Item">
<xsl:value-of select="."/>
<xsl:if test="position()!=last()">,</xsl:if>
</xsl:for-each>
<xsl:text> </xsl:text>
</xsl:for-each>
<xsl:text> </xsl:text>
<!-- tumors -->
<xsl:text>tumor_id,pt_id,t1,t2 </xsl:text>
<xsl:for-each select="Patient">
<xsl:variable name="pt_id" select="generate-id()" />
<xsl:for-each select="Tumor">
<xsl:value-of select="generate-id()"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="$pt_id"/>
<xsl:text>,</xsl:text>
<xsl:for-each select="Item">
<xsl:value-of select="."/>
<xsl:if test="position()!=last()">,</xsl:if>
</xsl:for-each>
<xsl:text> </xsl:text>
</xsl:for-each>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
There is an assumption here that each patient has the same number of items, and likewise for tumors.
Added part:
The XML output you added would be easy to produce using:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/NaaccrData">
<pts>
<xsl:apply-templates select="Patient"/>
</pts>
</xsl:template>
<xsl:template match="Patient">
<pt>
<xsl:apply-templates select="Item"/>
<tumors>
<xsl:apply-templates select="Tumor"/>
</tumors>
</pt>
</xsl:template>
<xsl:template match="Tumor">
<tumor>
<xsl:apply-templates select="Item"/>
</tumor>
</xsl:template>
<xsl:template match="Item">
<xsl:element name="{@naaccrId}">
<xsl:value-of select = "."/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>