I am currently transferring an MS Access query into XML to then use XSLT 3.0 to input the XML into FrameMaker for publishing purposes. In the process, I have to parse through various Access RTF fields using David Carlisle's HTML parser (https://github.com/davidcarlisle/web-xslt/tree/main?tab=readme-ov-file) to transform information I have placed in different fields into nodes in XML.
I am trying to implement 2nd or 3rd level bulleting, which is not supported in RTF natively (at least as implemented in Access). To do this, I am creating text in the field that is then parsed into an XML element to indicate a second-level bullet. One problem I am running into is that Access's RTF field wraps any text in <div> and <font> nodes, so when I try to do this (note that the <list2> "nodes" are just text residing within a node that then has the HTMLParse applied to it):
- item 1
<list2>
- subitem 1
</list2>
It turns into something closer to this:
- item 1
<div><font><list2></font></div>
- subitem 1
<div><font></list2></font></div>
This of course doesn't work, as it is not well-formed XML. Is there a way to convert this using XSLT to the initial block, where the subitem is contained within a list node? One option is to start with
- item 1
<div><font><list2></list2></font></div>
- subitem 1
<div><font><list2></list2></font></div>
if that helps, but I don't know how to get there from here. There may also be some additional issues in that I don't understand node traversal very well, but these list nodes are not nodes in the input XML but are created through the HTMLParse.
Edit: I was trying to simplify things in my example, but I left out important details. I tried to share an XSLT Fiddle link but it wouldn't work. Below is the XML file, followed by a shortened version of the XSLT. Note that I am fairly confident the XSLT file is not perfectly configured, as I started building it with no prior knowledge and added things over time. If you have suggestions for removing redundancies, or best practices that I should be using, I would appreciate those as well.
XML input:
<?xml version="1.0" encoding="UTF-8"?>
<dataroot xmlns:od="urn:schemas-microsoft-com:officedata" generated="2024-03-05T13:21:03">
<TEQuery>
<Description>
<ul>
<li><font face="Times New Roman" color=black>The proposed facility is intended to operate for at least 10 years. </font></li>
</ul>
<div><font face="Times New Roman" color=black>&lt;list2&gt;</font></div>
<ul>
<li><font face="Times New Roman" color=black>between July 1, 2011, and October 5, 2017, receive compensation (wages and benefits) at least 50 percent higher than the per capita personal income (PCPI) for the county at the time of the application, or at least equal to county PCPI while providing health insurance benefits, or,</font></li>
<li><font face="Times New Roman" color=black>since October 6, 2017 (HB 2066, 2017):</font></li>
</ul>
<div><font face="Times New Roman" color=black>&lt;/list2&gt;</font></div>
<div><font face="Times New Roman" color=black>&lt;list3&gt;</font></div>
<div><font face="Times New Roman" color=black>(i) receive compensation meeting the above minimums or that is at least 30 percent more than county PCPI for locations outside any metropolitan statistical area, and</font></div>
<div><font face="Times New Roman" color=black>(ii) (in all cases) receive an average annual wage at least equal to the then current average wage for the county. </font></div>
<div><font face="Times New Roman" color=black>&lt;/list3&gt;</font></div>
</Description>
</TEQuery>
</dataroot>
XSLT
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:dc="data:,dpc"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:mf="http://example.com/mf"
exclude-result-prefixes="#all">
<xsl:output method="xml" omit-xml-declaration="no" encoding="UTF-8" indent="yes" />
<xsl:import href="https://raw.githubusercontent.com/davidcarlisle/web-xslt/main/htmlparse/htmlparse.xsl"/>
<xsl:strip-space elements="* except TitleTab"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match=
"*[not(node())]
|
*[not(node()[2])
and
node()/self::text()
and
not(normalize-space())
]
"/>
<xsl:template match="*"> <!--this is here to remove namespace prefixes that were propagating weirdly all over-->
<!-- remove element prefix -->
<xsl:element name="{local-name()}">
<!-- process attributes -->
<xsl:for-each select="@*">
<!-- remove attribute prefix -->
<xsl:attribute name="{local-name()}">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:for-each>
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
<!-- may want to add tab into each list item. If so, I can do so here-->
<xsl:template match="li">
<li>
<paragraph>
<xsl:apply-templates/>
</paragraph>
</li>
</xsl:template>
<!-- removes some weird formatting that comes out of the rich text format fields. NEEDS FIXED TO NOT REMOVE NESTED LISTS!!! -->
<xsl:template match="font">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="div">
<paragraph>
<xsl:apply-templates
select="node()[boolean(normalize-space(translate(., ' ', ' ')))]
|@*"/>
</paragraph>
</xsl:template>
<!-- converts weird rtf formatting to a list format that FrameMaker can work with NEED to allow nested lists, still need to figure out how-->
<xsl:template match="ul">
<xsl:choose>
<xsl:when test="child::*[1][self::ul]">
<xsl:apply-templates/>
</xsl:when>
<xsl:otherwise>
<unorderedList>
<xsl:apply-templates/>
</unorderedList>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<!--does same for ordered lists-->
<xsl:template match="ol">
<xsl:choose>
<xsl:when test="child::*[1][self::ol]">
<xsl:apply-templates/>
</xsl:when>
<xsl:otherwise>
<orderedList>
<xsl:apply-templates/>
</orderedList>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:function name="mf:wrap-lists" as="node()*"> <!-- function to help wrap the list items to create second and third order lists-->
<xsl:param name="nodes" as="node()*"/>
<xsl:param name="list-level" as="xs:integer"/>
<xsl:for-each-group select="$nodes" group-starting-with="div[*[1][self::font[. = '&lt;' || 'list' || $list-level || '>']]]">
<xsl:choose>
<xsl:when test="self::div[*[1][self::font[. = '&lt;' || 'list' || $list-level || '>']]]">
<xsl:for-each-group select="tail(current-group())" group-ending-with="div[*[1][self::font[. = '&lt;/' || 'list' || $list-level || '>']]]">
<xsl:choose>
<xsl:when test="current-group()[last()][self::div[*[1][self::font[. = '&lt;/' || 'list' || $list-level || '>']]]]">
<xsl:element name="list{$list-level}">
<xsl:apply-templates select="current-group()[position() ne last()]"/>
</xsl:element>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="current-group()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="current-group()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:function>
<!-- This template does (optionally recursively) what you need without the need of matching specific elements. This parses the RTF field that is similar to HTML, allowing us to manually create nodes within other nodes. For example lists or graphics.-->
<xsl:template match="Description">
<xsl:copy>
<xsl:sequence select="fold-left(2 to 3, dc:htmlparse(., '', false())!node(), function($n, $c) { mf:wrap-lists($n, $c) })"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
This gives the output:
<Description xmlns:od="urn:schemas-microsoft-com:officedata">
<unorderedList>
<li>
<paragraph>
<paragraph>The proposed facility is intended to operate for at least 10 years. </paragraph>
</paragraph>
</li>
</unorderedList>
<list2>
<unorderedList>
<li>
<paragraph>
<paragraph>between July 1, 2011, and October 5, 2017, receive compensation (wages and benefits) at least 50 percent higher than the per capita personal income (PCPI) for the county at the time of the application, or at least equal to county PCPI while providing health insurance benefits, or,</paragraph>
</paragraph>
</li>
<li>
<paragraph>
<paragraph>since October 6, 2017 (HB 2066, 2017):</paragraph>
</paragraph>
</li>
</unorderedList>
</list2>
<paragraph>&lt;list3&gt;</paragraph>
<paragraph>(i) receive compensation meeting the above minimums or that is at least 30 percent more than county PCPI for locations outside any metropolitan statistical area, and</paragraph>
<paragraph>(ii) (in all cases) receive an average annual wage at least equal to the then current average wage for the county. </paragraph>
<paragraph>&lt;/list3&gt;</paragraph>
</Description>
I wonder whether you should use the htmlparse function once to convert e.g. the content of a Description element into element nodes, some of which are divs with font children holding the escaped <listX> start and </listX> end tags.
If the input is that simple and regular, I would try to use a nested for-each-group with group-starting-with/group-ending-with to transform the escaped list start/end tags into listX wrapper elements.
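To illustrate the grouping idea in isolation, here is a minimal sketch on a toy input. The element names root, p, start, end and wrapper are hypothetical stand-ins for the real div/font marker elements, and the sketch assumes start and end markers are siblings and are not nested.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">

  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Hypothetical input:
       <root><p>a</p><start/><p>b</p><p>c</p><end/><p>d</p></root> -->
  <xsl:template match="root">
    <xsl:copy>
      <!-- Outer grouping: a new group begins at every start marker -->
      <xsl:for-each-group select="*" group-starting-with="start">
        <xsl:choose>
          <xsl:when test="self::start">
            <!-- Inner grouping over the items after the marker:
                 a group ends at every end marker -->
            <xsl:for-each-group select="tail(current-group())" group-ending-with="end">
              <xsl:choose>
                <!-- Group closed by an end marker: wrap everything but the marker -->
                <xsl:when test="current-group()[last()][self::end]">
                  <wrapper>
                    <xsl:apply-templates select="current-group()[position() ne last()]"/>
                  </wrapper>
                </xsl:when>
                <!-- Trailing items after the end marker: pass through unchanged -->
                <xsl:otherwise>
                  <xsl:apply-templates select="current-group()"/>
                </xsl:otherwise>
              </xsl:choose>
            </xsl:for-each-group>
          </xsl:when>
          <!-- Items before the first start marker: pass through unchanged -->
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

For the toy input in the comment, this should leave a and d where they are and enclose b and c in a single wrapper element, with both markers dropped.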
So for instance, to take part of your sample input, you could process e.g.
<Description>
<ul>
<li><font face="Times New Roman" color=black>The proposed facility is intended to operate for at least 10 years. </font></li>
</ul>
<div><font face="Times New Roman" color=black>&lt;list2&gt;</font></div>
<ul>
<li><font face="Times New Roman" color=black>between July 1, 2011, and October 5, 2017, receive compensation (wages and benefits) at least 50 percent higher than the per capita personal income (PCPI) for the county at the time of the application, or at least equal to county PCPI while providing health insurance benefits, or,</font></li>
<li><font face="Times New Roman" color=black>since October 6, 2017 (HB 2066, 2017):</font></li>
</ul>
<div><font face="Times New Roman" color=black>&lt;/list2&gt;</font></div>
<div><font face="Times New Roman" color=black>&lt;list3&gt;</font></div>
<div><font face="Times New Roman" color=black>(i) receive compensation meeting the above minimums or that is at least 30 percent more than county PCPI for locations outside any metropolitan statistical area, and</font></div>
<div><font face="Times New Roman" color=black>(ii) (in all cases) receive an average annual wage at least equal to the then current average wage for the county. </font></div>
<div><font face="Times New Roman" color=black>&lt;/list3&gt;</font></div>
</Description>
with code like
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="3.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:dc="data:,dpc"
exclude-result-prefixes="#all"
xmlns:mf="http://example.com/mf"
expand-text="yes">
<xsl:function name="mf:wrap-lists" as="node()*">
<xsl:param name="nodes" as="node()*"/>
<xsl:param name="list-level" as="xs:integer"/>
<xsl:for-each-group select="$nodes" group-starting-with="div[*[1][self::font[. = '&lt;' || 'list' || $list-level || '>']]]">
<xsl:choose>
<xsl:when test="self::div[*[1][self::font[. = '&lt;' || 'list' || $list-level || '>']]]">
<xsl:for-each-group select="tail(current-group())" group-ending-with="div[*[1][self::font[. = '&lt;/' || 'list' || $list-level || '>']]]">
<xsl:choose>
<xsl:when test="current-group()[last()][self::div[*[1][self::font[. = '&lt;/' || 'list' || $list-level || '>']]]]">
<xsl:element name="list{$list-level}">
<xsl:apply-templates select="current-group()[position() ne last()]"/>
</xsl:element>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="current-group()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="current-group()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:function>
<xsl:import href="https://raw.githubusercontent.com/davidcarlisle/web-xslt/main/htmlparse/htmlparse.xsl"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="Description">
<xsl:copy>
<xsl:sequence select="fold-left(2 to 3, dc:htmlparse(., '', false())!node(), function($n, $c) { mf:wrap-lists($n, $c) })"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
and would get e.g.
<Description>
<ul>
<li><font face="Times New Roman" color="black">The proposed facility is intended to operate for at least 10 years. </font></li>
</ul>
<list2>
<ul>
<li><font face="Times New Roman" color="black">between July 1, 2011, and October 5, 2017, receive compensation (wages and benefits) at least 50 percent higher than the per capita personal income (PCPI) for the county at the time of the application, or at least equal to county PCPI while providing health insurance benefits, or,</font></li>
<li><font face="Times New Roman" color="black">since October 6, 2017 (HB 2066, 2017):</font></li>
</ul>
</list2>
<list3>
<div><font face="Times New Roman" color="black">(i) receive compensation meeting the above minimums or that is at least 30 percent more than county PCPI for locations outside any metropolitan statistical area, and</font></div>
<div><font face="Times New Roman" color="black">(ii) (in all cases) receive an average annual wage at least equal to the then current average wage for the county. </font></div>
</list3>
</Description>
I hope that helps with solving part of the "unwrap/enclose" problem.
As for your comment that the code breaks if you just drop it into your existing code: yes, that can easily happen. Use modes to separate the processing steps. For instance, to ensure that the contents of Description are first run through my suggested function, use a different mode, e.g.
<xsl:mode name="wrap" on-no-match="shallow-copy"/>
<xsl:function name="mf:wrap-lists" as="node()*"> <!-- function to help wrap the list items to create second and third order lists-->
<xsl:param name="nodes" as="node()*"/>
<xsl:param name="list-level" as="xs:integer"/>
<xsl:for-each-group select="$nodes" group-starting-with="div[*[1][self::font[. = '&lt;' || 'list' || $list-level || '>']]]">
<xsl:choose>
<xsl:when test="self::div[*[1][self::font[. = '&lt;' || 'list' || $list-level || '>']]]">
<xsl:for-each-group select="tail(current-group())" group-ending-with="div[*[1][self::font[. = '&lt;/' || 'list' || $list-level || '>']]]">
<xsl:choose>
<xsl:when test="current-group()[last()][self::div[*[1][self::font[. = '&lt;/' || 'list' || $list-level || '>']]]]">
<xsl:element name="list{$list-level}">
<xsl:apply-templates select="current-group()[position() ne last()]" mode="wrap"/>
</xsl:element>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="current-group()" mode="wrap"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="current-group()" mode="wrap"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:function>
<xsl:template match="Description">
<xsl:copy>
<xsl:apply-templates select="fold-left(2 to 3, dc:htmlparse(., '', false())!node(), function($n, $c) { mf:wrap-lists($n, $c) })"/>
</xsl:copy>
</xsl:template>
That way, the result for the shorter input sample in your edit, combined with the other code from your XSLT sample, is
<dataroot generated="2024-03-05T13:21:03">
<TEQuery>
<Description xmlns:od="urn:schemas-microsoft-com:officedata">
<unorderedList>
<li>
<paragraph>The proposed facility is intended to operate for at least 10 years. </paragraph>
</li>
</unorderedList>
<list2>
<unorderedList>
<li>
<paragraph>between July 1, 2011, and October 5, 2017, receive compensation (wages and benefits) at least 50 percent higher than the per capita personal income (PCPI) for the county at the time of the application, or at least equal to county PCPI while providing health insurance benefits, or,</paragraph>
</li>
<li>
<paragraph>since October 6, 2017 (HB 2066, 2017):</paragraph>
</li>
</unorderedList>
</list2>
<list3>
<paragraph>(i) receive compensation meeting the above minimums or that is at least 30 percent more than county PCPI for locations outside any metropolitan statistical area, and</paragraph>
<paragraph>(ii) (in all cases) receive an average annual wage at least equal to the then current average wage for the county. </paragraph>
</list3>
</Description>
</TEQuery>
</dataroot>