
What are some methods for converting RTF text nodes in XML to text using XSLT 2 / Saxon-HE?


I have a large XML dataset that needs to be parsed and converted to CSV. One of the elements in the XML holds a procedure: a series of steps. The steps were originally entered in a formatted screen where RTF markup allowed bulleted lists, font changes, and so on. When the data was exported from the database into my source XML, these formatted instructions came through as raw RTF codes in the XML, like this:

<SPECORMETHOD>{\rtf1\ansi\deff0\uc1\ansicpg1252\deftab720{\fonttbl{\f0\fnil\fcharset1 Arial;}{\f1\fnil\fcharset1 Garamond;}{\f2\fnil\fcharset0 Garamond;}{\f3\fnil\fcharset1 WingDings;}}{\colortbl\red0\green0\blue0;\red255\green0\blue0;\red0\green128\blue0;\red0\green0\blue255;\red255\green255\blue0;\red255\green0\blue255;\red128\green0\blue128;\red128\green0\blue0;\red0\green255\blue0;\red0\green255\blue255;\red0\green128\blue128;\red0\green0\blue128;\red255\green255\blue255;\red192\green192\blue192;\red128\green128\blue128;\red0\green0\blue0;}\wpprheadfoot1\paperw12240\paperh15840\margl720\margr720\margt720\margb720\headery720\footery720\endnhere\sectdefaultcl{\*\generator WPTools_5.17;}{\*\listtable{\list\listtemplateid1\listsimple{\listlevel\leveljc0\levelfollow0\levelstartat1\levelspace0\levelindent360\levelnfc0{\leveltext\'02\'00.;}{\levelnumbers\'01;}}\listid1}}{\*\listoverridetable{\listoverride\listid1\listoverridecount0\ls1}}{\ls1\ilvl0{\listtext 1.\tab}\li400\fi-400\plain\f2\fs26 Procedure Step 1.\par{\listtext\fs26 2.\tab}\plain\f2\fs26 Procedure Step 2.\par{\listtext\fs26 3.\tab}\plain\f2\fs26 Procedure Step 3.\par{\listtext\fs26 4.\tab}\plain\f2\fs26 Procedure Step 4.\par{\listtext\fs26 5.\tab}\plain\f2\fs26 Procedure Step 5.\par{\listtext\fs26 6.\tab}\plain\f2\fs26 Procedure Step 6.\par\pard\plain\plain\f2\fs26\par\plain\f2\fs26 Entry dated 02-07-2023\par}}</SPECORMETHOD>

If I save this content as an RTF file, open it in a word processor, and save it as plain text, I get the desired result:

1. Procedure Step 1.
2. Procedure Step 2.
3. Procedure Step 3. 
4. Procedure Step 4.
5. Procedure Step 5.
6. Procedure Step 6.
Entry dated 02-07-2023

However, I would prefer to do this dynamically in the XSLT flow, since there are tens of thousands of instances of procedures within the XML structure. If I separate them into files, I'd have to re-link them back into their correct position in the XML with extra steps (which is fine if I need to but seems inefficient).

I've tried:

  1. Intense pattern matching in XSLT using regular expressions. This gets me part of the way there, but variations between authors and formatting make it time-consuming and difficult.
  2. The Java Swing RTFEditorKit. I have not done any Java/XSLT integration before. I followed some examples from other questions, but I get "Reflexive calls to Java methods are not available under Saxon-HE", which indicates I need the PE edition. Getting Saxon-PE is not a problem if this approach works, but I am unsure whether it does. I'm looking for experience with this.
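For reference, the RTFEditorKit route from item 2 can be tried on its own, outside XSLT, using only the JDK. The following is a minimal sketch rather than a tested solution for this dataset: the class name `RtfText` and the method `toPlainText` are made up for illustration, and the sketch does not establish how well the kit copes with the list markup in the sample above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper; the class and method names are illustrative only.
final class RtfText {

    /** Parses an RTF string with the JDK's RTFEditorKit and returns the document text. */
    static String toPlainText(String rtf) {
        RTFEditorKit kit = new RTFEditorKit();
        DefaultStyledDocument doc = new DefaultStyledDocument();
        // RTF is a 7-bit format; non-ASCII characters are normally escaped (e.g. \'e9),
        // so a single-byte encoding is enough to hand the markup to the parser.
        byte[] bytes = rtf.getBytes(StandardCharsets.ISO_8859_1);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            kit.read(in, doc, 0);                   // fill the styled document from the RTF
            return doc.getText(0, doc.getLength()); // drop the styling, keep the characters
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String rtf = "{\\rtf1\\ansi Hello\\par World\\par}";
        System.out.println(toPlainText(rtf));
    }
}
```

Calling a method like this from a stylesheet under Saxon-HE would still require registering it as an extension function, since reflexive Java calls are what the error message above rules out.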

I'm using XML 1.1, XSLT 2.0 via saxon-he-11.3 on Java 17.0.4.1, all through Eclipse IDE 2022-12 (4.26.0).

At the end of the day, I am looking for suggestions on how best to approach this mass conversion of RTF to text within an XML element during XSLT processing.

Thanks, Michael


Solution

  • I found Apache Tika as a converter from RTF to XHTML (https://tika.apache.org/2.7.0/examples.html#Parsing_to_XHTML) and managed to wire it into Saxon 11 HE as an integrated extension function. The function takes the RTF string as input and returns an XdmNode, so in XSLT/XPath you can process the result like a normal input tree:

    package org.example;
    
    import net.sf.saxon.s9api.*;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.sax.ToXMLContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;
    
    import java.io.*;
    import java.net.URI;
    import java.net.URISyntaxException;
    
    import org.apache.tika.parser.AutoDetectParser;
    
    import javax.xml.transform.stream.StreamSource;
    
    public class Main {
        public static void main(String[] args) throws SaxonApiException {
            Processor processor = new Processor(false);
    
            processor.registerExtensionFunction(new ExtensionFunction() {
                @Override
                public QName getName() {
                    return new QName("http://example.com/mf/tika", "parse-rtf");
                }
    
                @Override
                public SequenceType getResultType() {
                    return SequenceType.makeSequenceType(
                            ItemType.ANY_NODE, OccurrenceIndicator.ONE
                    );
                }
                @Override
                public SequenceType[] getArgumentTypes() {
                    return new SequenceType[]{
                            SequenceType.makeSequenceType(
                                    ItemType.STRING, OccurrenceIndicator.ONE)};
                }
    
                @Override
                public XdmValue call(XdmValue[] xdmValues) throws SaxonApiException {
                    try {
                        return parseRtfToHTML(xdmValues[0].itemAt(0).getStringValue(), processor);
                    } catch (IOException | URISyntaxException | SAXException | TikaException e) {
                        throw new SaxonApiException(e);
                    }
                }
            });
    
            XsltCompiler xsltCompiler = processor.newXsltCompiler();
    
            Xslt30Transformer xslt30Transformer = xsltCompiler.compile(new StreamSource(new File("sheet1.xsl"))).load30();
    
            XdmValue result = xslt30Transformer.applyTemplates(new StreamSource(new File("sample1.xml")));
    
            System.out.println(result);
        }
    
        // Parses the RTF string with Tika and builds a Saxon node tree from the resulting XHTML.
        public static XdmNode parseRtfToHTML(String rtf, Processor processor) throws IOException, SAXException, TikaException, URISyntaxException, SaxonApiException {
            DocumentBuilder docBuilder = processor.newDocumentBuilder();
            docBuilder.setBaseURI(new URI("urn:from-string"));
    
            // ToXMLContentHandler serializes Tika's SAX events as XHTML markup; toString() returns that markup.
            ContentHandler handler = new ToXMLContentHandler();
    
            AutoDetectParser parser = new AutoDetectParser();
            Metadata metadata = new Metadata();
            try (InputStream stream = new ByteArrayInputStream(rtf.getBytes(java.nio.charset.StandardCharsets.UTF_8))) {
                parser.parse(stream, handler, metadata);
                // Let a SaxonApiException propagate so Saxon reports it as a dynamic error.
                return docBuilder.build(new StreamSource(new StringReader(handler.toString())));
            }
        }
    }
    

    POM dependencies:

    <dependencies>
        <dependency>
            <groupId>net.sf.saxon</groupId>
            <artifactId>Saxon-HE</artifactId>
            <version>11.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>2.7.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers-standard-package</artifactId>
            <version>2.7.0</version>
        </dependency>
    </dependencies>
    

    With a sample like the one in your question and a stylesheet as follows

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    version="3.0"
                    xmlns:xs="http://www.w3.org/2001/XMLSchema"
                    xmlns:tika="http://example.com/mf/tika"
                    exclude-result-prefixes="#all"
                    expand-text="yes">
    
        <xsl:template match="SPECORMETHOD">
            <rtf-as-xhtml>
                <xsl:sequence select="tika:parse-rtf(.)"/>
            </rtf-as-xhtml>
        </xsl:template>
    
        <xsl:mode on-no-match="shallow-copy"/>
    
        <xsl:output indent="yes"/>
    
        <xsl:template match="/" name="xsl:initial-template">
            <xsl:next-match/>
            <xsl:comment>Run with {system-property('xsl:product-name')} {system-property('xsl:product-version')} {system-property('Q{http://saxon.sf.net/}platform')}</xsl:comment>
        </xsl:template>
    
    </xsl:stylesheet>
    

    the output is, for example:

    <rtf-as-xhtml><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
    <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.rtf.RTFParser"/>
    <meta name="Content-Type" content="application/rtf"/>
    <title/>
    </head>
    <body><p>Procedure Step 1.</p>
    <p>Procedure Step 2.</p>
    <p>Procedure Step 3.</p>
    <p>Procedure Step 4.</p>
    <p>Procedure Step 5.</p>
    <p>Procedure Step 6.</p>
    <p/>
    <p>Entry dated 02-07-2023</p>
    <p/>
    </body></html></rtf-as-xhtml>
    <!--Run with SAXON HE 11.4 -->
    

    In this simple demo I have made no effort to further process the XHTML that the extension function returns from Tika, but you can of course use the full XSLT 3.0/XPath 3.1 support in Saxon 11 to select from it or transform it further.
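
If plain text for a CSV cell is the end goal, the XHTML string produced by Tika could also be flattened before it reaches the stylesheet. The sketch below is one hedged way to do that with the JDK's built-in DOM and XPath APIs. `XhtmlParagraphs` and `toPlainText` are hypothetical names, and the code assumes the markup has the `html/body/p` shape shown in the sample output above. The same selection could equally be written in XSLT.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; the class and method names are illustrative only.
final class XhtmlParagraphs {

    /** Returns the non-empty body paragraphs of an XHTML string, one per line. */
    static String toPlainText(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so local-name() sees the XHTML element names
            Document dom = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Match on local names so no namespace context has to be configured.
            NodeList paragraphs = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//*[local-name()='body']/*[local-name()='p']",
                    dom, XPathConstants.NODESET);

            List<String> lines = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                String text = paragraphs.item(i).getTextContent().trim();
                if (!text.isEmpty()) { // skip the empty <p/> separators
                    lines.add(text);
                }
            }
            return String.join("\n", lines);
        } catch (ParserConfigurationException | SAXException | IOException
                 | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not read XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head>"
                + "<body><p>Procedure Step 1.</p>\n<p/>\n<p>Entry dated 02-07-2023</p>\n<p/>\n</body></html>";
        System.out.println(toPlainText(xhtml));
    }
}
```

Note that the sample output above contains no list numbers, so if the "1.", "2." prefixes matter, that numbering would have to be reconstructed separately.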