java html xml xslt cdata

How to wrap HTML content in CData (Java) for XSLT - XML to HTML


I'm struggling to wrap HTML content in a CDATA section using Java. The ultimate goal is to transform XML to HTML via XSLT, and CDATA is a requirement. I want the XSLT to treat the HTML as opaque text, but I'm obviously doing something wrong, since the HTML is not being preserved.

<?xml version="1.0" encoding="utf-8" ?>

<content>
    <records>
        <record>
            <param1>1</param1>
            <param2>25</param2>
            <param3>34</param3>
            <param4>b</param4>
            <param5>
                <p>this is html that should be wrapped with CData including the p tags.</p>
            </param5>
        </record>
    </records>
</content>

Here is the code:

DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse("test.xml");

doc.getDocumentElement().normalize();

Element param5 = (Element)doc.getElementsByTagName("param5").item(0);
CDATASection cdata = doc.createCDATASection(param5.getTextContent());
param5.appendChild(cdata);

DOMResult domResult = new DOMResult();

transform.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "param5");
transform.transform(new DOMSource(doc) , domResult);

So, just before the transformation, param5 in the XML looks like this:

<param5> 
    <![CDATA[
        this is html that should be wrapped with CData including the p tags.
    ]]>
</param5>

When I want

<param5> 
    <![CDATA[
        <p>this is html that should be wrapped with CData including the p tags.</p>
    ]]>
</param5>

I am lost as to what I'm doing wrong here.

Any help would be most appreciated. Thank you.

The XSL is very simple:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <html>
            <body>
                <h1><xsl:value-of select="content/records/record/param5"/></h1>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>

Here is the sample HTML output that I need:

<html>
    <body>
        <h1>
            <p>this is html that should be wrapped with CData including the p tags.</p>
        </h1>
    </body>
</html>

I'm trying not to overcomplicate things. The basic problem is that I want the CDATA section to include both the HTML content and the HTML tags, but getTextContent() drops the p tags. If there were a method that could grab everything inside param5 as markup, I'd be set.


Solution

  • If you want to create a CDATA section containing the markup of DOM nodes, you first need to serialize those nodes, which in Java can be done with either a default transformer or the DOM Load/Save API. I would create a document fragment, move all child nodes of the param element into it with appendChild, and then serialize the fragment to a string. After that you can use your own code to create a CDATA section from that string and append it to the param element.

    Here is a simple example. The imports needed are:

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.DocumentFragment;
    
    
    import org.w3c.dom.ls.DOMImplementationLS;
    import org.w3c.dom.ls.LSSerializer;
    

    The code to read in the document and find the element is as you posted. A DocumentFragment collects all the child nodes as they are removed from the element:

            DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
            docFactory.setNamespaceAware(true);

            DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
            Document doc = docBuilder.parse("sample1.xml");

            // Fragment that temporarily holds the children of param5
            DocumentFragment frag1 = doc.createDocumentFragment();

            Element param = (Element)doc.getElementsByTagName("param5").item(0);

            // appendChild moves each node out of param, so the loop ends once param is empty
            while (param.hasChildNodes())
            {
                frag1.appendChild(param.getFirstChild());
            }
    

    The fragment is then serialized with the writeToString method of LSSerializer:

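            // Obtain the DOM Load/Save implementation from the document
            DOMImplementationLS lsImp = (DOMImplementationLS)doc.getImplementation();

            LSSerializer ser = lsImp.createLSSerializer();
            // Leave the <?xml ...?> declaration out of the serialized markup
            ser.getDomConfig().setParameter("xml-declaration", false);

            // Serialize the fragment, i.e. the former children of param5
            String xml = ser.writeToString(frag1);

            System.out.println(xml);

            // Put the serialized markup back into param5 as a CDATA section
            param.appendChild(doc.createCDATASection(xml));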
    
            System.out.println(ser.writeToString(doc));
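The start of this answer mentions a default transformer as the other way to serialize the nodes. A minimal sketch of that route follows, assuming the identity Transformer that ships with the JDK. The class TransformerCdataSketch and its method names are made up for illustration and are not part of the code above. Each child node is serialized on its own, so no DocumentFragment is needed in this variant.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative class name, not taken from the original answer.
class TransformerCdataSketch {

    // Serializes every child of 'parent' to markup, removes the children,
    // and appends one CDATA section holding the collected markup.
    static void wrapChildrenInCdata(Element parent) throws Exception {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter markup = new StringWriter();
        while (parent.hasChildNodes()) {
            Node child = parent.getFirstChild();
            identity.transform(new DOMSource(child), new StreamResult(markup));
            parent.removeChild(child);
        }
        Document doc = parent.getOwnerDocument();
        parent.appendChild(doc.createCDATASection(markup.toString()));
    }

    // Hypothetical helper: parses 'xml', wraps the children of the first
    // element named 'elementName', and returns the text of the new CDATA node.
    static String wrap(String xml, String elementName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element target = (Element) doc.getElementsByTagName(elementName).item(0);
        wrapChildrenInCdata(target);
        return target.getFirstChild().getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<record><param5><p>this is html</p></param5></record>";
        System.out.println(wrap(xml, "param5"));
    }
}
```

Whether this or the LSSerializer version is preferable is mostly a matter of which API is already in use in the surrounding code.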
    

    After the CDATA section is appended, the document looks like this:

    <content>
        <records>
            <record>
                <param1>1</param1>
                <param2>25</param2>
                <param3>34</param3>
                <param4>b</param4>
                <param5><![CDATA[
                    <p>this is html that should be wrapped with CData including the p tags.</p>
                ]]></param5>
            </record>
        </records>
    </content>
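Once param5 contains a CDATA section, the stylesheet sees its content as ordinary text, so a plain xsl:value-of writes the angle brackets escaped (&lt;p&gt;) rather than as tags. One possible way to get the markup through is the optional disable-output-escaping attribute, sketched below. This goes beyond what the answer itself covers. It assumes the XSLT processor supports that attribute (the one bundled with the JDK is assumed to) and that the result is written to a stream rather than to a DOMResult as in the question, since the attribute only affects serialization. The class name CdataToHtmlSketch and the inlined copy of the question's stylesheet are for illustration only.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative class name, not taken from the question or the answer.
class CdataToHtmlSketch {

    // The question's stylesheet, with disable-output-escaping added to value-of.
    static final String XSL =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\"/>"
        + "<xsl:template match=\"/\">"
        + "<html><body><h1>"
        + "<xsl:value-of select=\"content/records/record/param5\""
        + " disable-output-escaping=\"yes\"/>"
        + "</h1></body></html>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Transforms 'xml' with the stylesheet above and returns the HTML as a string.
    static String toHtml(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        // A stream result makes the transformer serialize the output itself,
        // which is where disable-output-escaping takes effect.
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<content><records><record>"
                   + "<param5><![CDATA[<p>this is html</p>]]></param5>"
                   + "</record></records></content>";
        System.out.println(toHtml(xml));
    }
}
```

Because the CDATA content is copied to the output verbatim, it has to be well-behaved HTML already; nothing in this sketch validates or sanitizes it.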
    

    Someone more at home in the Java world would need to confirm whether the cast DOMImplementationLS lsImp = (DOMImplementationLS)doc.getImplementation(); is reliable or whether you need to use the registry, as shown in http://www.java2s.com/Tutorial/Java/0440__XML/GeneratesaDOMfromscratchWritestheDOMtoaStringusinganLSSerializer.htm.
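For reference, the registry route can be sketched as follows. DOMImplementationRegistry is the standard bootstrap class from org.w3c.dom.bootstrap. The class LsLookupSketch and its method names are illustrative. The instanceof check is just one way to avoid relying blindly on the cast. It makes no claim about which parser implementations pass it.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Illustrative class name, not taken from the original answer.
class LsLookupSketch {

    // Asks the DOM bootstrap registry for an implementation with the "LS" feature.
    // The result may be null if no registered implementation supports it.
    static DOMImplementationLS viaRegistry() throws Exception {
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        return (DOMImplementationLS) registry.getDOMImplementation("LS");
    }

    // Uses the document's own implementation when it supports Load/Save,
    // and falls back to the registry otherwise.
    static DOMImplementationLS lsFor(Document doc) throws Exception {
        DOMImplementation impl = doc.getImplementation();
        if (impl instanceof DOMImplementationLS) {
            return (DOMImplementationLS) impl;
        }
        return viaRegistry();
    }

    // Hypothetical helper: parses 'xml' and serializes it again via LSSerializer.
    static String serialize(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        DOMImplementationLS ls = lsFor(doc);
        if (ls == null) {
            throw new IllegalStateException("No DOM Load/Save implementation found");
        }
        LSSerializer ser = ls.createLSSerializer();
        ser.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);
        return ser.writeToString(doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize("<a><b>text</b></a>"));
    }
}
```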