How to output rich text (HTML) field content when outputting to PDF using Apache FOP

I am trying to generate a PDF file using an xAgent and Apache FOP as suggested by Stephen Wissel here: http://www.wissel.net/blog/d6plinks/SHWL-8TNLTV. Most of the process is working fine: the xAgent is called, creates the XML from my document and passes it through the transform to output a PDF. I am just stuck on how to handle the rich text fields. The fields contain user-generated content (created in an XPage) and so contain HTML fragments. Has anyone come up with a good way to output rich text fields along with other content to a PDF?

Rich

Solution

RichText is a [insert something unprintable]. There are a number of considerations:

Do you require RichText in its full client beauty (tabbed tables, OLE, sections, hovers etc.)?
Is the HTML representation of RichText good enough (the one you see when you look at it through a browser, possibly enriched by AppsFidelity)?

In the former case, probably your only avenue is to grab the DXL representation and try to convert that. I played with it; it seems feasible, but it is a long and painful road.

In the latter case, you first need to get your hands on the HTML representation. That can be done using the ?OpenField command or the code snippet by Mark.

Now that you have HTML, you might want to clean it up using jsoup and then convert it to XSL-FO. Some guidance can be found here:

A Developerworks article outlining conversion options, including a sample style sheet
A wiki article in the FOP Wiki, pointing to a style sheet and a tool

Unfortunately not a copy/paste solution, but could be doable. Let us know how it goes, the topic seems of general interest for XPages and Domino.

Update
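To successfully transform HTML you need to convert it into XHTML, i.e. well-formed XML. This roughly works like this:

// Needs jsoup on the classpath; rawHTML is the HTML string read from the field
org.jsoup.nodes.Document hDoc = Jsoup.parse(rawHTML);
// Serialize as XML: void tags get closed (<br />) and entities like &nbsp; become XML-safe
hDoc.outputSettings()
    .syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml)
    .escapeMode(org.jsoup.nodes.Entities.EscapeMode.xhtml);
// outerHtml() keeps the <body> element, so the fragment has a single root
String cleanHTML = hDoc.body().outerHtml();
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
InputSource source = new InputSource(new StringReader(cleanHTML));
DocumentBuilder docb = factory.newDocumentBuilder();
// d is an org.w3c.dom.Document; parse() throws SAXException/IOException
Document d = docb.parse(source);
return d;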

For an XSLT transformation you don't need to build a full DOM document first; a Source wrapping the cleaned-up string (a StreamSource, or an InputSource inside a SAXSource) will do just nicely.

Along these lines...

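   /* Stylesheet most likely would come from a getResourceAsStream */
   public String getFO(String rawHTML, InputStream styleStream) throws TransformerException {
        org.jsoup.nodes.Document hDoc = Jsoup.parse(rawHTML);
        // Emit well-formed XML (closed void tags, XML-safe entities) rather than HTML
        hDoc.outputSettings()
            .syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml)
            .escapeMode(org.jsoup.nodes.Entities.EscapeMode.xhtml);
        // outerHtml() keeps <body> as the single root element of the input
        String cleanHTML = hDoc.body().outerHtml();
        // transform() takes a javax.xml.transform.Source, not a bare InputSource
        StreamSource source = new StreamSource(new StringReader(cleanHTML));
        StreamSource style = new StreamSource(styleStream);
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer(style);
        StreamResult xResult = new StreamResult(new StringWriter());
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.transform(source, xResult);
        String result = xResult.getWriter().toString();
        return result;
   }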

Of course you need to add error handling etc. Let us know how it goes.
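To give an idea of what the stylesheet side of the XHTML to XSL-FO step might look like, here is a minimal sketch that uses only the JDK's built-in XSLT support. The class name HtmlToFoSketch, the method toFo and the inline stylesheet are made up for illustration; the stylesheet only maps body, p, b/strong and i/em, and it assumes the input is already well-formed XHTML with body as its root element. A real stylesheet (such as the ones linked above) covers far more of HTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class HtmlToFoSketch {

    // Hypothetical, deliberately tiny stylesheet (assumption, not a complete mapping):
    // body -> fo:block, p -> fo:block with spacing, b/strong -> bold, i/em -> italic.
    // Elements without a template fall through to XSLT's built-in rules,
    // which drop the tag and keep its text content.
    static final String STYLE =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:template match='body'>"
        + "<fo:block><xsl:apply-templates/></fo:block>"
        + "</xsl:template>"
        + "<xsl:template match='p'>"
        + "<fo:block space-after='6pt'><xsl:apply-templates/></fo:block>"
        + "</xsl:template>"
        + "<xsl:template match='b|strong'>"
        + "<fo:inline font-weight='bold'><xsl:apply-templates/></fo:inline>"
        + "</xsl:template>"
        + "<xsl:template match='i|em'>"
        + "<fo:inline font-style='italic'><xsl:apply-templates/></fo:inline>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Transforms a well-formed XHTML fragment into an XSL-FO fragment.
    static String toFo(String xhtml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLE)));
        // The result is meant to be embedded in a larger FO document,
        // so leave out the XML declaration.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toFo("<body><p>Hello <b>world</b></p></body>"));
    }
}
```

The resulting fragment is a single fo:block, so it can be dropped into the fo:flow of the page that holds the rest of the document's fields. Anything the stylesheet does not know about (tables, lists, images) loses its markup and comes through as plain text, which is why the fuller stylesheets are the better starting point.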