Flying Saucer not recognizing html entities


I'm trying to use an HTML file as a template for a PDF, but Flying Saucer isn't recognizing HTML named entities (&trade;, &nbsp;, etc.). If I replace them with their hex values (numeric character references such as &#x2122;), the program runs fine.

My code is as follows:

  public static InputStream create(String content) throws PDFUtilException {
    try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
      ITextRenderer iTextRenderer = new ITextRenderer();
      iTextRenderer.getSharedContext().setReplacedElementFactory(
          new MediaReplacedElementFactory(
              iTextRenderer.getSharedContext().getReplacedElementFactory()));

      iTextRenderer.setDocumentFromString(closeOutTags(content), null);
      iTextRenderer.layout();
      iTextRenderer.createPDF(baos);
      return new ByteArrayInputStream(baos.toByteArray());
    } catch (IOException | DocumentException e) {
      throw new PDFUtilException("Unable to create PDF", e);
    }
  }

Thanks,

Oliver


Solution

  • Michael is correct in saying that Flying Saucer needs well-formed XML, but if your only problem is the predefined HTML entities (which aren't part of XML), then you can declare them yourself at the beginning of your document like so:

    <!DOCTYPE html [
      <!ENTITY % htmlentities SYSTEM "https://www.w3.org/2003/entities/2007/htmlmathml-f.ent">
      %htmlentities;
    ]>
    <!-- your XHTML text following here -->
    

    This pulls the entity declarations in from their official URL via the htmlentities parameter entity, then references that parameter entity (in effect "executing" the pulled-in declarations). If you only need trade and nbsp, or if Flying Saucer won't allow you to access URLs on the net, you can declare them manually instead:

    <!DOCTYPE html [
      <!ENTITY trade "&#x02122;">
      <!ENTITY nbsp "&#x000A0;">
    ]>
    <!-- your XHTML text following here -->
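
    As a rough sketch of how the manual declarations could be applied from Java, the helper below prepends the DOCTYPE to the template string. The class name EntityPrefix and its methods are hypothetical, and the check uses the JDK's own XML parser rather than Flying Saucer, so it only shows that an XML parser resolves the declared entities. It also assumes the content has no XML declaration or DOCTYPE of its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Flying Saucer.
final class EntityPrefix {

    // Internal DTD subset declaring only the entities the template uses.
    private static final String DOCTYPE =
        "<!DOCTYPE html [\n"
        + "  <!ENTITY trade \"&#x2122;\">\n"
        + "  <!ENTITY nbsp \"&#xA0;\">\n"
        + "]>\n";

    // Prepends the entity declarations.
    // Assumption: content has no DOCTYPE or XML declaration of its own.
    static String withEntityDeclarations(String content) {
        return DOCTYPE + content;
    }

    // Parses with the JDK's default XML parser and returns the
    // text content of the root element, with entities expanded.
    static String parseText(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String content = "<html><body><p>Acme&trade;&nbsp;Inc.</p></body></html>";
        String text = parseText(withEntityDeclarations(content));
        // Print whether the named entities were replaced by the declared characters.
        System.out.println(text.equals("Acme\u2122\u00A0Inc."));
    }
}
```

    In the create method from the question, the equivalent change would presumably be to pass withEntityDeclarations(closeOutTags(content)) to setDocumentFromString, assuming closeOutTags returns well-formed XHTML without a DOCTYPE.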
    

    Now, if you actually have a proper HTML (not XHTML) file, then you won't be able to use an XML processor directly with it, because HTML uses markup features not supported by XML (for example, empty elements such as the img element, omitted tags, and attribute short forms). But you can use an SGML processor to first convert HTML to XHTML (XML), and then use Flying Saucer on the resulting XML file (SGML is the superset of both HTML and XML, and the original markup language on which HTML and XML are based). The process involves using an HTML DTD grammar, such as the original W3C HTML4 DTD (from 1999) or my HTML5 DTD on sgmljs.net, plus an SGML processor. Before going into details, though, first check whether merely adding the entity declarations as described above solves your problem.