Convert malformed HTML to PDF using Flying Saucer PDF Rendering

In a project on GitHub I'm trying to convert an arbitrary HTML string into a PDF. By "convert" I mean parse the HTML and render it into a PDF file.

To achieve that I'm using Flying Saucer PDF Rendering like this:

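import com.lowagie.text.DocumentException; // assumed: package used by the iText 2 based flying-saucer-pdf artifact
import org.xhtmlrenderer.pdf.ITextRenderer;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class Main {

    public static void main(String[] args) {
        final String ok = "<valid html here>: see github rep for real html markup here";
        final String html = "<invalid html here>: see github rep for real html markup here";
        try {
            // final byte[] bytes = generatePDFFrom(ok); // works!
            final byte[] bytes = generatePDFFrom(html); // does NOT work :(
            try (FileOutputStream fos = new FileOutputStream("sample-file.pdf")) {
                fos.write(bytes); // write the rendered PDF to disk
            }
        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }
    }

    private static byte[] generatePDFFrom(String html) throws IOException, DocumentException {
        final ITextRenderer renderer = new ITextRenderer();
        renderer.setDocumentFromString(html); // parsed as XML; this is where malformed input fails
        renderer.layout();
        try (ByteArrayOutputStream fos = new ByteArrayOutputStream(html.length())) {
            renderer.createPDF(fos);
            return fos.toByteArray();
        }
    }
}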

In the code above, if I use the HTML string stored in the ok variable (a "valid" HTML document), the PDF is created correctly: running the GitHub project with ok produces a file sample-file.pdf in the project folder containing the rendered HTML.

If I use the value of the html variable instead (HTML with invalid tags, tags that are not closed properly, etc.), it throws the following error (the exact message varies with the malformed input):

ERROR:  'The markup in the document following the root element must be well-formed.'
Exception in thread "main" org.xhtmlrenderer.util.XRRuntimeException: Can't load the XML resource (using TrAX transformer). org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.transform(
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.createXMLResource(
    at org.xhtmlrenderer.resource.XMLResource.load(
    at org.xhtmlrenderer.pdf.ITextRenderer.setDocumentFromString(
    at org.xhtmlrenderer.pdf.ITextRenderer.setDocumentFromString(
    at Main.generatePDFFrom(
    at Main.main(
Caused by: javax.xml.transform.TransformerException: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.transform(
    ... 6 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    ... 8 more

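As far as I understand, this happens because of the "invalid" parts of the HTML string: the parser behind setDocumentFromString expects well-formed XML, as the SAXParseException in the trace shows.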
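To see what the parser objects to before handing the string to Flying Saucer, one option is to run the same kind of XML well-formedness check with the JDK's own parser. The sketch below is only an illustration: the class name WellFormedCheck and the method firstXmlError are hypothetical, it uses nothing but the standard javax.xml and org.xml.sax packages, and it assumes the input does not rely on DTD-defined entities such as &nbsp;.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class WellFormedCheck {

    /**
     * Hypothetical helper: returns null when the markup parses as XML,
     * otherwise a short description of the first parse error.
     */
    public static String firstXmlError(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Xerces-specific feature (supported by the JDK's built-in parser):
            // do not download an external DTD named in a DOCTYPE.
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow errors instead of letting the default handler print to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignore warnings */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return null; // parsed without errors, so the markup is well-formed
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) {
        String good = "<html><body><p>ok</p></body></html>";
        String bad = "<html><body><p>never closed</body></html>";
        System.out.println("good: " + firstXmlError(good));
        System.out.println("bad:  " + firstXmlError(bad));
    }
}
```

A check like this does not repair anything. It only lets the application decide whether to show the user the position of the first error or to send the input through a clean-up step first.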

Important notes:

  • The values assigned to the variables ok and html here are just placeholders for the question. Real ones are here.
  • In the real project, the HTML string is input that comes from the user. Yes, the user is expected to know what to put there, but they can of course make mistakes in the HTML markup, so I have to handle this.


Questions:

  • Is there any way to tell Flying Saucer PDF Rendering to ignore, auto-complete, or clean up those "invalid" parts itself and carry on creating the PDF file? (preferred)
  • Is there a better approach I can use to overcome this?


Answer:

Since I had the same issue while using Flying Saucer to generate a PDF from HTML, I used the HtmlCleaner library (see maven link) to clean the HTML before passing it to the Flying Saucer library.

    import org.htmlcleaner.CleanerProperties;
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.PrettyXmlSerializer;
    import org.htmlcleaner.TagNode;
    import org.htmlcleaner.XmlSerializer;
    import org.xhtmlrenderer.pdf.ITextRenderer;

    // Clean the HTML before handing it to the Flying Saucer converter
    HtmlCleaner cleaner = new HtmlCleaner();
    // parse the (possibly malformed) HTML into a tag tree and get its root
    TagNode rootTagNode = cleaner.clean(html);
    // set up properties for the serializer (optional, see online docs)
    CleanerProperties cleanerProperties = cleaner.getProperties();
    // serialize the tree back to well-formed XML with getAsString
    XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
    String cleanedHtml = xmlSerializer.getAsString(rootTagNode);
    // use the cleaned HTML as the input for the PDF conversion
    ITextRenderer renderer = new ITextRenderer();
    // ....