java itext flying-saucer xmlworker

Generating a PDF from third-party HTML in Java


I'm trying to generate a PDF version of a third-party HTML page (actually an HTM file). This HTML may change in the future and I have absolutely no control over it. All I want to do is convert it to a PDF.

I have already tried two solutions, iText (with XMLWorker) and Flying-Saucer, without success so far.

My problem is that the HTML file is far from well-formed. Examples:

    <link rel=File-List href="040602_inds_files/filelist.xml">

    <meta http-equiv=Content-Type content="text/html; charset=windows-1252">

The first one has no closing tag (iText crashes) and the second one has no double quotes around the 'http-equiv' value (Flying-Saucer crashes).

I have found a lot of posts about this issue, but in all of them the authors are handling their own HTML, so they can fix the markup and try again. I can't do that.

This is the page I'm trying to convert.

Here is my iText convert method:

        public static void convert(PdfWriter writer, Document document, String siteUrl) throws IOException {
            // try-with-resources closes the URL stream even if parsing fails
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(siteUrl).openStream()))) {
                XMLWorkerHelper.getInstance().parseXHtml(writer, document, reader);
            }
        }

And here is my Flying-Saucer convert method:

        public static void convertFS(String siteUrl, String fileName) throws com.lowagie.text.DocumentException, IOException {
            // try-with-resources closes the output file even if rendering fails
            try (OutputStream os = new FileOutputStream(fileName)) {
                ITextRenderer renderer = new ITextRenderer();
                renderer.setDocument(siteUrl);
                renderer.layout();
                renderer.createPDF(os);
            }
        }

Any tips? I'm open to other libraries if they are reasonably usable. Thanks in advance.


Solution

  • First parse the HTML file with jsoup, which tolerates malformed markup, and write its content back out as well-formed (X)HTML. Then use iText to generate the PDF from the cleaned result.
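
The cleaning step might look roughly like the sketch below. It assumes jsoup is on the classpath; the class name `HtmlCleaner` and the method `toXhtml` are made up for illustration and are not part of either library. The sample input reuses the two problematic tags from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

// Hypothetical helper (name chosen for this example only).
public class HtmlCleaner {

    /**
     * Parses lenient HTML with jsoup and serializes it back using XML syntax,
     * so attribute values are quoted and void elements such as link and meta
     * are closed.
     */
    public static String toXhtml(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        doc.outputSettings()
           .syntax(Document.OutputSettings.Syntax.xml)   // emit XML-style markup
           .escapeMode(Entities.EscapeMode.xhtml)        // avoid HTML-only named entities
           .charset("UTF-8");
        return doc.html();
    }

    public static void main(String[] args) {
        String messy =
              "<link rel=File-List href=\"040602_inds_files/filelist.xml\">"
            + "<meta http-equiv=Content-Type content=\"text/html; charset=windows-1252\">"
            + "<p>Hello";
        System.out.println(toXhtml(messy, ""));
    }
}
```

The returned string can then be wrapped in a `java.io.StringReader` and passed to the `parseXHtml(writer, document, reader)` call from the question in place of the raw URL stream. Note that jsoup and iText both have a class named `Document`, so one of them has to be referenced by its fully qualified name when both are used in the same file. Whether the resulting PDF looks acceptable still depends on how much of the page's CSS and layout XMLWorker supports.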