java pdf jsoup flying-saucer openhtmltopdf

openhtmltopdf / flying saucer: many links in huge PDF are not clickable (PDF annotations not set)


I generate huge catalogs (~1500 pages) as HTML and convert them to PDF via Jsoup and openhtmltopdf (which is based on Flying Saucer). In the resulting PDF many links are not clickable, and I can't find out why.

Consider the following program:

import com.openhtmltopdf.pdfboxout.PdfRendererBuilder;
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;

import java.io.File;
import java.io.FileOutputStream;

public class Main {

    public static void main(String[] args) throws Exception {

        // Build a test page with 10,000 links that differ only in the query parameter.
        StringBuilder html = new StringBuilder("<html><head></head><body>");
        for (int i = 0; i < 10000; i++) {
            html.append("<a href='http://www.google.de?q=").append(i).append("'>blabla</a>    <br>");
        }
        html.append("</body></html>");

        // Parse with Jsoup and convert to a W3C DOM for openhtmltopdf.
        W3CDom w3cDom = new W3CDom();
        Document w3cDoc = w3cDom.fromJsoup(Jsoup.parse(html.toString()));

        File file = new File("/tmp/tmp.pdf");

        // try-with-resources closes the output stream even if rendering fails.
        try (FileOutputStream fop = new FileOutputStream(file)) {
            PdfRendererBuilder pdfBuilder = new PdfRendererBuilder();
            pdfBuilder.withW3cDocument(w3cDoc, "/");
            pdfBuilder.toStream(fop);
            pdfBuilder.run();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

It creates a PDF with 176 pages and 10,000 links. On pages 1 to 3 the links are clickable; after that they are not, although they are generated identically. The last clickable link is number 112, and in the PDF source I find:

870 0 obj
<<
/W 0.0
/S /S
>>
endobj
871 0 obj
<<
/S /URI
/URI (http://www.google.de?q=111)
>>
endobj
872 0 obj
<<
/W 0.0
/S /S
>>
endobj
873 0 obj
<<
/S /URI
/URI (http://www.google.de?q=112)
>>
endobj
874 0 obj
<<
/W 0.0
/S /S
>>
endobj
875 0 obj
<<
/F1 1049 0 R
>>
endobj

Apparently, after link number 112 no URLs are stored in the annotation objects anymore.

My real program is naturally much more complicated. On the first five or six pages of its output all the links are clickable; after that some are and most are not. Which ones remain clickable seems to be completely random.

Can anyone help here? Any idea what may cause this issue or how to fix it? Is it a bug in openhtmltopdf?

--

edit 1: Using withHtmlContent instead of withW3cDocument has the same problem.


Solution

  • The generated PDF works perfectly with jsoup 1.11.2 and openhtmltopdf-pdfbox-0.0.1-RC11.

    The problem was likely caused by a bug in an older version of openhtmltopdf that has since been fixed.
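
One rough way to check whether an upgrade helps is to count the URI entries in the generated file, instead of reading the PDF source by hand as in the question. The following is only a minimal sketch using the JDK alone: the class name UriActionCounter and its method are hypothetical, and it assumes the action dictionaries are stored uncompressed in the file, as they are in the dump shown in the question. For the test program above, a healthy output should report roughly one entry per link.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of openhtmltopdf or PDFBox.
final class UriActionCounter {

    // Matches uncompressed entries such as: /URI (http://www.google.de?q=112)
    // Assumption: the URL contains no ')' character.
    private static final Pattern URI_ENTRY = Pattern.compile("/URI\\s*\\(([^)]*)\\)");

    // Counts the "/URI (...)" entries found in the raw PDF source text.
    static int countUriActions(String pdfSource) {
        Matcher matcher = URI_ENTRY.matcher(pdfSource);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Path path = Paths.get(args.length > 0 ? args[0] : "/tmp/tmp.pdf");
        if (!Files.exists(path)) {
            System.out.println("File not found: " + path);
            return;
        }
        // ISO-8859-1 maps each byte to exactly one char, so binary stream
        // data in the PDF cannot cause decoding errors.
        String source = new String(Files.readAllBytes(path), StandardCharsets.ISO_8859_1);
        System.out.println("URI entries found: " + countUriActions(source));
    }
}
```

If the reported count is far below the number of links in the HTML, the annotations are still being dropped. A count that matches is a good sign, but clicking a few links on later pages remains the more reliable check.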