
How to verify links in a PDF file

I have a PDF file and want to verify that the links in it are proper, in the sense that every URL specified points to a web page and none is broken. I am looking for a simple utility or script that can do this easily.


$ testlinks my.pdf
There are 2348 links in this pdf.
2322 links are proper.
The remaining broken links and the page numbers on which they appear are logged in brokenlinks.txt

I have no idea whether something like that exists, so I googled and searched Stack Overflow as well, but I have not found anything useful yet. I would like to know whether anyone has any idea about it.

Update: edited to make the question clearer.


  • I suggest first using the Linux command-line utility 'pdftotext'; you can find the details in the pdftotext man page.

    The utility is part of the Xpdf collection of PDF processing tools, available on most Linux distributions.

    Once installed, you could process the PDF file through pdftotext:

    pdftotext file.pdf file.txt

    Once the file is processed, a simple Perl script can search the resulting text file for http URLs and retrieve each one using LWP::Simple. Its get('http://...') function returns undef on failure, so you can validate a URL with a code snippet such as:

    use LWP::Simple;
    my $url = shift @ARGV;    # URL to check, taken here from the command line
    my $content = get($url);
    die "Couldn't get $url\n" unless defined $content;

    That would accomplish what you want to do, I think. There are plenty of resources on how to write regular expressions to match http URLs, but a very simple one would look like this:

    m/http[^\s]+/

    That is, "http followed by one or more non-space characters", assuming the URLs are properly URL-encoded.