Tags: linux, pdf, hyperlink, utility, verify

How to verify links in a PDF file


I have a PDF file and I want to verify whether the links in it are proper. Proper in the sense that all the URLs specified link to web pages and nothing is broken. I am looking for a simple utility or a script which can do this easily.

Example:

$ testlinks my.pdf
There are 2348 links in this pdf.
2322 links are proper.
Remaining broken links and the page numbers on which they appear are logged in brokenlinks.txt

I have no idea whether something like that exists, so I googled and searched on Stack Overflow as well, but did not find anything useful yet. I would like to know if anyone has any ideas about it.

Update: edited to make the question clearer.


Solution

  • I suggest first using the Linux command-line utility 'pdftotext' - you can find the man page:

    pdftotext man page

    The utility is part of the Xpdf collection of PDF processing tools, available on most Linux distributions. See http://foolabs.com/xpdf/download.html.

    Once installed, you could process the PDF file through pdftotext:

    pdftotext file.pdf file.txt
    

    Once processed, write a simple Perl script that searches the resulting text file for http URLs and retrieves them using LWP::Simple. The get() function from LWP::Simple will let you validate a URL with a code snippet such as:

    use LWP::Simple;

    # get() returns the page content, or undef if the request fails
    my $content = get("http://www.sn.no/");
    die "Couldn't get it!" unless defined $content;
    
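    If you also want to know why a link failed (a 404 versus a timeout, say), a variant using LWP::UserAgent instead of LWP::Simple exposes the HTTP status line. A minimal sketch, assuming a 10-second timeout is acceptable:

    use LWP::UserAgent;

    # HEAD is usually enough to check a link without downloading the body;
    # some servers reject HEAD, in which case falling back to get() helps
    my $ua  = LWP::UserAgent->new(timeout => 10);
    my $res = $ua->head("http://www.sn.no/");
    print "Broken: ", $res->status_line, "\n" unless $res->is_success;
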

    Either approach should accomplish what you want, I think. There are plenty of resources on how to write regular expressions to match http URLs, but a very simple one would look like this:

    m/http[^\s]+/i
    

    "http followed by one or more not-space characters" - assuming the URLs are property URL encoded.