Unable to download PDFs with Perl and LWP


I'm trying to use LWP::Simple in Perl to download a number of PDF documents from the United Nations website (Security Council resolutions, etc.). But instead of the PDFs, I get back an HTML error page. Consider this very simple example:

use LWP::Simple;
use strict;

my $url = 'https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf';
my $file = 'test.pdf';
getstore($url, $file);

If I then look at the contents of "test.pdf", I find that they are an HTML page.
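
A quick way to confirm this without opening the file by hand is to check its first bytes, since a PDF file begins with the marker `%PDF-`. The following is only a minimal sketch; the helper name `looks_like_pdf` is made up for illustration and is not part of LWP.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): returns 1 if the file at $path
# starts with the PDF marker "%PDF-", and 0 otherwise (including when the
# file cannot be opened or is shorter than five bytes).
sub looks_like_pdf {
    my ($path) = @_;

    open my $fh, '<:raw', $path or return 0;
    my $read = read($fh, my $header, 5);
    close $fh;

    return 0 unless defined($read) && $read == 5;
    return $header eq '%PDF-' ? 1 : 0;
}
```

Calling `looks_like_pdf('test.pdf')` after `getstore` should return 0 when an HTML error page was saved instead of the document.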

I have also tried a number of tricks with LWP::UserAgent and even with cURL, but with no success. Any ideas?


Solution

  • Ok, thanks to @SteffenUllrich and @ikegami for putting me on the right track!

    It is indeed a cookie issue. The fix? Open a cookie jar, access the homepage of the site first, then access the PDF once a cookie has been stored in the jar.

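    This can be done without loading HTTP::Cookies explicitly: passing an empty hash reference as cookie_jar makes LWP::UserAgent create a temporary, in-memory cookie jar for us. We do need to use LWP::UserAgent instead of LWP::Simple, however.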

    Minimal working example below:

    use strict;
    use warnings 'all';

    use LWP::UserAgent;

    my $homeUrl       = "https://documents.un.org/prod/ods.nsf/home.xsp";
    my $pdfUrl        = "https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf";
    my $pdfOutputName = "test.pdf";

    # An empty hash reference enables an in-memory cookie jar.
    my $browser = LWP::UserAgent->new( cookie_jar => { } );

    my $resp;

    # Visit the homepage first so that the site's cookie is stored in the jar.
    $resp = $browser->get( $homeUrl );
    die $resp->status_line unless $resp->is_success;

    # Request the PDF with the cookie in place and write the body straight to disk.
    $resp = $browser->get( $pdfUrl, ':content_file' => $pdfOutputName );
    die $resp->status_line unless $resp->is_success;
    

    This will produce a complete PDF file.
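
    Since the goal was to download a number of documents, the same browser object (and therefore the same cookie jar) can be reused for every PDF after a single visit to the homepage. The sketch below is one possible way to wrap that up; `fetch_pdfs` is a hypothetical helper name, and it assumes it is handed an LWP::UserAgent created with `cookie_jar => { }` as in the example above.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: visit $home_url once so that the cookie lands in the
    # jar, then save every URL in %$targets (url => output file name).
    # $browser is assumed to be an LWP::UserAgent built with cookie_jar => { }.
    # Returns a hash ref mapping each URL to 'ok' or to the failing status line.
    sub fetch_pdfs {
        my ($browser, $home_url, $targets) = @_;

        my $resp = $browser->get($home_url);
        die 'Homepage request failed: ' . $resp->status_line . "\n"
            unless $resp->is_success;

        my %result;
        for my $url (sort keys %$targets) {
            # Same user agent, same cookie jar: the cookie is sent automatically.
            $resp = $browser->get($url, ':content_file' => $targets->{$url});
            $result{$url} = $resp->is_success ? 'ok' : $resp->status_line;
        }
        return \%result;
    }

    # Usage (needs network access and "use LWP::UserAgent;"):
    #   my $browser = LWP::UserAgent->new( cookie_jar => { } );
    #   my $result  = fetch_pdfs($browser, $homeUrl, { $pdfUrl => 'test.pdf' });
    ```

    A failed download is recorded in the result instead of aborting the run, so one bad URL does not stop the remaining files from being fetched.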