I'm trying to use LWP::Simple in Perl to download a number of PDF documents from the United Nations website (Security Council resolutions, etc.). Yet instead of returning PDFs, I am receiving an HTML error page. Consider this very simple example:
use LWP::Simple;
use strict;
my $url = 'https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf';
my $file = 'test.pdf';
getstore($url, $file);
If I then look at the contents of "test.pdf", I find that they are an HTML page.
I have also tried a number of tricks with LWP::UserAgent and even with cURL, but with no success. Any ideas?
Ok, thanks to @SteffenUllrich and @ ikegami for putting me on the right track!
It is indeed a cookie issue. The fix? Open a cookie jar, access the homepage of the site first, then access the PDF once a cookie has been stored in the jar.
This can be done without using HTTP::Cookies. We need to use LWP::UserAgent instead of LWP::Simple, however.
Minimal working example below:
use strict;
use warnings 'all';
use LWP::UserAgent;
my $homeUrl = "https://documents.un.org/prod/ods.nsf/home.xsp";
my $pdfUrl = "https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf";
my $pdfOutputName = "test.pdf";
my $browser = LWP::UserAgent->new( cookie_jar => { } );
my $resp;
$resp = $browser->get( $homeUrl );
die $resp->status_line unless $resp->is_success;
$resp = $browser->get( $pdfUrl, ':content_file' => $pdfOutputName );
die $resp->status_line unless $resp->is_success;
This will produce a complete PDF file.