I'm trying to programmatically scrape the files from this page: https://olms.dol-esa.gov/query/getYearlyData.do (yes it probably would be faster to download them manually but I want to learn how to do this).
I have the following bit of code to attempt this on one of the files as a test:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get( 'https://olms.dol-esa.gov/query/getYearlyData.do' );
print $mech->uri();
$mech->submit_form( with_fields => { selectedFileName => '/filer/local/cas/YearlyDataDump/2000.zip' } );
When I run the code, nothing happens: nothing gets downloaded. Thinking JavaScript might be the problem, I also tried the same code with WWW::Mechanize::Firefox. Again, nothing happens when I run the code.
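For reference, the WWW::Mechanize::Firefox attempt looked essentially like the following sketch (the module drives a real Firefox instance via the MozRepl extension, so JavaScript runs):

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# same logic as above, but driven through a real browser
my $mech = WWW::Mechanize::Firefox->new;
$mech->get('https://olms.dol-esa.gov/query/getYearlyData.do');
$mech->submit_form(
    with_fields => { selectedFileName => '/filer/local/cas/YearlyDataDump/2000.zip' }
);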
I also don't see the paths to the files in the page source; they're probably obscured in some JavaScript.
So what's the best way to get these files? Is it possible to get them without javascript?
While the comments by ThisSuitIsBlackNot are spot on, there is a rather simple way of doing this programmatically without using JS at all. You don't even need WWW::Mechanize.
I've used Web::Scraper to find all the files. As you said, the form values are there; it's just a matter of scraping them out. WWW::Mechanize is good at navigating, but not very good at scraping. Web::Scraper's interface, on the other hand, is really easy.
Once we have the files, all we need to do is submit a POST request with the correct form values. This is pretty similar to WWW::Mechanize's submit_form. In fact, WWW::Mechanize is an LWP::UserAgent under the hood, and all we need is a request, so we can use it directly.
The :content_file option on the post method tells it to put the response into a file. It will do the right thing with the ZIP file and write it as binary automatically.
use strict;
use warnings;
use LWP::UserAgent;
use Web::Scraper;
use URI;

# build a Web::Scraper to find all files on the page
my $files = scraper {
    process 'form[name="yearlyDataForm"]', 'action' => '@action';
    process 'input[name="selectedFileName"]', 'files[]' => '@value';
};

# get the files and the form action
my $res = $files->scrape( URI->new('https://olms.dol-esa.gov/query/getYearlyData.do') );

# use LWP to download them one by one
my $ua = LWP::UserAgent->new;
foreach my $path ( @{ $res->{files} } ) {

    # the file will end up relative to the current working directory (.)
    my $filename = ( split '/', $path )[-1];

    # the submit is hardcoded, but that could be dynamic as well
    $ua->post(
        $res->{action},
        { selectedFileName => $path, submitButton => 'Download' },
        ':content_file' => $filename,    # this downloads the file
    );
}
Once you run this, you'll have all the files in the current working directory. It will take a moment and there is no output, but it works.
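If you want some feedback while it runs (and to notice failures), the loop body could capture the response; is_success and status_line are standard HTTP::Response methods:

    # optional: report success or failure per file
    my $response = $ua->post(
        $res->{action},
        { selectedFileName => $path, submitButton => 'Download' },
        ':content_file' => $filename,
    );
    print $response->is_success
        ? "saved $filename\n"
        : "failed $path: " . $response->status_line . "\n";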
You need to make sure to include the submit button in the form.
Since you wanted to learn how to do something like this, I've built it slightly dynamic. The form action gets scraped as well, so you could reuse this on similar forms that use the same form names (or make that an argument) and you wouldn't have to care about the form action. The same thing could also be done with the submit button, but you'd need to grab both the name and the value attributes; see the sketch below.
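Something like this could work (the selector assumes the submit button is an input of type submit inside the same form, so adjust it to the actual markup):

# scrape the submit button's name and value as well
my $files = scraper {
    process 'form[name="yearlyDataForm"]', 'action' => '@action';
    process 'form[name="yearlyDataForm"] input[type="submit"]',
        'button_name'  => '@name',
        'button_value' => '@value';
    process 'input[name="selectedFileName"]', 'files[]' => '@value';
};

# ... and later build the form data without hardcoding the button
$ua->post(
    $res->{action},
    {
        selectedFileName    => $path,
        $res->{button_name} => $res->{button_value},
    },
    ':content_file' => $filename,
);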
I'll repeat what ThisSuitIsBlackNot said in their comment, though: scraping a website always comes with the risk that it changes later! For a one-time thing that doesn't matter, but if you wanted to run this as a cron job once a year, it might already fail next year because they finally updated their website to be a bit more modern.