Tags: selenium, perl, selenium-webdriver, selenium-chromedriver

Perl: How to scrape a website and download PDF files from it using Selenium::Chrome


So I'm studying website scraping using Selenium::Chrome in Perl. I'm just wondering how I can download all PDF files from the years 2017 to 2021 and store them in a folder, from this website: https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021 . So far this is what I've done:

use strict;
use warnings;
use Time::Piece;
use POSIX qw(strftime);
use Selenium::Chrome;
use File::Slurp;
use File::Copy qw(copy);
use File::Path qw(make_path remove_tree);
use LWP::Simple;


my $collection_name = "mre_zen_test3";
make_path("$collection_name");

#DECLARE SELENIUM DRIVER
my $driver = Selenium::Chrome->new;

#NAVIGATE TO SITE
print "trying to get toc_url\n";
$driver->navigate('https://www.fda.gov/drugs/warning-letters-and-notice-violation-letters-pharmaceutical-companies/untitled-letters-2021');
sleep(8);

#GET PAGE SOURCE
my $toc_content = $driver->get_page_source();
$toc_content =~ s/[^\x00-\x7f]//g;
write_file("toc.html", $toc_content);
print "writing toc.html\n";
sleep(5);
$toc_content = read_file("toc.html");

This script only downloads the entire content of the page. I hope someone here can help me and teach me. Thank you very much.


Solution

  • Here is some working code, to hopefully help you get going

    use warnings;
    use strict;
    use feature 'say';
    use Path::Tiny;  # only convenience
    
    use Selenium::Chrome;
    
    my $base_url = q(https://www.fda.gov/drugs/)
        . q(warning-letters-and-notice-violation-letters-pharmaceutical-companies/);
    
    my $show = 1;  # to see navigation. set to false for headless operation
        
    # A little demo of how to set some browser options
    my %chrome_capab = do {
        my @cfg = ($show) 
            ? ('window-position=960,10', 'window-size=950,1180')
            : 'headless';
        'extra_capabilities' => { 'goog:chromeOptions' => { args => [ @cfg ] } }
    };
    
    my $drv = Selenium::Chrome->new( %chrome_capab );
    
    my @years = 2017..2021;
    foreach my $year (@years) {
        my $url = $base_url . "untitled-letters-$year";
    
        $drv->get($url);
    
        say "\nPage title: ", $drv->get_title;
        sleep 1 if $show;
    
        my $elem = $drv->find_element(
            q{//li[contains(text(), 'PDF')]/a[contains(text(), 'Untitled Letter')]}
        );
        sleep 1 if $show;
        
    # Downloading the file is surprisingly not simple with Selenium itself
    # (see text). But since we found the link we can get its URL and then use
    # the Selenium-provided user agent (it's an LWP::UserAgent)
        my $href = $elem->get_attribute('href');
        say "pdf's url: $href";
    
        my $response = $drv->ua->get($href);
        die $response->status_line if not $response->is_success;
    
        say "Downloading 'Content-Type': ", $response->header('Content-Type'); 
        my $filename = "download_$year.pdf";
        say "Save as $filename";
        path($filename)->spew( $response->decoded_content );
    }
    

    This takes shortcuts, switches approaches, and sidesteps some issues which would need resolving for a fuller utility. It downloads one PDF from each page; to download all of them, we need to change the XPath expression used to locate them:

    my @hrefs = 
        map { $_->get_attribute('href') } 
        $drv->find_elements(
            # There's no ends-with(...) in XPath 1.0 (nor matches() with regex)
            q{//li[contains(text(), '(PDF)')]}
          . q{/a[starts-with(@href, '/media/') and contains(@href, '/download')]} 
        );
    

    Now loop over the links, forming filenames more carefully, and download each one as in the program above. I can fill in the gaps further if there's a need for that.
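
    One way to "form filenames more carefully" is to name each file after the media id embedded in the link. A sketch, assuming the links collected in @hrefs look like '/media/146371/download' (as on this site); pdf_filename is a hypothetical helper name, not part of the code above:

```perl
use warnings;
use strict;
use feature 'say';

# Hypothetical helper: derive a unique filename from an FDA media link,
# which has the shape '/media/146371/download'; falls back to a plain
# per-year name if no media id is found
sub pdf_filename {
    my ($href, $year) = @_;
    my ($id) = $href =~ m{/media/([0-9]+)/};
    return defined $id ? "letter_${year}_${id}.pdf" : "letter_${year}.pdf";
}

say pdf_filename('/media/146371/download', 2021);  # letter_2021_146371.pdf

# In the program above the download loop would then look like this
# (the links are relative, so resolve them against the site root first):
# for my $href (@hrefs) {
#     my $url = $href =~ m{^https?://} ? $href : 'https://www.fda.gov' . $href;
#     my $filename = pdf_filename($href, $year);
#     next if -e $filename;   # don't overwrite earlier downloads
#     my $response = $drv->ua->get($url);
#     warn $response->status_line and next if not $response->is_success;
#     path($filename)->spew( $response->decoded_content );
# }
```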

    The code puts the PDF files on disk, in its working directory. Please review that before running this, to make sure that nothing gets overwritten!

    See Selenium::Remote::Driver for starters.


    Note: there is no need for Selenium for this particular task; it's all straight-up HTTP requests, no JavaScript. So LWP::UserAgent or Mojo would do it just fine. But I take it that you want to learn how to use Selenium, since it is often needed and useful.
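
    For example, the link extraction can be done without a browser at all. A minimal sketch with Mojo::DOM (part of Mojolicious), using a CSS selector equivalent to the XPath above; the HTML fragment here is illustrative, shaped like the page's letter listings:

```perl
use warnings;
use strict;
use feature 'say';
use Mojo::DOM;   # ships with Mojolicious

# Illustrative fragment, shaped like the listings on the FDA page
my $html = <<'HTML';
<ul>
  <li>Untitled Letter (PDF) <a href="/media/146371/download">Untitled Letter</a></li>
  <li>Untitled Letter (PDF) <a href="/media/145519/download">Untitled Letter</a></li>
</ul>
HTML

# Same idea as the XPath above, expressed as a CSS attribute selector:
# anchors whose href starts with /media/ and ends with /download
my @hrefs = Mojo::DOM->new($html)
    ->find('li a[href^="/media/"][href$="/download"]')
    ->map(attr => 'href')->each;

say for @hrefs;   # prints the two /media/.../download links
```

    Fetching the real page is then one call, e.g. `Mojo::UserAgent->new->get($url)->result->dom`, after which the same `find` applies.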