Tags: r, web-scraping, rvest

How can I download PDFs from a website that stores them on AWS using rvest in R


Problem downloading PDFs from a website that stores them on AWS using rvest

I am trying to download ~500 individual PDF submissions from this government webpage using rvest. Many of the links on the site point to PDFs stored on a separate AWS site (for example this document - see links from the 'Individual submissions' section onwards).

When I download the PDFs, I can't open them, so I don't think I am actually downloading the linked PDFs from the AWS site. The links don't include a .pdf file extension (e.g. https://getinvolved.mdba.gov.au/22346/widgets/139364/documents/47013), and I think I'm missing a step needed to download the actual PDFs.

Here is a reproducible example:

    # load packages

    library(tidyverse)
    library(rvest)
    library(polite)
    
    # scrape PDF links and names

    mdba_NB_url <- "https://getinvolved.mdba.gov.au/bp-amendments-submissions/widgets/139364/documents"
    
    session <- bow(mdba_NB_url, force = TRUE) # from the polite package, identify and respect any explicit limits
    
    NB_page <- scrape(session) # scrape the page contents
    
    # collect the text and URL of every link on the page
    download_links <- tibble(
      link_names = NB_page %>%
        html_nodes("a") %>%
        html_text(),
      link_urls = NB_page %>%
        html_nodes("a") %>%
        html_attr("href")
    )
    # filter PDFs

    download_links_docs <- download_links %>% # limit links to the PDFs I need
      filter(str_detect(link_names, "No. [0-9]"))
    
    download_links_docs_subset <- download_links_docs %>% # subset for a test download
      slice(1:10)
    
    # Download PDFs

    my_urls <- download_links_docs_subset$link_urls
    save_here <- paste0(download_links_docs_subset$link_names, ".pdf")
    mapply(download.file, my_urls, save_here, mode = "wb")

Solution

  • The link is indeed redirected, but the fix is straightforward: if you inspect the browser's network traffic while an actual file downloads, you will see that you just need to append "/download" to each URL.

    For example:

    my_urls <- paste0(download_links_docs_subset$link_urls,"/download")
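
    To sanity-check the trick before bulk-downloading, you could inspect one response's content type first (this check is my own addition, not part of the original answer; httr follows the redirect automatically):

    resp <- httr::GET(my_urls[1])
    httr::http_type(resp) # should return "application/pdf"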
    

    You can then download them using httr, since download.file() seems to corrupt the PDF encoding.

    Like so:

    httr::GET(my_urls[1],
              httr::write_disk("test.pdf", overwrite = TRUE))
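
    To fetch all of the submissions rather than just the first, you can map the same call over every URL and file name. A minimal sketch, assuming my_urls and save_here from the question's code; the one-second pause is my own assumption, in the spirit of the polite package used above:

    library(httr)
    library(purrr)

    # download each URL to its matching file name;
    # write_disk() streams the response body straight to disk
    walk2(my_urls, save_here, function(url, dest) {
      GET(url, write_disk(dest, overwrite = TRUE))
      Sys.sleep(1) # assumption: brief pause between requests
    })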