I am trying to download ~500 individual PDF submissions from this government webpage using rvest. Many of the links on the site point to PDFs stored on a separate AWS site (for example this document - see links from the 'Individual submissions' section onwards).
When I download the PDFs, I can't open them. I don't think I am actually downloading the linked PDFs from the AWS site. The links don't end in a .pdf file extension (e.g. https://getinvolved.mdba.gov.au/22346/widgets/139364/documents/47013), and I think I'm missing a step to download the actual PDFs.
#load packages
library(tidyverse)
library(rvest)
library(polite)
# scrape PDF links and names
mdba_NB_url <- "https://getinvolved.mdba.gov.au/bp-amendments-submissions/widgets/139364/documents"
session <- bow(mdba_NB_url, force = TRUE) # from the polite package: introduce the scraper and respect any explicit crawl limits
NB_page <- scrape(session) # scrape the page contents
download_links <- tibble( # download links
  link_names = NB_page %>% # link text, used later as file names
    html_nodes("a") %>%
    html_text(),
  link_urls = NB_page %>% # link targets
    html_nodes("a") %>%
    html_attr("href")
)
#filter PDFs
download_links_docs <- download_links %>% # limit links to the PDFs I need
  filter(str_detect(link_names, "No\\. [0-9]"))
download_links_docs_subset <- download_links_docs %>% # subset for a test download
  slice(1:10)
# Download PDFs
my_urls <- download_links_docs_subset$link_urls
save_here <- paste0(download_links_docs_subset$link_names, ".pdf")
mapply(download.file, my_urls, save_here, mode = "wb")
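One quick way to check what actually got saved (a minimal sketch, assuming the mapply() call above has run and save_here[1] points at one of the downloaded files): a real PDF starts with the bytes "%PDF", so anything else means I saved the wrong content.
first_bytes <- readBin(save_here[1], what = "raw", n = 4) # read the file signature
rawToChar(first_bytes) # "%PDF" for a real PDF; anything else (e.g. "<htm") means I got an HTML page instead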
The links are indeed redirected, but you can fix that fairly easily. If you watch the browser's network tab while the page downloads an actual file, you can see that you just need to append "/download" to your URL, e.g. so:
my_urls <- paste0(download_links_docs_subset$link_urls,"/download")
You can then download them using httr, as download.file seems to mess with the PDF encoding. Like so:
httr::GET(my_urls[1],
          httr::write_disk("test.pdf", overwrite = TRUE))