Until a recent website update, I was able to retrieve PDFs stored on the website of a German anti-lockdown movement called Demokratischer Widerstand.
Here is the URL as well as my R script. The script also shows which problems I am currently facing and what I need help with.
https://demokratischerwiderstand.de
library(tidyverse)
library(RSelenium)
library(rvest)
# Set up Firefox profile, specify download options, and navigate to the URL
url <- "https://demokratischerwiderstand.de/"
pdfprof <- makeFirefoxProfile(list(
"pdfjs.disabled"=TRUE,
"plugin.scan.plid.all"=FALSE,
"plugin.scan.Acrobat" = "99.0",
"browser.helperApps.neverAsk.saveToDisk"='application/pdf'))
mydriver <- rsDriver(browser=c("firefox"), port=4444L, extraCapabilities=pdfprof, chromever = NULL)
remDr <- mydriver[['client']]
remDr$navigate(url)
# Close popup window
remDr$findElement(using = 'css', value = '.mdi-close')$clickElement()
# Scroll down repeatedly, waiting for the page to load each time (to display all documents)
for(i in 1:50){
remDr$executeScript(paste("scroll(0,",i*500,");"))
Sys.sleep(2)
}
# SO FAR SO GOOD...
# PROBLEM 1: The code below doesn't give me the URLs to the documents anymore
remDr$getPageSource() %>%
unlist() %>%
read_html() %>%
html_nodes("a") %>%
html_attr("href") %>%
str_subset("\\.pdf") -> dw_pdfs
# PROBLEM 2: Even if I had the document URLs, the loop below would probably no longer work, because every time I manually click on a document, another window pops up (it would probably have to be closed each time using the CSS selector ".v-btn__content"; a rough sketch of what I have in mind follows after the loop).
for(i in seq_along(dw_pdfs)) {
  # Number the files in reverse order, so the last link becomes widerstand_1.pdf
  download.file(dw_pdfs[i],
                here::here("downloaded_pdfs",
                           paste0("widerstand_", length(dw_pdfs) - i + 1, ".pdf")),
                mode = "wb")
}
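For what it's worth, here is a rough sketch of the click-and-dismiss loop I have in mind. The selectors are guesses on my part (the href-based selector currently finds nothing, see PROBLEM 1), so this is not working code:
# Hypothetical sketch: click each PDF link and close the popup afterwards
pdf_links <- remDr$findElements(using = 'css', value = "a[href$='.pdf']")
for (lnk in pdf_links) {
  lnk$clickElement()   # open the document (this triggers the popup)
  Sys.sleep(2)         # give the popup time to appear
  remDr$findElement(using = 'css', value = '.v-btn__content')$clickElement()  # close it
}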
Sorry that I cannot make my question general enough to appeal to a broader audience; to me this seems to be a very site-specific issue. I'm happy to turn it into a more general question once I know what the problem is and how to fix it.
Thank you for any help!
I know the OP explicitly requested a solution in RSelenium.
Nonetheless, I offer a curl-based solution that calls the RESTful API which stores the particulars of the PDFs in a JSON document. Compared to Selenium web automation, this solution is likely faster for the OP and less resource-intensive for the website.
library(tidyverse)
library(jsonlite)
# Step 1. Read the JSON that holds a table with the PDFs' particulars
json <- fromJSON(txt = 'https://archiv.demokratischerwiderstand.de/api/newspapers', simplifyMatrix = TRUE)
# Step 2. Parse the JSON into a table. Look for the variables holding the URL and file name of each PDF
df_pdfs <-
tibble(id = json$data$id, type = json$data$type, json$data$attributes)
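To check that the table came out as expected, inspect it before downloading; the columns used in the final step, fileUrl and fileName, should be among the attributes:
# Quick sanity check of the parsed table
glimpse(df_pdfs)
df_pdfs %>% select(fileUrl, fileName) %>% head()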
# Step 3. Write a function to gently loop through the download process.
fx_pace_download_pdfs <- function(pdf_url, savefilename, folder = "", sleep = 10) {
print(paste0("processing: ", pdf_url))
savefilename <- paste0(folder, savefilename)
download.file(url = pdf_url, destfile = savefilename, mode = "wb")
Sys.sleep(sleep) # pause between downloads to go easy on the server
}
# Final step: Download the files by providing each URL and file name.
walk2(.x = df_pdfs$fileUrl, .y = df_pdfs$fileName, .f = fx_pace_download_pdfs)
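If you would rather collect the PDFs in a dedicated directory (a hypothetical downloaded_pdfs/ folder is used here), create it first and pass it through the folder argument, which walk2() forwards to the function:
# Optional: download into a separate folder instead of the working directory
dir.create("downloaded_pdfs", showWarnings = FALSE)
walk2(.x = df_pdfs$fileUrl, .y = df_pdfs$fileName,
      .f = fx_pace_download_pdfs, folder = "downloaded_pdfs/")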