Tags: r, web-scraping, rselenium

Using RSelenium close specific popup windows in order to download PDFs


Until a recent website update, I was able to retrieve PDFs stored on the web page of a German anti-lockdown movement called Demokratischer Widerstand.

Here's the URL as well as my R script. The comments in the script show the problems I am currently facing and what I need help with.

https://demokratischerwiderstand.de

library(tidyverse)
library(RSelenium)
library(rvest)

# Set up Firefox profile and specify download options, then navigate to the URL
url <- "https://demokratischerwiderstand.de/"

pdfprof <- makeFirefoxProfile(list(
  "pdfjs.disabled" = TRUE,
  "plugin.scan.plid.all" = FALSE,
  "plugin.scan.Acrobat" = "99.0",
  "browser.helperApps.neverAsk.saveToDisk" = "application/pdf"))

mydriver <- rsDriver(browser = "firefox", port = 4444L, extraCapabilities = pdfprof, chromever = NULL)
remDr <- mydriver[['client']]
remDr$navigate(url)

# Close popup window
remDr$findElement(using = 'css', value = '.mdi-close')$clickElement()
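
# (Hedged alternative) If the click above ever fails because the popup has not
# rendered yet, a short retry loop is more forgiving; the 10 x 1-second budget
# is an arbitrary assumption:
# for(k in 1:10) {
#   closed <- tryCatch({
#     remDr$findElement(using = 'css', value = '.mdi-close')$clickElement()
#     TRUE
#   }, error = function(e) FALSE)
#   if(closed) break
#   Sys.sleep(1)
# }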

# Scroll down 50 times, waiting for the page to load each time (to display all documents)
for(i in 1:50){      
  remDr$executeScript(paste("scroll(0,",i*500,");"))
  Sys.sleep(2)    
}
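
# (Hedged alternative) Instead of a fixed 50 iterations, you could keep
# scrolling until the page height stops growing; the 2-second pause mirrors
# the loop above:
# repeat {
#   h <- remDr$executeScript("return document.body.scrollHeight;")[[1]]
#   remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
#   Sys.sleep(2)
#   if(remDr$executeScript("return document.body.scrollHeight;")[[1]] <= h) break
# }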

# SO FAR SO GOOD...

# PROBLEM 1: The code below doesn't give me the URLs to the documents anymore
remDr$getPageSource() %>% 
  unlist() %>% 
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("\\.pdf") -> dw_pdfs
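
# (Hedged diagnostic) If dw_pdfs comes back empty, the links are probably no
# longer plain <a href="...pdf"> anchors after the site update. Dumping every
# href from the rendered DOM shows what the page exposes now:
hrefs <- unlist(remDr$executeScript(
  "return Array.from(document.querySelectorAll('a')).map(a => a.href);"))
head(hrefs, 20)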

# PROBLEM 2: Even if I had the document URLs, the loop below would probably no longer work, because every time I manually click on a document, another window pops up (it would probably need to be closed each time using CSS '.v-btn__content').
for(i in seq_along(dw_pdfs)) {
  # Number the files in reverse: the last link in dw_pdfs becomes widerstand_1
  download.file(dw_pdfs[i],
                here::here("downloaded_pdfs",
                           paste0("widerstand_", length(dw_pdfs) - i + 1, ".pdf")),
                mode = "wb")
}
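
If I did have direct file URLs, the popup might not even matter, since download.file() never opens the browser. In case the server rejects R's default request, I could try httr instead (the User-Agent string and the 2-second pause are guesses on my part):

library(httr)
for(i in seq_along(dw_pdfs)) {
  GET(dw_pdfs[i],
      user_agent("Mozilla/5.0"),   # some servers reject R's default agent
      write_disk(here::here("downloaded_pdfs",
                            paste0("widerstand_", length(dw_pdfs) - i + 1, ".pdf")),
                 overwrite = TRUE))
  Sys.sleep(2)                     # pace the requests
}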

Sorry that I cannot make my question general enough to appeal to a broader audience; to me this seems to be a very specific website issue. I'm happy to turn it into a more general question once I know what the issue is and how to fix it.

Thank you for any help!


Solution

  • I know the OP explicitly requested a solution in RSelenium.

    Nonetheless, I offer a curl-based solution that calls the RESTful API storing the particulars of the PDFs in a JSON document. Compared to Selenium web automation, this approach is likely faster for the OP and less resource-intensive for the website.

    library(tidyverse)
    library(jsonlite)
    
    # Step 1. Read the JSON that holds a table with the PDFs' particulars
    
    json <- fromJSON(txt = 'https://archiv.demokratischerwiderstand.de/api/newspapers', simplifyMatrix = TRUE)
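    
    # (Optional) Inspect the parsed structure to confirm the field names used
    # below; fileUrl and fileName were identified this way.
    str(json$data, max.level = 2)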
    
    # Step 2. Parse the JSON into a table, looking for the variables that hold each PDF's URL and file name
    df_pdfs <- 
      tibble(id = json$data$id, type = json$data$type, json$data$attributes)
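    
    # (Hedged) Sanity check: the download step below assumes these two columns
    # came through in the attributes.
    stopifnot(all(c("fileUrl", "fileName") %in% names(df_pdfs)))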
    
    # Step 3. Write a function to gently loop through the download process.
    fx_pace_download_pdfs <- function(pdf_url, savefilename, folder = "", sleep = 10) {
      
      print(paste0("processing: ", pdf_url))
      savefilename <- paste0(folder, savefilename)
      download.file(url = pdf_url, destfile = savefilename, mode = "wb")
      Sys.sleep(sleep)      
    }
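    
    # (Hedged variant) If individual files occasionally fail, wrapping the
    # pacing function in tryCatch lets the loop continue instead of aborting
    # on the first error:
    fx_safe_download_pdfs <- function(pdf_url, savefilename, folder = "", sleep = 10) {
      tryCatch(fx_pace_download_pdfs(pdf_url, savefilename, folder, sleep),
               error = function(e) message("failed: ", pdf_url))
    }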
    
    # Final step: Download your files by providing the URL and a file name.
    walk2(.x = df_pdfs$fileUrl, .y = df_pdfs$fileName, .f = fx_pace_download_pdfs)
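
    Since fx_pace_download_pdfs() already accepts a folder argument and walk2() forwards extra arguments to .f, you can route the files into a dedicated directory; the folder name below is just an illustrative choice:

    dir.create("downloaded_pdfs", showWarnings = FALSE)
    walk2(.x = df_pdfs$fileUrl, .y = df_pdfs$fileName,
          .f = fx_pace_download_pdfs, folder = "downloaded_pdfs/")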