r, web-scraping, rvest, rselenium

Download pdf from javascript onclick attribute using R


I would like to download a PDF from this website using R. The problem is that you first have to click the "Maak een pdf" button on the website, which is driven by a JavaScript onclick attribute. I'm able to find the attribute, but I have no idea how to download the PDF file. Here is a screenshot of the element inspection:

[screenshot of the element inspection]

Here is the code I tried:

library(tidyverse)
library(rvest)

link = "https://puc.overheid.nl/natuurvergunningen/doc/PUC_746615_17/1/"

button <- link %>%
  read_html() %>%
  html_nodes(".download-als") %>%
  html_nodes("a") %>%
  html_attr("href") 
button
#> [1] "javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$cphContent$Main$ctl00$DocumentHeader$ctl00\", \"\", true, \"\", \"\", false, true))"

download.file(button, destfile = "Downloads/test.pdf")
#> Warning in download.file(button, destfile = "Downloads/test.pdf"): URL
#> javascript:WebForm_DoPostBackWithOptions(new
#> WebForm_PostBackOptions("ctl00$cphContent$Main$ctl00$DocumentHeader$ctl00", "",
#> true, "", "", false, true)): cannot open destfile 'Downloads/test.pdf', reason
#> 'No such file or directory'
#> Warning in download.file(button, destfile = "Downloads/test.pdf"): download had
#> nonzero exit status

Created on 2024-02-05 with reprex v2.0.2

I tried download.file(), but of course that doesn't work. It seems that we need RSelenium to perform a click on the button via a browser. I found this question: "How to web-scrape on-click information with R?", but I can't find a way to apply it to an "onclick" attribute. Does anyone know how to download a PDF file from an onclick attribute?


Solution

  • To get from the document page to the final download link, we need to play some request/response ping-pong that mimics what the page's JavaScript does: first submit a request to the backend, then wait for it to finish, and only then continue with the download.

    To recover that exact flow and the endpoint it uses (/PUC/Handlers/ManifestatieService.ashx), watch the Network tab of the browser's dev tools. Open it before clicking through the download process so that all relevant requests and responses are recorded; if there is too much traffic, the search and filter options are quite handy. [screenshot: Chrome dev tools]

    To implement a flow that's close enough, we'll mostly rely on httr2; rvest is only used to extract the JavaScript function parameters from the link's onclick attribute. In this particular case, though, we could probably also build the identifier (PUC_746615_17 plus the trailing version segment) and the kanaal value (natuurvergunningen) directly from the document URL.
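    The URL-based alternative mentioned above could look roughly like the sketch below. It assumes that every document URL follows the layout <kanaal>/doc/<document id>/<version>/, which has only been checked against this one link; path_parts and url_param are illustrative names, not part of the original code.

```r
link <- "https://puc.overheid.nl/natuurvergunningen/doc/PUC_746615_17/1/"

# Drop scheme and host, then split the remaining path into its segments.
# Assumed layout: <kanaal>/doc/<document id>/<version>/
path_parts <- strsplit(sub("^https?://[^/]+/", "", link), "/", fixed = TRUE)[[1]]

url_param <- list(
  kanaal     = path_parts[1],
  # identifier = document id and version joined by an underscore (assumption)
  identifier = paste(path_parts[3], path_parts[4], sep = "_")
)

url_param$kanaal
#> [1] "natuurvergunningen"
url_param$identifier
#> [1] "PUC_746615_17_1"
```

    These values match what the onclick attribute yields for this document; the full flow below still reads them from the page, which avoids relying on the URL layout.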

    library(tidyverse)
    library(rvest)
    library(httr2)
    
    # timestamp helper
    timestamp_ <- \() sprintf("%.0f", as.numeric(Sys.time()) * 1000)
    
    # get request parameters --------------------------------------------------
    link = "https://puc.overheid.nl/natuurvergunningen/doc/PUC_746615_17/1/"
    
    onclick <- 
      link %>%
      read_html() %>%
      html_elements(".download-als a") %>% 
      html_attr("onclick")
    
    (req_param <- str_extract_all(onclick, "(?<=')[^\\s']+(?=')")[[1]])
    #> [1] "PUC_746615_17_1"    "natuurvergunningen" "pdf"
    
    # submit request / get ticket ---------------------------------------------
    ticket <- 
      request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>% 
      req_url_query(actie      = "maakmanifestatie",
                    kanaal     = req_param[2],
                    identifier = req_param[1],
                    soort      = req_param[3],
                    `_`        = timestamp_()) %>% 
      req_perform() %>% 
      resp_body_json(check_type = FALSE)
    
    jsonlite::toJSON(ticket, auto_unbox = TRUE,  pretty = TRUE)
    #> {
    #>   "ticket": "70337706-d27d-463e-8b6b-8ca2ba47662d"
    #> }
    
    # submit ticket / get url -------------------------------------------------
    # the backend needs a few moments to finish our request
    Sys.sleep(2)
    pdf_url <- 
      request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>% 
      req_url_query(actie      = "haalstatus",
                    ticket     = ticket$ticket,
                    `_`        = timestamp_()) %>% 
      req_perform() %>% 
      resp_body_json(check_type = FALSE)
    
    jsonlite::toJSON(pdf_url, auto_unbox = TRUE,  pretty = TRUE)
    #> {
    #>   "result": {
    #>     "status": "available",
    #>     "url": "/puc-opendata/request-result/70337706-d27d-463e-8b6b-8ca2ba47662d/Verlenging%20van%20de%20looptijd%20van%20de%20vergunning%20Wet%20Natuurbescherming%20%28Wnb%29%20voor%20het%20project%20Afsluitdij.pdf",
    #>     "filename": "Verlenging van de looptijd van de vergunning Wet Natuurbescherming (Wnb) voor het project Afsluitdij.pdf"
    #>   }
    #> }
    
    # download pdf ------------------------------------------------------------
    request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>% 
      req_url_query(actie = "download",
                    identifier = req_param[1],
                    url = pdf_url$result$url,
                    filename = pdf_url$result$filename) %>% 
      req_perform(path = pdf_url$result$filename)
    #> <httr2_response>
    #> GET
    #> https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx?actie=download&identifier=PUC_746615_17_1&url=%2Fpuc-opendata%2Frequest-result%2F70337706-d27d-463e-8b6b-8ca2ba47662d%2FVerlenging%2520van%2520de%2520looptijd%2520van%2520de%2520vergunning%2520Wet%2520Natuurbescherming%2520%2528Wnb%2529%2520voor%2520het%2520project%2520Afsluitdij.pdf&filename=Verlenging%20van%20de%20looptijd%20van%20de%20vergunning%20Wet%20Natuurbescherming%20%28Wnb%29%20voor%20het%20project%20Afsluitdij.pdf
    #> Status: 200 OK
    #> Content-Type: application/pdf
    #> Body: On disk 'body'
    
    fs::file_info(pdf_url$result$filename)[1:3]
    #> # A tibble: 1 × 3
    #>   path                                                               type   size
    #>   <fs::path>                                                         <fct> <fs:>
    #> 1 …nning Wet Natuurbescherming (Wnb) voor het project Afsluitdij.pdf file   171K
    

    Created on 2024-02-05 with reprex v2.0.2
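    The fixed Sys.sleep(2) above assumes the backend is always done within two seconds. A somewhat more robust variant is to poll the haalstatus action until it reports that the file is ready. The following is a minimal sketch, not a tested replacement: poll_until(), fetch_status() and is_available() are hypothetical helper names, and it assumes that "available" is the only status value that signals completion (it is the only one seen in the output above).

```r
# Generic polling helper (hypothetical name): call `fetch()` until `is_ready()`
# returns TRUE for its result, or give up after `max_tries` attempts.
poll_until <- function(fetch, is_ready, max_tries = 10, pause = 1) {
  for (i in seq_len(max_tries)) {
    result <- fetch()
    if (isTRUE(is_ready(result))) return(result)
    # wait before the next attempt, but not after the last one
    if (i < max_tries) Sys.sleep(pause)
  }
  stop("not ready after ", max_tries, " attempts")
}

# The same haalstatus request as in the answer, wrapped in a function.
fetch_status <- function(ticket_id) {
  httr2::request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") |>
    httr2::req_url_query(
      actie  = "haalstatus",
      ticket = ticket_id,
      `_`    = sprintf("%.0f", as.numeric(Sys.time()) * 1000)
    ) |>
    httr2::req_perform() |>
    httr2::resp_body_json(check_type = FALSE)
}

# Assumption: "available" is the status that marks a finished PDF.
is_available <- function(x) identical(x$result$status, "available")

# Usage (not run here; would replace Sys.sleep(2) and the single status request):
# pdf_url <- poll_until(function() fetch_status(ticket$ticket), is_available)
```

    Passing the request as a function keeps the retry logic separate from the HTTP call, so the loop can be tried out with a dummy fetch function before pointing it at the real endpoint.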

    Alternative approaches would rely on tools that can handle JavaScript, for example chromote or RSelenium, and perhaps webdriver with PhantomJS.