I would like to download a PDF from this website using R. The problem is that you first have to click the "Maak een pdf" button on the website, and that button is driven by a JavaScript onclick attribute. I'm able to find the attribute, but I have no idea how to download the PDF file. Here is a screenshot of the element inspection:
Here is the code I tried:
library(tidyverse)
library(rvest)
link = "https://puc.overheid.nl/natuurvergunningen/doc/PUC_746615_17/1/"
button <- link %>%
  read_html() %>%
  html_nodes(".download-als") %>%
  html_nodes("a") %>%
  html_attr("href")
button
#> [1] "javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$cphContent$Main$ctl00$DocumentHeader$ctl00\", \"\", true, \"\", \"\", false, true))"
download.file(button, destfile = "Downloads/test.pdf")
#> Warning in download.file(button, destfile = "Downloads/test.pdf"): URL
#> javascript:WebForm_DoPostBackWithOptions(new
#> WebForm_PostBackOptions("ctl00$cphContent$Main$ctl00$DocumentHeader$ctl00", "",
#> true, "", "", false, true)): cannot open destfile 'Downloads/test.pdf', reason
#> 'No such file or directory'
#> Warning in download.file(button, destfile = "Downloads/test.pdf"): download had
#> nonzero exit status
Created on 2024-02-05 with reprex v2.0.2
I tried to download the file with download.file, but of course that doesn't work. It seems that we need to use RSelenium to create a click action on the button via a browser. I found this question: How to web-scrape on-click information with R? but I can't find a way to do this with an "onclick" attribute. So I was wondering if anyone knows how to download a PDF file from an onclick attribute?
To get to the final download link from the document page, we need to play some request/response ping-pong to mimic the JavaScript application: first we submit a request to the backend, then wait for it to finish, and only then continue with the download.
To recover the exact flow and the endpoint used (/PUC/Handlers/ManifestatieService.ashx), we should focus on the Network tab of the browser's dev tools (activate it before clicking through the download process to record all relevant requests/responses); if there's too much traffic, search and filter can be quite handy.
To implement a flow that's close enough, we'll mostly rely on httr2; rvest is only used to extract the JavaScript function parameters from the link's onclick attribute. Though in this particular case, we could probably extract the identifier (PUC_746615_17) and the kanaal value (natuurvergunningen) directly from the document URL too.
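As a rough illustration of that last remark, the parameters could be derived from the URL alone. This is only a sketch: `params_from_url()` is a hypothetical helper, and it assumes document URLs always follow the pattern `<host>/<kanaal>/doc/<id>/<version>/` and that the identifier sent to the backend (PUC_746615_17_1 in the output further down) is the document id and the version joined by an underscore. Neither assumption is confirmed by the site.

```r
# Hypothetical helper: derive request parameters from the document URL alone.
# Assumes the URL pattern <host>/<kanaal>/doc/<id>/<version>/ (not verified
# beyond the single example URL used here).
params_from_url <- function(url) {
  path  <- sub("^https?://[^/]+/", "", url)        # drop scheme and host
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]  # split path into segments
  parts <- parts[nzchar(parts)]                    # drop empty segments
  list(
    kanaal     = parts[1],
    # assumed: backend identifier = document id + "_" + version
    identifier = paste(parts[3], parts[4], sep = "_")
  )
}

str(params_from_url("https://puc.overheid.nl/natuurvergunningen/doc/PUC_746615_17/1/"))
```

Parsing the onclick attribute, as done below, is still the safer route, since it reads the values the page itself passes to its JavaScript function.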
library(tidyverse)
library(rvest)
library(httr2)
# timestamp helper
timestamp_ <- \() sprintf("%.0f", as.numeric(Sys.time()) * 1000)
# get request parameters --------------------------------------------------
link = "https://puc.overheid.nl/natuurvergunningen/doc/PUC_746615_17/1/"
onclick <-
  link %>%
  read_html() %>%
  html_elements(".download-als a") %>%
  html_attr("onclick")
(req_param <- str_extract_all(onclick, "(?<=')[^\\s']+(?=')")[[1]])
#> [1] "PUC_746615_17_1" "natuurvergunningen" "pdf"
# submit request / get ticket ---------------------------------------------
ticket <-
  request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>%
  req_url_query(actie = "maakmanifestatie",
                kanaal = req_param[2],
                identifier = req_param[1],
                soort = req_param[3],
                `_` = timestamp_()) %>%
  req_perform() %>%
  resp_body_json(check_type = FALSE)
jsonlite::toJSON(ticket, auto_unbox = TRUE, pretty = TRUE)
#> {
#> "ticket": "70337706-d27d-463e-8b6b-8ca2ba47662d"
#> }
# submit ticket / get url -------------------------------------------------
# it takes a few moments for the backend to finish our request
Sys.sleep(2)
pdf_url <-
  request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>%
  req_url_query(actie = "haalstatus",
                ticket = ticket$ticket,
                `_` = timestamp_()) %>%
  req_perform() %>%
  resp_body_json(check_type = FALSE)
jsonlite::toJSON(pdf_url, auto_unbox = TRUE, pretty = TRUE)
#> {
#> "result": {
#> "status": "available",
#> "url": "/puc-opendata/request-result/70337706-d27d-463e-8b6b-8ca2ba47662d/Verlenging%20van%20de%20looptijd%20van%20de%20vergunning%20Wet%20Natuurbescherming%20%28Wnb%29%20voor%20het%20project%20Afsluitdij.pdf",
#> "filename": "Verlenging van de looptijd van de vergunning Wet Natuurbescherming (Wnb) voor het project Afsluitdij.pdf"
#> }
#> }
# download pdf ------------------------------------------------------------
request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>%
  req_url_query(actie = "download",
                identifier = req_param[1],
                url = pdf_url$result$url,
                filename = pdf_url$result$filename) %>%
  req_perform(path = pdf_url$result$filename)
#> <httr2_response>
#> GET
#> https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx?actie=download&identifier=PUC_746615_17_1&url=%2Fpuc-opendata%2Frequest-result%2F70337706-d27d-463e-8b6b-8ca2ba47662d%2FVerlenging%2520van%2520de%2520looptijd%2520van%2520de%2520vergunning%2520Wet%2520Natuurbescherming%2520%2528Wnb%2529%2520voor%2520het%2520project%2520Afsluitdij.pdf&filename=Verlenging%20van%20de%20looptijd%20van%20de%20vergunning%20Wet%20Natuurbescherming%20%28Wnb%29%20voor%20het%20project%20Afsluitdij.pdf
#> Status: 200 OK
#> Content-Type: application/pdf
#> Body: On disk 'body'
fs::file_info(pdf_url$result$filename)[1:3]
#> # A tibble: 1 × 3
#> path type size
#> <fs::path> <fct> <fs:>
#> 1 …nning Wet Natuurbescherming (Wnb) voor het project Afsluitdij.pdf file 171K
Created on 2024-02-05 with reprex v2.0.2
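The fixed `Sys.sleep(2)` above is a guess at how long the backend needs. A slightly more robust variant might poll the `haalstatus` action until the status becomes "available". The sketch below is not part of the tested code: `poll_until_available()` and `fetch_puc_status()` are hypothetical helpers, and it assumes that any status other than "available" simply means "not ready yet" (the other status values the backend may return are not known).

```r
# Hypothetical helper: call `fetch_status()` repeatedly until the backend
# reports status "available", or give up after `max_tries` attempts.
# Assumption: any other status means "not ready yet".
poll_until_available <- function(fetch_status, max_tries = 10, wait = 1) {
  for (i in seq_len(max_tries)) {
    res <- fetch_status()
    if (identical(res$result$status, "available")) {
      return(res)
    }
    Sys.sleep(wait)  # pause before asking again
  }
  stop("PDF was not ready after ", max_tries, " attempts")
}

# The status request from the code above, wrapped in a function so it can be
# repeated. It is only defined here; nothing is sent until it is called.
fetch_puc_status <- function(ticket_id) {
  httr2::request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") |>
    httr2::req_url_query(actie = "haalstatus",
                         ticket = ticket_id,
                         `_` = sprintf("%.0f", as.numeric(Sys.time()) * 1000)) |>
    httr2::req_perform() |>
    httr2::resp_body_json(check_type = FALSE)
}
```

With these in place, `Sys.sleep(2)` and the single status request could be replaced by `pdf_url <- poll_until_available(\() fetch_puc_status(ticket$ticket))`, leaving the download step unchanged.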
Alternative approaches would be based on tools that can handle JavaScript, for example chromote or RSelenium, and perhaps webdriver with PhantomJS.