I would like to web scrape some PDF documents that are generated by an onclick button. In my previous question, @margusl gave a really great answer for downloading these PDFs. Unfortunately, it doesn't work for all URLs from the same site: for some of them it returns an error, and I have no clue why. Here is some reproducible code:
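library(rvest)    # read_html(), html_nodes(), html_attr(); also provides %>%
library(httr2)    # request(), req_url_query(), req_perform(), resp_body_json()
library(stringr)  # str_extract_all()
# timestamp helper
timestamp_ <- \() sprintf("%.0f", as.numeric(Sys.time()) * 1000)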
# Web scrape the main page
link = "https://puc.overheid.nl/rsj/wettelijkkader/pagina/G/-/gdlv/1/"
page = read_html(link)
# Get link of all sub pages
sub_pages <- page %>%
  html_nodes(".paging") %>%
  html_nodes("a") %>%
  html_attr("href")
# Change to right link and get every nth link (there are dups)
links_subpages <- paste0("https://puc.overheid.nl", sub_pages)
links <- links_subpages[1] %>%
  read_html() %>%
  html_nodes(".search_result") %>%
  html_nodes("a") %>%
  html_attr("href")
# correct url link
links <- paste0("https://puc.overheid.nl", links[seq(1, length(links), 2)])
onclick <- links[1] %>%
  read_html() %>%
  html_nodes(".download-als a") %>%
  html_attr("onclick")
(req_param <- str_extract_all(onclick, "(?<=')[^\\s']+(?=')")[[1]])
#> [1] "PUC_750817_21_1" "rsj" "pdf"
# submit request / get ticket ---------------------------------------------
ticket <-
  request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>%
  req_url_query(actie = "maakmanifestatie",
                kanaal = req_param[2],
                identifier = req_param[1],
                soort = req_param[3],
                `_` = timestamp_()) %>%
  req_perform() %>%
  resp_body_json(check_type = FALSE)
jsonlite::toJSON(ticket, auto_unbox = TRUE, pretty = TRUE)
#> {
#> "ticket": "3c4826c1-6949-4c0f-b476-d252c4713b37"
#> }
pdf_url <-
  request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>%
  req_url_query(actie = "haalstatus",
                ticket = ticket$ticket,
                `_` = timestamp_()) %>%
  req_perform() %>%
  resp_body_json(check_type = FALSE)
jsonlite::toJSON(pdf_url, auto_unbox = TRUE, pretty = TRUE)
#> {
#> "result": {
#> "status": "available",
#> "url": "/puc-opendata/request-result/3c4826c1-6949-4c0f-b476-d252c4713b37/RSJ%2023%2F31577%2FGA%2012%20december%202023%20beroep.pdf",
#> "filename": "RSJ 23/31577/GA 12 december 2023 beroep.pdf"
#> }
#> }
# download pdf ------------------------------------------------------------
request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>%
  req_url_query(actie = "download",
                identifier = req_param[1],
                url = pdf_url$result$url,
                filename = pdf_url$result$filename) %>%
  req_perform(path = pdf_url$result$filename)
#> Error:
#> ! Failed to open file RSJ 23/31577/GA 12 december 2023 beroep.pdf.
#> Backtrace:
#> ▆
#> 1. ├─... %>% req_perform(path = pdf_url$result$filename)
#> 2. └─httr2::req_perform(., path = pdf_url$result$filename)
#> 3. └─base::tryCatch(...)
#> 4. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 5. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 6. └─value[[3L]](cond)
Created on 2024-02-07 with reprex v2.0.2
So it returns the following error:
! Failed to open file RSJ 23/31577/GA 12 december 2023 beroep.pdf.
Run `rlang::last_trace()` to see where the error occurred.
It seems that this fails for some URLs but not for others. I tried changing the Sys.sleep delay, but that did not help either. Does anyone know why this happens for some requests?
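The problem is that the file you are writing to is named "RSJ 23/31577/GA 12 december 2023 beroep.pdf", which contains slashes and spaces. The slashes are read as directory separators, so R tries to open a file inside a folder "RSJ 23/31577" that doesn't exist. Not good (but not your fault: that's the name they're using!).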
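To see how the name is interpreted as a path, here is a small base R check. It is only an illustration: the `filename` variable is introduced here and simply holds the name copied from the error message.

```r
# The name returned by the server, copied from the error message
filename <- "RSJ 23/31577/GA 12 december 2023 beroep.pdf"

# R splits the string on "/" and treats the leading parts as folders
dirname(filename)
#> [1] "RSJ 23/31577"
basename(filename)
#> [1] "GA 12 december 2023 beroep.pdf"

# The download can only be written if that folder already exists
dir.exists(dirname(filename))
```

On a fresh working directory the last call should return FALSE, which matches the "Failed to open file" error.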
You can just replace those naughty characters though.
request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>%
  req_url_query(actie = "download",
                identifier = req_param[1],
                url = pdf_url$result$url,
                filename = pdf_url$result$filename) %>%
  req_perform(path = pdf_url$result$filename %>% gsub("[ /]", "-", .))
That seems to do the job:
Status: 200 OK
Content-Type: application/pdf
Body: On disk RSJ-23-31577-GA-12-december-2023-beroep.pdf (73534 bytes)
The file is saved to RSJ-23-31577-GA-12-december-2023-beroep.pdf
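If you are looping over many documents, it may be handy to wrap the clean-up in a small helper. The sketch below is one possible version: `safe_filename` is a hypothetical name, not part of httr2, and the set of characters it replaces (path separators, characters reserved on Windows, and whitespace) is my own assumption about what is worth guarding against, not an exhaustive list.

```r
# Hypothetical helper: turn a server-supplied name into something that is
# safe to use as a local file name.
safe_filename <- function(x, replacement = "-") {
  # Replace each run of slashes, backslashes, Windows-reserved characters
  # and whitespace with a single replacement character.
  gsub("[/\\\\:*?\"<>|[:space:]]+", replacement, x, perl = TRUE)
}

safe_filename("RSJ 23/31577/GA 12 december 2023 beroep.pdf")
#> [1] "RSJ-23-31577-GA-12-december-2023-beroep.pdf"
```

You would then pass `path = safe_filename(pdf_url$result$filename)` to `req_perform()`. Unlike the one-liner above, this collapses consecutive bad characters into a single dash, so the two can give slightly different names when a file name contains, say, two spaces in a row.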