
httr2 request to download a PDF doesn't work


I would like to scrape some PDF documents that are generated by a button with an onclick handler. In my previous question, @margusl gave a really great answer showing how to download these PDFs. Unfortunately, it doesn't work for all URLs from the same site: for some of them it returns an error, and I have no clue why. Here is some reproducible code:

library(tidyverse)
library(rvest)
library(httr2)

# timestamp helper
timestamp_ <- \() sprintf("%.0f", as.numeric(Sys.time()) * 1000)

# Web scrape the main page
link = "https://puc.overheid.nl/rsj/wettelijkkader/pagina/G/-/gdlv/1/"

page = read_html(link)

# Get link of all sub pages
sub_pages <- page %>% 
  html_nodes(".paging") %>%
  html_nodes("a") %>%
  html_attr("href")

# Prepend the domain to get absolute URLs for the sub pages
links_subpages <- paste0("https://puc.overheid.nl", sub_pages)

links <- links_subpages[1] %>%
  read_html() %>%
  html_nodes(".search_result") %>%
  html_nodes("a") %>%
  html_attr("href")

# Build absolute URLs and keep every second link (there are duplicates)
links <- paste0("https://puc.overheid.nl", links[seq(1, length(links), 2)])

onclick <- links[1] %>%
  read_html() %>%
  html_nodes(".download-als a") %>%
  html_attr("onclick") 

(req_param <- str_extract_all(onclick, "(?<=')[^\\s']+(?=')")[[1]])
#> [1] "PUC_750817_21_1" "rsj"             "pdf"

# submit request / get ticket ---------------------------------------------
ticket <- 
  request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>% 
  req_url_query(actie      = "maakmanifestatie",
                kanaal     = req_param[2],
                identifier = req_param[1],
                soort      = req_param[3],
                `_`        = timestamp_()) %>% 
  req_perform() %>% 
  resp_body_json(check_type = FALSE)

jsonlite::toJSON(ticket, auto_unbox = TRUE,  pretty = TRUE)
#> {
#>   "ticket": "3c4826c1-6949-4c0f-b476-d252c4713b37"
#> }

Sys.sleep(5)
pdf_url <- 
  request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>% 
  req_url_query(actie      = "haalstatus",
                ticket     = ticket$ticket,
                `_`        = timestamp_()) %>% 
  req_perform() %>% 
  resp_body_json(check_type = FALSE)

jsonlite::toJSON(pdf_url, auto_unbox = TRUE,  pretty = TRUE)
#> {
#>   "result": {
#>     "status": "available",
#>     "url": "/puc-opendata/request-result/3c4826c1-6949-4c0f-b476-d252c4713b37/RSJ%2023%2F31577%2FGA%2012%20december%202023%20beroep.pdf",
#>     "filename": "RSJ 23/31577/GA 12 december 2023 beroep.pdf"
#>   }
#> }

# download pdf ------------------------------------------------------------
request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>% 
  req_url_query(actie = "download",
                identifier = req_param[1],
                url = pdf_url$result$url,
                filename = pdf_url$result$filename) %>% 
  req_perform(path = pdf_url$result$filename)
#> Error:
#> ! Failed to open file RSJ 23/31577/GA 12 december 2023 beroep.pdf.
#> Backtrace:
#>     ▆
#>  1. ├─... %>% req_perform(path = pdf_url$result$filename)
#>  2. └─httr2::req_perform(., path = pdf_url$result$filename)
#>  3.   └─base::tryCatch(...)
#>  4.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  5.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  6.         └─value[[3L]](cond)

Created on 2024-02-07 with reprex v2.0.2

So it returns the following error:

Error:
! Failed to open file RSJ 23/31577/GA 12 december 2023 beroep.pdf.
Run `rlang::last_trace()` to see where the error occurred.

The download works for some URLs but fails for others, and I have no idea why. I tried changing the Sys.sleep() duration, but that doesn't help either. Does anyone know why this happens for some requests?


Solution

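  • The problem is that the file you are writing to is named "RSJ 23/31577/GA 12 december 2023 beroep.pdf", which contains slashes and spaces. The slashes are the real culprit: they are treated as directory separators, so req_perform() tries to write into a directory "RSJ 23/31577/" that doesn't exist and fails to open the file. Not your fault, though: that's the name the site uses!

    You can simply replace those characters before saving.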

    request("https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx") %>% 
      req_url_query(actie = "download",
                    identifier = req_param[1],
                    url = pdf_url$result$url,
                    filename = pdf_url$result$filename) %>% 
      req_perform(path = pdf_url$result$filename %>% gsub("[ /]", "-", .))
    

    That seems to do the job:

    <httr2_response>
    GET
    https://puc.overheid.nl/PUC/Handlers/ManifestatieService.ashx?actie=download&identifier=PUC_750817_21_1&url=%2Fpuc-opendata%2Frequest-result%2Fa8898f6b-1846-4859-b642-3fc95a923069%2FRSJ%252023%252F31577%252FGA%252012%2520december%25202023%2520beroep.pdf&filename=RSJ%2023%2F31577%2FGA%2012%20december%202023%20beroep.pdf
    Status: 200 OK
    Content-Type: application/pdf
    Body: On disk RSJ-23-31577-GA-12-december-2023-beroep.pdf (73534 bytes)
    

    The file is saved to RSJ-23-31577-GA-12-december-2023-beroep.pdf.
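
If you are looping over many documents, it may be convenient to wrap the replacement in a small helper. The sketch below is only an illustration: safe_filename() is a hypothetical name, not part of httr2, and it assumes that slashes, backslashes and whitespace are the only characters that need replacing in the names this site returns.

```r
# Hypothetical helper (not part of httr2): make a server-supplied file name
# safe to use as a local path by replacing path separators and whitespace.
safe_filename <- function(x, replacement = "-") {
  # Each run of "/", "\" or whitespace becomes a single replacement character
  gsub("[/\\\\\\s]+", replacement, x, perl = TRUE)
}

safe_filename("RSJ 23/31577/GA 12 december 2023 beroep.pdf")
#> [1] "RSJ-23-31577-GA-12-december-2023-beroep.pdf"
```

The download step would then end in req_perform(path = safe_filename(pdf_url$result$filename)), while the unmodified name is still passed to the filename query parameter, as in the code above.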