I have a set of URLs I need to crawl. I don't know in advance how many there are, so I just iterate through them.
Unfortunately, when a page does not exist, xml2::read_html throws an error that breaks my loop. And when I use RCurl::url.exists or httr::http_error to check whether the page exists first, I get no indication that the page is missing, because I get forwarded.
> url <- "https://zoek.officielebekendmakingen.nl/h-tk-20152016-1-6"
> xml2::read_html(url)
Error in open.connection(x, "rb") : HTTP error 404.
> url.exists(url)
[1] TRUE
> httr::http_error(url)
[1] FALSE
The URL should produce an error (which it does for xml2), but neither RCurl nor httr gives any indication that the page isn't there.
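To illustrate, a minimal sketch of the kind of loop that breaks (the urls vector here is just a hypothetical example):

library(xml2)
urls <- c("https://zoek.officielebekendmakingen.nl/h-tk-20152016-1-5",
          "https://zoek.officielebekendmakingen.nl/h-tk-20152016-1-6")  # hypothetical list; the second one 404s
pages <- list()
for (u in urls) {
  pages[[u]] <- read_html(u)  # the first HTTP error stops the whole loop
}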
I use the following options for RCurl:
options(RCurlOptions = list(verbose = FALSE,
                            followlocation = FALSE,
                            autoreferer = FALSE,
                            nosignal = TRUE))
Any idea how to move forward?
That's because this server returns 200 OK when you send a HEAD request (which is what url.exists() and http_error() do). When you send a GET request, you receive the 404 NOT FOUND.
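You can see the difference by comparing the status codes of the two request types directly (a quick check with httr, using the same url as above):

library(httr)
url <- "https://zoek.officielebekendmakingen.nl/h-tk-20152016-1-6"
status_code(HEAD(url))  # the HEAD request is answered with 200
#> 200
status_code(GET(url))   # the full GET request comes back 404
#> 404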
So you can do
httr::http_error(httr::GET(url))
#> TRUE
Even better, you can save the result of the GET request and process its content. That way you only need one request per URL in any case: if there is an error you skip that URL, otherwise you process the result (e.g. with xml2 or whatever you use).
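For instance, a sketch along those lines (crawl_one is just a made-up helper name; no tryCatch is needed because http_error() on the saved response handles the check):

library(httr)
library(xml2)

crawl_one <- function(url) {
  resp <- GET(url)                  # one request per URL
  if (http_error(resp)) {
    message("Skipping ", url, " (status ", status_code(resp), ")")
    return(NULL)
  }
  # parse the body of the successful response with xml2
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# urls is your vector of pages to crawl; drop the failed ones afterwards
results <- Filter(Negate(is.null), lapply(urls, crawl_one))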