
Recognize forwarding when checking if a URL exists


I have some URLs I need to crawl. I do not know how many there are. That is why I just iterate through them.

Unfortunately, when a page does not exist, xml2::read_html throws an error that breaks my loop. When I use RCurl::url.exists or httr::http_error to check whether the page exists, I get no indication that it isn't there, because I get forwarded.

> url <- "https://zoek.officielebekendmakingen.nl/h-tk-20152016-1-6"
> xml2::read_html(url)
Error in open.connection(x, "rb") : HTTP error 404.
> url.exists(url)
[1] TRUE
> httr::http_error(url)
[1] FALSE

The URL should produce an error (which it does for xml2), but both RCurl and httr give no indication that the page isn't there.

I use the following options for RCurl:

options(RCurlOptions = list(verbose = FALSE,
                            followlocation = FALSE,
                            autoreferer = FALSE,
                            nosignal = TRUE))

Any idea how to move forward?


Solution

  • That's because this server returns 200 OK when you send a HEAD request (which is what url.exists() and http_error() do). When you send a GET request, you receive the 404 NOT FOUND.

    So you can do

    httr::http_error(httr::GET(url))
    #> TRUE
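
    To see the difference directly, you can compare the status codes the server returns for the two request types (a quick check; the codes below assume the server still behaves as described):

    httr::status_code(httr::HEAD(url))  # 200 -- so url.exists() and http_error() see nothing wrong
    httr::status_code(httr::GET(url))   # 404 -- the same URL fails on a full GET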
    

    Even better, you can save the result of the GET request and process its content. That way you only need a single request per URL either way: if there is an error you skip the URL, otherwise you process the result (e.g. with xml2 or whatever parser you use).
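
    A minimal sketch of that loop (assuming urls is your character vector of pages to crawl; the object names are only illustrative):

    library(httr)
    library(xml2)

    pages <- list()
    for (url in urls) {
      resp <- GET(url)                      # one GET request per URL
      if (http_error(resp)) {
        next                                # missing page (e.g. the 404 above): skip it
      }
      # reuse the body of the same response instead of requesting the page again
      pages[[url]] <- read_html(content(resp, as = "text", encoding = "UTF-8"))
    }

    Note that http_error() only flags responses with a status code of 400 or above, so a redirect that ends on a real page still counts as a success.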