
How to refresh or retry a specific web page using httr GET command?


I need to access the same web page with different "keys" to get specific content it provides.

I have a list of keys, x, and I use the GET command from the httr package to access the web page, then retrieve the information I need, y.

library(httr)
library(stringr)
library(XML)

for (i in 1:20) {
    # request the page for key x[i], giving up after 10 seconds
    h1 <- GET(paste0("http:....categories=&query=", x[i]), timeout(10))

    # parse the response body and pull the text of every link inside an <h3>
    par <- htmlParse(content(h1, as = "text"), asText = TRUE)
    y[i] <- xpathSApply(doc = par, path = "//h3/a", fun = xmlValue)
}

The problem is that the timeout is often reached, which disrupts the loop.

So I would like to refresh the web page or retry the GET command whenever the timeout is reached, because I suspect the problem is with the internet connection of the website I am trying to access.

As my code stands, a timeout breaks the loop entirely. I need to either ignore the error and move on to the next iteration, or retry accessing the website.


Solution

  • Look at purrr::safely(). You can wrap GET like so:

    safe_GET <- purrr::safely(GET)
    

    This removes the ugliness of tryCatch() by letting you do:

    resp <- safe_GET("http://example.com") # you can use all legal `GET` params
    

    And you can test resp$result for NULL. Put that into your retry loop and you're good to go.
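
    For example, a minimal retry wrapper might look like the sketch below; the name GET_with_retry and the tries/pause defaults are illustrative choices, not part of httr or purrr:

    GET_with_retry <- function(url, ..., tries = 3, pause = 2) {
      for (attempt in seq_len(tries)) {
        resp <- safe_GET(url, ...)  # the purrr::safely() wrapper from above
        if (!is.null(resp$result)) return(resp$result)  # success: hand back the response
        message("Attempt ", attempt, " failed: ", conditionMessage(resp$error))
        Sys.sleep(pause)  # brief pause before the next attempt
      }
      NULL  # all attempts failed
    }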

    You can see safe_GET() in action by doing:

    str(safe_GET("https://httpbin.org/delay/3", timeout(1)))
    

    which asks the httpbin service to wait 3s before responding while setting an explicit 1s timeout on the GET request. I wrapped the call in str() to show the result:

    List of 2
     $ result: NULL
     $ error :List of 2
      ..$ message: chr "Timeout was reached"
      ..$ call   : language curl::curl_fetch_memory(url, handle = handle)
      ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
    

    So, you can even check the message if you need to.
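
    Putting it together with the loop from the question, a sketch might look like this (it reuses the hypothetical GET_with_retry helper from above and assumes x, y, and the truncated URL exist as in the original code):

    for (i in 1:20) {
      h1 <- GET_with_retry(paste0("http:....categories=&query=", x[i]), timeout(10))
      if (is.null(h1)) next  # every retry failed; skip this key
      par <- htmlParse(content(h1, as = "text"), asText = TRUE)
      y[i] <- xpathSApply(doc = par, path = "//h3/a", fun = xmlValue)
    }

    If you only want to retry on timeouts specifically, you could test something like grepl("Timeout", conditionMessage(resp$error)) inside the wrapper before sleeping and trying again.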