r · http · web-scraping · http-status-codes

Create function to avoid url error in R for loop


I am looping through a .csv file of URLs to scrape a website (which permits scraping).

I was using a tryCatch function to avoid breaking my for loop. But I noticed it still stops on some URLs (when using download.file).

So I am now using an "is this a valid URL?" function taken from this post: Scrape with a loop and avoid 404 error

library(httr)

url_works <- function(url){
    tryCatch(
        identical(status_code(HEAD(url)), 200L),
        error = function(e){
            FALSE
        }
    )
}

But even with this function, looping only when its outcome is TRUE, my loop still breaks at some point on certain URLs and I get the following error:

> HTTP status was '500 Internal Server Error'

I would like to understand this error so that I can add this case to the URL function and ignore such URLs if they come up again.

Any thoughts? Thanks!


Solution

  • Your tryCatch syntax is wrong; I also changed the error handler to print the error:

    A generic tryCatch looks like:

    tryCatch({
        operation-you-want-to-try
    }, error = function(e) do-this-on-error
    )
    

    So for your code:

    url_works <- function(url){
        tryCatch({
            s1 <- status_code(HEAD(url))
            identical(s1, 200L)
        }, error = function(e){
            print(paste0(url, " ", as.character(e)))
            FALSE
        })
    }

    Note that the status check has to live inside the tryCatch block: if the request errors, s1 is never assigned, and the handler returns FALSE (after printing the error) instead of crashing on an undefined object.