Search code examples
rurlrcurlhttr

Check if URL exists in R


I want to loop over a list of URLs and I want to find out if these URLs exist or not.

RCurl provides the url.exists() function. However, the output doesn't seem to be right, because for example it says that amazon.com is not registered (it does so because the url.exists()-function doesn't return a value in the 200 range, in the case of amazon.com it's 405 ("method not allowed").

I also tried HEAD() and GET() provided by the httr package. But sometimes I get error messages here, for example for timeouts or because the URL is not registered.

Error messages look like this:

Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: Connection timed out after 10000 milliseconds

Error in curl::curl_fetch_memory(url, handle = handle) : Could not resolve host: afsadadssadasf.com

When I get such an error, the whole for loop stops. Is it possible to continue the for loop? I tried tryCatch(), but to my knowledge this can only help when the problem is in the dataframe itself.


Solution

  • pingr::ping() only uses ICMP which is blocked on sane organizational networks since attackers used ICMP as a way to exfiltrate data and communicate with command-and-control servers.

    pingr::ping_port() doesn't use the HTTP Host: header so the IP address may be responding but the target virtual web host may not be running on it and it definitely doesn't validate that the path exists at the target URL.

    You should clarify what you want to happen when there are only non-200:299 range HTTP status codes. The following makes an assumption.

    NOTE: You used Amazon as an example and I'm hoping that's the first site that just "came to mind" since it's unethical and a crime to scrape Amazon and I would appreciate my code not being brought into your universe if you are in fact just a brazen content thief. If you are stealing content, it's unlikely you'd be up front here about that, but on the outside chance you are both stealing and have a conscience, please let me know so I can delete this answer so at least other content thieves can't use it.

    Here's a self-contained function for checking URLs:

    #' @param x a single URL
    #' @param non_2xx_return_value what to do if the site exists but the
    #'        HTTP status code is not in the `2xx` range. Default is to return `FALSE`.
    #' @param quiet if not `FALSE`, then every time the `non_2xx_return_value` condition
    #'        arises a warning message will be displayed. Default is `FALSE`.
    #' @param ... other params (`timeout()` would be a good one) passed directly
    #'        to `httr::HEAD()` and/or `httr::GET()`
    url_exists <- function(x, non_2xx_return_value = FALSE, quiet = FALSE,...) {
    
      suppressPackageStartupMessages({
        require("httr", quietly = FALSE, warn.conflicts = FALSE)
      })
    
      # you don't need thse two functions if you're alread using `purrr`
      # but `purrr` is a heavyweight compiled pacakge that introduces
      # many other "tidyverse" dependencies and this doesnt.
    
      capture_error <- function(code, otherwise = NULL, quiet = TRUE) {
        tryCatch(
          list(result = code, error = NULL),
          error = function(e) {
            if (!quiet)
              message("Error: ", e$message)
    
            list(result = otherwise, error = e)
          },
          interrupt = function(e) {
            stop("Terminated by user", call. = FALSE)
          }
        )
      }
    
      safely <- function(.f, otherwise = NULL, quiet = TRUE) {
        function(...) capture_error(.f(...), otherwise, quiet)
      }
    
      sHEAD <- safely(httr::HEAD)
      sGET <- safely(httr::GET)
    
      # Try HEAD first since it's lightweight
      res <- sHEAD(x, ...)
    
      if (is.null(res$result) || 
          ((httr::status_code(res$result) %/% 200) != 1)) {
    
        res <- sGET(x, ...)
    
        if (is.null(res$result)) return(NA) # or whatever you want to return on "hard" errors
    
        if (((httr::status_code(res$result) %/% 200) != 1)) {
          if (!quiet) warning(sprintf("Requests for [%s] responded but without an HTTP status code in the 200-299 range", x))
          return(non_2xx_return_value)
        }
    
        return(TRUE)
    
      } else {
        return(TRUE)
      }
    
    }
    

    Give it a go:

    c(
      "http://content.thief/",
      "http://rud.is/this/path/does/not_exist",
      "https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=content+theft", 
      "https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t+be+a+content+thief&btnK=Google+Search&oq=don%27t+be+a+content+thief&gs_l=psy-ab.3...934.6243..7114...2.0..0.134.2747.26j6....2..0....1..gws-wiz.....0..0j35i39j0i131j0i20i264j0i131i20i264j0i22i30j0i22i10i30j33i22i29i30j33i160.mY7wCTYy-v0", 
      "https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-with-example-usage-in-r/"
    ) -> some_urls
    
    data.frame(
      exists = sapply(some_urls, url_exists, USE.NAMES = FALSE),
      some_urls,
      stringsAsFactors = FALSE
    ) %>% dplyr::tbl_df() %>% print()
    ##  A tibble: 5 x 2
    ##   exists some_urls                                                                           
    ##   <lgl>  <chr>                                                                               
    ## 1 NA     http://content.thief/                                                               
    ## 2 FALSE  http://rud.is/this/path/does/not_exist                                              
    ## 3 TRUE   https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=con…
    ## 4 TRUE   https://www.google.com/search?num=100&source=hp&ei=xGzMW5TZK6G8ggegv5_QAw&q=don%27t…
    ## 5 TRUE   https://rud.is/b/2018/10/10/geojson-version-of-cbc-quebec-ridings-hex-cartograms-wi…
    ## Warning message:
    ## In FUN(X[[i]], ...) :
    ##   Requests for [http://rud.is/this/path/does/not_exist] responded but without an HTTP status code in the 200-299 range