
How can I stop url.exists()?


I have a list of PDF URLs, and I want to download these PDFs. However, not all of the URLs still exist, which is why I check them first with the RCurl function url.exists(). For some URLs, however, this function runs forever without returning a result, and I can't even stop it with withTimeout().

I wrapped url.exists() in withTimeout(), but the timeout has no effect:

library(RCurl)
library(R.utils)
url <- "http://www.shangri-la.com/uploadedFiles/corporate/about_us/csr_2011/Shangri-La%20Asia%202010%20Sustainability%20Report.pdf"
withTimeout(url.exists(url), timeout = 15, onTimeout = "warning")

The function still runs forever and the timeout is ignored; withTimeout() can only interrupt R-level code, not the underlying compiled libcurl call.

Thus my questions:

  • Is there a check that would weed out such a URL before it ever reaches url.exists()?
  • Or is there a way to keep url.exists() from running forever?

Other checks I tried (none of which weed out this URL) are:

library(httr)  # GET() and http_status() come from httr

try(length(getBinaryURL(url)) > 0) == TRUE
http_status(GET(url))
!class(try(GET(url))) == "try-error"
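
One way to keep url.exists() itself from blocking indefinitely is to set the timeout at the libcurl level rather than in R. A minimal sketch, assuming url.exists() forwards extra arguments to libcurl as curl options (connecttimeout caps the connection phase, timeout caps the whole request):

library(RCurl)

# Sketch (assumption): curl options passed through url.exists() limit how long
# the check may take, so the call returns FALSE instead of hanging.
url <- "http://www.shangri-la.com/uploadedFiles/corporate/about_us/csr_2011/Shangri-La%20Asia%202010%20Sustainability%20Report.pdf"
url.exists(url, connecttimeout = 10, timeout = 15)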

Solution

library(httr)

urls <- c(
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)

sapply(urls, url_success, config(followlocation = 0L), USE.NAMES = FALSE)

This function is analogous to file.exists() and determines whether a request for a specific URL responds without error. We make the request but ask the server not to return the body; we only process the header.
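
If a particular server still hangs the check, the same header-only idea can be combined with an explicit timeout in httr. The sketch below uses a hypothetical helper url_ok(): HEAD() asks for headers only, timeout() caps the wait, and any connection error or timeout is treated as the URL being unavailable.

library(httr)

# Sketch: header-only existence check with a hard timeout.
# url_ok() is a hypothetical name; adjust the status handling to your needs.
url_ok <- function(url, seconds = 15) {
  res <- tryCatch(
    HEAD(url, timeout(seconds)),   # request headers only, give up after `seconds`
    error = function(e) NULL       # timeouts and connection errors => not available
  )
  !is.null(res) && status_code(res) < 400
}

sapply(urls, url_ok, USE.NAMES = FALSE)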