
RCurl, error: connection time out


I use the XML and RCurl packages in R to get data from a website. The script needs to scrape 6,000,000 pages, so I created a loop.

for (page in 1:6000000) {

  my_url <- paste('http://webpage.....')
  page1  <- getURL(my_url, encoding = "UTF-8")
  mydata <- htmlParse(page1, asText = TRUE, encoding = "UTF-8")
  title  <- xpathSApply(mydata, '//head/title', xmlValue, simplify = TRUE)

  .....
  .....
  .....
}

However, after a few loops I get the error message:

Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : connection time out

The problem is that I don't understand how the time-out works. Sometimes the process stops after 700 pages, other times after 1,000 or 1,200 pages; the point of failure is not stable. Once the connection has timed out, I cannot access the webpage from my laptop for 15 minutes. I thought of pausing the process for 15 minutes after every 1,000 pages scraped:

if (page %% 1000 == 0) Sys.sleep(901)

but nothing changed.

Any ideas what is going wrong and how to overcome this?


Solution

  • I solved it: adding a Sys.sleep(1) call to each iteration fixed the problem.
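A minimal sketch of the fixed loop, assuming the same placeholder URL as in the question (the `.....` stands in for the real page address and the per-page processing). The short pause on every iteration throttles the request rate, which is presumably enough to stay under the server's rate limit:

```r
library(RCurl)
library(XML)

for (page in 1:6000000) {

  my_url <- paste('http://webpage.....')  # placeholder URL, as in the question
  page1  <- getURL(my_url, encoding = "UTF-8")
  mydata <- htmlParse(page1, asText = TRUE, encoding = "UTF-8")
  title  <- xpathSApply(mydata, '//head/title', xmlValue)

  # ... process the page ...

  free(mydata)   # release the C-level document to keep memory use flat
  Sys.sleep(1)   # pause 1 second between requests to avoid the time-out
}
```

A fixed one-second sleep is the simplest form of rate limiting; wrapping the `getURL` call in `tryCatch` with a retry would additionally survive transient failures instead of aborting the whole run.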