Tags: r, xml, linux, web-scraping, rcurl

getURL working slowly


I am extracting information from various databases, and to do that I need to keep track of how to convert between the different IDs used by each database.

library("RCurl")
library("XML")
# Fetch one TDR Targets drug page and extract the database source and
# the corresponding ID from the parsed HTML
transformDrugId <- function(x) {
  URLtoan <- getURL(x)
  PARSED  <- htmlParse(URLtoan)
  dsource <- xpathSApply(PARSED, "//*[@id='advancedform']/div[7]/fieldset/p/b[1]/text()", xmlValue)
  id      <- xpathSApply(PARSED, "//*[@id='advancedform']/div[7]/fieldset/p/a[1]/span/text()", xmlValue)
  return(c(dsource, id))
}

As an example, this is the time it takes on my PC running Linux and RStudio:

system.time(DBidstest<-sapply(urls[c(10001:10003)],transformDrugId))
 user  system elapsed 
0.132   0.000   3.675 

system.time(DBids7<-sapply(urls[c(601:700)],transformDrugId))
user  system elapsed 
3.980   0.124 549.233 

where urls contains the list of URLs of the TDR Targets database pages I check for IDs. The computation time becomes prohibitively long when I have to do this for all 300,000 drug IDs. As an example, here are the first few URLs:

head(urls)
[1] "http://tdrtargets.org/drugs/view?mol_id=608858"
[2] "http://tdrtargets.org/drugs/view?mol_id=608730"
[3] "http://tdrtargets.org/drugs/view?mol_id=549548"
[4] "http://tdrtargets.org/drugs/view?mol_id=581648"
[5] "http://tdrtargets.org/drugs/view?mol_id=5857"  
[6] "http://tdrtargets.org/drugs/view?mol_id=550626"

Any help in reducing the time needed to download and parse the HTML will be appreciated. I am open to suggestions that do not involve R.

I have since realized that using getURLAsynchronous for 10 or fewer URLs is sometimes faster, but calling it a second time becomes much slower:

system.time(test<-getURLAsynchronous(urls[c(1:10)]))
user  system elapsed 
0.128   0.016   1.414 
system.time(test<-getURLAsynchronous(urls[c(1:10)]))
user  system elapsed 
0.152   0.088 300.103
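For reference, batching the work through getURLAsynchronous looks roughly like this; it is only a sketch (fetchChunk is an ad-hoc helper and the chunk size of 10 is arbitrary), and it does not avoid the slowdown on repeated calls shown above.

library("RCurl")
library("XML")

# Fetch one chunk of URLs in a single asynchronous call, then parse the
# returned HTML strings with the same XPath expressions as above
fetchChunk <- function(chunk) {
  pages <- getURLAsynchronous(chunk)
  lapply(pages, function(html) {
    PARSED  <- htmlParse(html)
    dsource <- xpathSApply(PARSED, "//*[@id='advancedform']/div[7]/fieldset/p/b[1]/text()", xmlValue)
    id      <- xpathSApply(PARSED, "//*[@id='advancedform']/div[7]/fieldset/p/a[1]/span/text()", xmlValue)
    c(dsource, id)
  })
}

# Process the URLs ten at a time and collect the results in one list
chunks <- split(urls, ceiling(seq_along(urls) / 10))
DBids  <- do.call(c, lapply(chunks, fetchChunk))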

Solution

  • Downloading the pages directly from the shell turned out to be roughly ten times faster:

    echo $URLTEST | xargs -n 1 -P 7 wget -q

    where URLTEST is a list of URLs to download. The -n 1 option passes one URL per wget invocation and -P sets the number of parallel downloads; both were adjusted so that downloading 100 pages took

    real 0m13.498s
    user 0m0.196s
    sys  0m0.652s

    (A sketch of driving this from R is given below.)

    There must be some problem in how R interfaces with libcurl that makes it really slow by comparison, for both getURL() and download.file().
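For completeness, a minimal sketch of driving the same pipeline from R: write the URLs to a file, let xargs/wget fetch them in parallel, and parse the local copies afterwards. The file and directory names (url_list.txt, html_dump) are placeholders, and the system() call assumes a Unix shell with wget and xargs available.

library("XML")

# Write the URLs out and download them with 7 parallel wget processes,
# as in the shell command above; the downloaded files end up in html_dump/
writeLines(urls, "url_list.txt")
system("mkdir -p html_dump && cd html_dump && xargs -n 1 -P 7 wget -q < ../url_list.txt")

# Parse the local copies instead of fetching each page with getURL()
files <- list.files("html_dump", full.names = TRUE)
DBids <- sapply(files, function(f) {
  PARSED  <- htmlParse(f)
  dsource <- xpathSApply(PARSED, "//*[@id='advancedform']/div[7]/fieldset/p/b[1]/text()", xmlValue)
  id      <- xpathSApply(PARSED, "//*[@id='advancedform']/div[7]/fieldset/p/a[1]/span/text()", xmlValue)
  c(dsource, id)
})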