Tags: r, for-loop, xpath, web-scraping, parallel.foreach

for loop to check if list of xpaths exists in R


I have a list of XPath expressions for html_nodes, and I want to check whether each of them exists on a page, returning 1 if it does and 0 if it does not.

I have tried an "if" check for each node manually, but since the nodes may change over time, I need to scrape all available nodes from the entire website and check each node on each page.

What I have

library(rvest)
library(foreach)

data <- foreach(i = urls) %dopar% {
  # note: this overwrites the xpath strings node1/node2 with the 0/1 results
  node1 <- read_html(i) %>% html_nodes(xpath = node1) %>% html_text()
  if (length(node1) > 0) {
    node1 <- 1
  } else {
    node1 <- 0
  }
  node2 <- read_html(i) %>% html_nodes(xpath = node2) %>% html_text()
  if (length(node2) > 0) {  # was length(node1), a typo
    node2 <- 1
  } else {
    node2 <- 0
  }
}

I need something similar to this (intuition):

data <- foreach(i = urls) %dopar% {
  for (j in nodes) {
    node <- read_html(i) %>% html_nodes(xpath = j) %>% html_text()
    if (length(node) > 0) {
      node <- 1
    } else {
      node <- 0
    }
  }
}

Solution

  • You were almost there; you do indeed need a loop over your nodes. sapply and friends are what you want here:

    data <- foreach(i = urls) %dopar% {
       page <- read_html(i)  # parse each page once instead of once per xpath
       sapply(nodes, function(j)
           as.integer(length(page %>% html_nodes(xpath = j) %>% html_text()) > 0))
    }