Tags: r, web-scraping, rcurl, rvest, httr

R: Scrape multiple URLs using pipe-chain commands in rvest


I have a character vector of URLs. I want to download the content from each of these URLs.

To avoid writing out hundreds of commands, I wish to automate the process with lapply().

However, my command returns an error. Is it possible to scrape from multiple URLs this way?

Current Approaches

Long method: Works, but I wish to automate it

urls <- c("https://en.wikipedia.org/wiki/Belarus",
          "https://en.wikipedia.org/wiki/Russia",
          "https://en.wikipedia.org/wiki/England")

library(rvest)
library(httr) # required for user_agent command

uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring))
session2<-jump_to(session, "https://en.wikipedia.org/wiki/Belarus")
session3<-jump_to(session, "https://en.wikipedia.org/wiki/Russia")
writeBin(session2$response$content, "test1.txt") 
writeBin(session3$response$content, "test2.txt")

Automated/loop: Does not work.

urls <- c("https://en.wikipedia.org/wiki/Belarus",
          "https://en.wikipedia.org/wiki/Russia",
          "https://en.wikipedia.org/wiki/England")

library(rvest)
library(httr) # required for user_agent command

uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring))
lapply(urls, .%>% jump_to(session))
Error: is.session(x) is not TRUE
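
The error occurs because the functional sequence .%>% jump_to(session) pipes each URL in as the first argument of jump_to(), which must be a session — hence "is.session(x) is not TRUE". One fix, sketched below, is an anonymous function that keeps the session as the first argument; it returns a list of sessions, one per URL:

# Keep the session first; each URL becomes jump_to()'s second argument
sessions <- lapply(urls, function(u) jump_to(session, u))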

Summary

I wish to automate the two steps shown below, jump_to() and writeBin():

session2<-jump_to(session, "https://en.wikipedia.org/wiki/Belarus")
session3<-jump_to(session, "https://en.wikipedia.org/wiki/Russia")
writeBin(session2$response$content, "test1.txt") 
writeBin(session3$response$content, "test2.txt")

Solution

  • You can do something like this:

    urls <- c("https://en.wikipedia.org/wiki/Belarus",
              "https://en.wikipedia.org/wiki/Russia",
              "https://en.wikipedia.org/wiki/England")
    library(httr)  # supplies user_agent()
    library(rvest)
    uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring))
    
    # One output filename per URL, taken from the last path component,
    # e.g. ".../wiki/Belarus" becomes "Belarus.html"
    outfile <- sprintf("%s.html", sub(".*/", "", urls))
    
    # Follow a URL from the session and write the raw response bytes to disk
    jump_and_write <- function(x, url, out_file) {
      tmp <- jump_to(x, url)
      writeBin(tmp$response$content, out_file)
    }
    
    for (i in seq_along(urls)) {
      jump_and_write(session, urls[i], outfile[i])
    }