r, web-scraping, rcrawler

Rcrawler scrape does not yield pages


I'm using Rcrawler to extract the infobox of Wikipedia pages. I have a list of musicians and I'd like to extract their name, DOB, date of death, instruments, labels, etc. Then I'd like to create a dataframe with all artists in the list as rows and the extracted data as columns/vectors.

The code below throws no errors but I don't get any results either. The xpath used in the code is effective when I use rvest on its own.

What is wrong with my code?

library(Rcrawler)
jazzlist<-c("Art Pepper","Horace Silver","Art Blakey","Philly Joe Jones")

Rcrawler(Website = "http://en.wikipedia.org/wiki/Special:Search/", no_cores = 4, no_conn = 4, 
     KeywordsFilter = jazzlist,
     ExtractXpathPat = c("//th","//tr[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//td",
                         "//tr[(((count(preceding-sibling::*) + 1) = 6) and parent::*)]//td"),
     PatternsNames = c("artist", "dob", "dod"), 
     ManyPerPattern = TRUE, MaxDepth=1 )

Solution

  • I could be wrong, but I suspect the Rcrawler package works differently from how you think it does. You may be confusing scraping with crawling.

    Rcrawler simply starts from a given page and crawls any link out from that page. You can narrow down the paths using URL filters, or Keyword Filters as you have done, but it will still need to reach those pages via a crawling process. It doesn't run a search.

    The fact you've started from a Wikipedia search page suggests you might be expecting it to run searches on the terms you've specified in jazzlist, but it won't do this. It will simply follow all links out from the Wikipedia search page, e.g. 'Main Pages', 'Content', 'Featured Content' from the left sidebar, and it may or may not eventually hit upon one of the terms you've used, in which case it will scrape data according to your xpath parameters.

    The terms you've specified are going to be very rare, so while it will probably find them eventually via article cross-links from, say, 'Featured Pages', it will take an extremely long time.

    What I think you want instead is not to use Rcrawler at all, but to call rvest functions from within a loop over your search terms. You just need to append each term to the search URL you mentioned, replacing spaces with underscores:

    library(rvest)
    target_pages = paste0('https://en.wikipedia.org/wiki/Special:Search/', gsub(" ", "_", jazzlist))
    
    for (url in target_pages){
        webpage = read_html(url)
        # do whatever else you want here with rvest functions 
    }
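
    For the specific fields in the question (artist name, date of birth, date of death), one way the loop body could look is sketched below. The infobox xpath, the "Born"/"Died" row labels, and the results/artists_df names are illustrative assumptions about Wikipedia's markup, not tested selectors:

    library(rvest)
    
    target_pages <- paste0('https://en.wikipedia.org/wiki/Special:Search/', gsub(" ", "_", jazzlist))
    results <- list()
    
    for (url in target_pages){
        webpage <- read_html(url)
        # The page heading doubles as the artist name
        artist <- webpage %>% html_node("h1") %>% html_text()
        # Biography infoboxes label their rows with <th> cells such as "Born" and "Died";
        # the value sits in the sibling <td> (an assumption about the infobox markup)
        dob <- webpage %>%
            html_node(xpath = '//table[contains(@class, "infobox")]//th[normalize-space()="Born"]/following-sibling::td') %>%
            html_text()
        dod <- webpage %>%
            html_node(xpath = '//table[contains(@class, "infobox")]//th[normalize-space()="Died"]/following-sibling::td') %>%
            html_text()
        results[[url]] <- data.frame(artist = artist, dob = dob, dod = dod,
                                     stringsAsFactors = FALSE)
    }
    
    artists_df <- do.call(rbind, results)   # one row per artist, columns as requested

    If a label is missing on a page, html_node() returns a missing node and html_text() gives NA for that field, so the final rbind still works.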
    

    Edit: Added solution below with OP's exact code for his specific case, as per his comment

    library(rvest)
    target_pages = paste0('https://en.wikipedia.org/wiki/Special:Search/', gsub(" ", "_", jazzlist))
    
    data = data.frame()   # initialise an empty frame to collect results across pages
    for (url in target_pages){
        webpage = read_html(url)
        info <- webpage %>%
            html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "plainlist", " " ))]') %>%
            html_text()
        temp <- data.frame(info, stringsAsFactors = FALSE)
        data <- rbind(data, temp)
    }
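
    Note that each element of info is the raw text of one 'plainlist' block, so data ends up as a long frame of text blobs with one row per matched block per page; pulling out specific fields (dates, instruments, labels) still needs a parsing step, or the more targeted xpaths sketched above.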