Tags: r, web-scraping, purrr, rvest, xml2

How do I put xml-nodesets (created with rvest) into a tibble using purrr's map-function in R?


I want to scrape a large number of websites. To do this, I first read in each website's HTML source and store the parsed documents in a list. Since I only need the page contents, I then extract the relevant nodes from each document as an xml_nodeset. To achieve this, I have written the following code:

# required packages
library(purrr)
library(dplyr)
library(xml2)
library(rvest)
    
# urls of the example sources
test_files <- c("https://en.wikipedia.org/wiki/Web_scraping", "https://en.wikipedia.org/wiki/Data_scraping")
        
# read in the HTML sources; read_html() returns one xml_document per URL
test <- test_files %>%
  map(~ xml2::read_html(.x, encoding = "UTF-8"))
        
# extract the selected nodes (contents)
test_tbl <- test %>%
  map(~ tibble(
    # scrape contents
    test_html = rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]')
  ))

Unfortunately, this produces the following error:

Error: All columns in a tibble must be vectors.
x Column `test_html` is a `xml_nodeset` object.

I think I understand the substance of this error, but I can't find a way around it. It's also a bit strange, because this code ran without problems in January and has now suddenly stopped working. I suspected package updates to be the reason, but installing older versions of xml2, rvest, or tibble didn't help. Scraping a single website without putting the result into a tibble doesn't produce an error either:

test <- read_html("https://en.wikipedia.org/wiki/Web_scraping", encoding = "UTF-8") %>%
  rvest::html_nodes(xpath = '//*[(@id = "toc")]')

Do you have any suggestions on how to solve this issue? Thank you very much!

EDIT: I removed %>% html_text from the following code, because with html_text the error does not occur:

test_tbl <- test %>%
  map(~ tibble(
    # scrape contents
    test_html = rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]')
  ))

The edited code, as shown here, does produce the error.


Solution

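  • Every tibble column has to be a vector, and an xml_nodeset is not one. Wrap the nodeset in list() so that it is stored as a single element of a list-column: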

    test %>%
      purrr::map(~tibble(
        # scrape contents
        test_html = list(rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]'))  
      ))
    
    #[[1]]
    # A tibble: 1 x 1
    #  test_html 
    #  <list>    
    #1 <xml_ndst>
    
    #[[2]]
    # A tibble: 1 x 1
    #  test_html 
    #  <list>    
    #1 <xml_ndst>
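
To work with the stored nodesets afterwards, the per-page tibbles can be bound together and the text pulled out of each list element. The following is only a minimal sketch, not part of the original answer: the inline HTML strings in `pages` and the names `docs`, `toc_tbls`, `toc_tbl`, `toc_nodes`, and `toc_text` are made up for illustration, and the pages are parsed from literal strings so the example runs without network access.

```r
library(purrr)
library(dplyr)
library(xml2)
library(rvest)

# Hypothetical inline pages standing in for the downloaded websites
pages <- c(
  '<html><body><div id="toc">Contents A</div><p>Body A</p></body></html>',
  '<html><body><div id="toc">Contents B</div><p>Body B</p></body></html>'
)

# Parse each page; read_html() also accepts a literal HTML string
docs <- map(pages, ~ xml2::read_html(.x))

# One tibble per page; list() stores the whole nodeset as one list-column element
toc_tbls <- map(docs, ~ tibble(
  toc_nodes = list(rvest::html_nodes(.x, xpath = '//*[@id = "toc"]'))
))

# Combine the per-page tibbles into a single tibble with one row per page
toc_tbl <- bind_rows(toc_tbls)

# Extract the text of each stored nodeset, collapsed to one string per page
toc_tbl$toc_text <- map_chr(
  toc_tbl$toc_nodes,
  ~ paste(rvest::html_text(.x), collapse = " ")
)
```

If only the text is needed, calling `html_text()` directly inside `tibble()` also avoids the error, since it returns a character vector. In that case a page with several matching nodes would yield several rows rather than one.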