xml  r  css-selectors  rvest  magrittr

rvest scrape multiple values per node


Take this example XML:

<body>
  <items>
    <item>
      <name>Peter</name>
    </item>
  </items>
  <items>
    <item>
      <name>Paul</name>
    </item>
    <item>
      <name>Claudia</name>
    </item>
  </items>
  <items/>
</body> 

Question: What is the easiest way to get the following result?

"Peter"   "Paul"   ""

So far I achieve this as follows:

require(rvest)
require(magrittr)
my_xml <- xml("<items><item><name>Peter</name></item></items><items><item><name>Paul</name></item><item><name>Claudia</name></item></items><items></items>")
items <- my_xml %>% xml_nodes("items") %>% xml_node("item")
sapply(items, function(x){
  if(is.null(x)){
    ""
  } else {
    x %>% xml_node("name") %>% xml_text()
  }
})

To me, this sapply construction seems like a misuse of either rvest or CSS selectors.


Solution

  • rvest really isn't needed since this is pure XML (and you end up using xml2 constructs anyway):

    library(xml2)
    
    doc <- read_xml("<body>
      <items>
        <item>
          <name>Peter</name>
        </item>
      </items>
      <items>
        <item>
          <name>Paul</name>
        </item>
        <item>
          <name>Claudia</name>
        </item>
      </items>
      <items/>
    </body>")
    
    
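    # Visit every <items> node, including the empty one
    sapply(xml_find_all(doc, "//items"), function(x) {
      # Name of the first <item> child; zero-length if there is no <item>
      val <- xml_text(xml_find_all(x, "./item[1]/name"))
      # Fall back to an empty string when nothing matched
      if (length(val) > 0) val[1] else ""
    })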
    
    ## [1] "Peter" "Paul"  ""     
    

    (sometimes XPath can be better than CSS)
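
As a possible variation on the answer above, the sapply can be dropped altogether. This is a minimal sketch, not part of the original answer: it relies on xml2's `xml_find_first()`, which returns one result per input node and a missing node where nothing matches, so `xml_text()` yields `NA` for the empty `<items/>`. The variable names `items` and `first_names` are made up for illustration.

```r
library(xml2)

doc <- read_xml("<body>
  <items><item><name>Peter</name></item></items>
  <items><item><name>Paul</name></item><item><name>Claudia</name></item></items>
  <items/>
</body>")

# One entry per <items> node, in document order
items <- xml_find_all(doc, "//items")

# xml_find_first() keeps the output the same length as the input;
# nodes without a match come back as missing, and xml_text() gives NA for them
first_names <- xml_text(xml_find_first(items, "./item[1]/name"))

# Replace the NA placeholders with empty strings to match the requested result
first_names[is.na(first_names)] <- ""

first_names
```

Keeping the `NA` values instead of converting them to `""` may be preferable if a missing name needs to be distinguished from a name that is present but empty.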