
Extract more than one type of element whilst preserving order using rvest (or similar) in R?


I'm attempting to extract elements of two different types from an HTML document, whilst preserving their order.

It's straightforward to extract each element type separately (see the example), but I cannot work out how to extract both in one go and preserve the order in which they appear in the web page.

Minimal example

Here's some dummy HTML

dummy_html <- "<p>hi there</p>
<p>2nd para</p>
<div>unwanted stuff</div>
<span>something new</span>
<p>3rd para</p>
<span>extra stuff</span>
<div>more unwanted stuff</div>
<p>4th para</p>"

Suppose we wish to extract all the p elements and all the span elements, maintaining the order in which they appear:

# p elements on their own
library(rvest)
dummy_html %>% read_html %>% html_nodes("p")

{xml_nodeset (4)}
[1] <p>hi there</p>
[2] <p>2nd para</p>
[3] <p>3rd para</p>
[4] <p>4th para</p>

# span elements on their own
dummy_html %>% read_html %>% html_nodes("span")
{xml_nodeset (2)}
[1] <span>something new</span>
[2] <span>extra stuff</span>

But how can we extract all of either element? i.e. all the p elements and all the span elements together so that the desired output is:

{xml_nodeset (6)}
[1] <p>hi there</p>
[2] <p>2nd para</p>
[3] <span>something new</span>
[4] <p>3rd para</p>
[5] <span>extra stuff</span>
[6] <p>4th para</p>

Note that the order is preserved (i.e. the p and span elements are interleaved as they appear in the document).

What I've tried so far

I tried the obvious dummy_html %>% read_html %>% html_nodes("span|p") but it throws an error.


Solution

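  • You can do it with either CSS or XPath syntax. Your CSS just needed a comma (,) instead of a pipe (|): in CSS a comma separates alternative selectors, whereas | is the union operator only in XPath.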

    library(rvest)
    #> Loading required package: xml2

    dummy_html <- "<p>hi there</p>
    <p>2nd para</p>
    <div>unwanted stuff</div>
    <span>something new</span>
    <p>3rd para</p>
    <span>extra stuff</span>
    <div>more unwanted stuff</div>
    <p>4th para</p>"

    # With CSS
    dummy_html %>% read_html() %>% html_nodes("p,span")
    #> {xml_nodeset (6)}
    #> [1] <p>hi there</p>
    #> [2] <p>2nd para</p>
    #> [3] <span>something new</span>
    #> [4] <p>3rd para</p>
    #> [5] <span>extra stuff</span>
    #> [6] <p>4th para</p>

    # With XPath
    dummy_html %>% read_html() %>% html_nodes(xpath = "//span | //p")
    #> {xml_nodeset (6)}
    #> [1] <p>hi there</p>
    #> [2] <p>2nd para</p>
    #> [3] <span>something new</span>
    #> [4] <p>3rd para</p>
    #> [5] <span>extra stuff</span>
    #> [6] <p>4th para</p>


    Created on 2019-10-19 by the reprex package (v0.3.0)

    Thanks to QHarr for pointing out the (neater) CSS option!
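
If you also need to know which tag each result came from, a minimal sketch (not part of the original answer) is to pull the tag name and text from the combined node set into a data frame. The variable names nodes and result are only illustrative; html_name() and html_text() are standard rvest functions.

```r
library(rvest)

dummy_html <- "<p>hi there</p>
<p>2nd para</p>
<div>unwanted stuff</div>
<span>something new</span>
<p>3rd para</p>
<span>extra stuff</span>
<div>more unwanted stuff</div>
<p>4th para</p>"

# Select both element types at once; nodes come back in document order
nodes <- dummy_html %>% read_html() %>% html_nodes("p, span")

# One row per matched node: its tag name and its text content
# ('nodes' and 'result' are illustrative names, not from the original post)
result <- data.frame(
  tag  = html_name(nodes),
  text = html_text(nodes),
  stringsAsFactors = FALSE
)

result
```

Each row keeps the position the element had in the page, so the tag column can be used afterwards to treat the p and span rows differently without losing the ordering.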