r, rvest, xml2

R {xml_node} to plain text while preserving the tags?


I'd like to do exactly what xml2::xml_text() or rvest::html_text() do, but keep the tags rather than, for example, replacing <br> with \n. The goal is to scrape a web page, extract the nodes I want, and store their raw HTML in a variable, much as write_html() would store it in a file.

How can I do this?


Solution

  • Ironically, it turns out that as.character() works just fine.

    Therefore:

    library(rvest)
    html <- read_html("http://stackoverflow.com")
    
    res <- html %>%
             html_node("h1") %>%
             as.character()
    
    > res
    
    [1] "<h1 class=\"-title\">Learn, Share, Build</h1>"
    

    This is the desired output in my current use case.

    For comparison, if you need to strip the tags instead:

    res <- html %>%
             html_node("h1") %>%
             html_text()
    
    > res
    [1] "Learn, Share, Build"
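
    The same contrast can be reproduced offline. The following is only a minimal sketch: it uses xml2 directly (rvest builds on it), and the inline HTML string and the variable names are made up for illustration. It also assumes that as.character() on a set of several nodes returns one string per node.

    ```r
    library(xml2)

    # Hypothetical inline document, so the sketch runs without network access.
    doc <- read_html(
      "<html><body><h1 class=\"title\">Hello <em>world</em></h1><p>one</p><p>two</p></body></html>"
    )

    # Single node: as.character() keeps the markup, xml_text() drops it.
    h1 <- xml_find_first(doc, "//h1")
    h1_markup <- as.character(h1)
    h1_text <- xml_text(h1)

    # Several nodes: as.character() yields a character vector,
    # one serialized element per node.
    paras <- xml_find_all(doc, "//p")
    para_markup <- as.character(paras)

    print(h1_markup)
    print(h1_text)
    print(para_markup)
    ```

    Since the result is an ordinary character vector, it can be stored in a data frame column or written out with writeLines() like any other string.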