Convert HTML character entity encoding in R


Is there a way in R to convert HTML Character Entity Encodings?

I would like to convert HTML character entities, for example &amp;amp; to &amp; or &amp;gt; to &gt;.

Perl has the HTML::Entities package, which can do that, but I couldn't find anything similar in R.

I also tried iconv() but couldn't get satisfactory results. There may also be a way using the XML package, but I haven't figured it out yet.


Solution

  • Update: this answer is outdated. Please check the answer below based on the new xml2 package.


    Try something along the lines of:

    # load XML package
    library(XML)

    # Convenience function to convert HTML entities:
    # parse the string as HTML and return the value of the first text node
    html2txt <- function(str) {
      xpathApply(htmlParse(str, asText = TRUE),
                 "//body//text()",
                 xmlValue)[[1]]
    }

    # HTML-encoded string
    ( x <- paste("i", "s", "n", "&", "a", "p", "o", "s", ";", "t", sep = "") )
    # [1] "isn&apos;t"

    # converted string
    html2txt(x)
    # [1] "isn't"
    

    UPDATE: Edited the html2txt() function so it applies to more situations
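
The update note above points to the xml2 package. As a rough sketch of what that approach might look like (assuming xml2 is installed; the helper name unescape_html is made up here for illustration and is not part of xml2), each string can be wrapped in a dummy element, parsed as HTML, and its text extracted:

```r
library(xml2)

# Hypothetical helper (name chosen for illustration, not an xml2 function):
# decode HTML entities by wrapping each string in a dummy <x> element,
# parsing it with read_html(), and extracting the text with xml_text().
# The wrapper ensures the input is treated as literal markup, not a file path.
unescape_html <- function(str) {
  vapply(str, function(s) {
    xml_text(read_html(paste0("<x>", s, "</x>")))
  }, character(1), USE.NAMES = FALSE)
}

unescape_html(c("a &amp; b", "1 &gt; 0"))
# [1] "a & b" "1 > 0"
```

The vapply() call is only there so the helper accepts a character vector; for a single string, the inner xml_text(read_html(...)) expression is enough.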