Tags: r, text-mining, gsub, information-retrieval

Remove html tags from a corpus in R


I am trying to remove the HTML tags from a corpus (docs) in R:

tags : </P></TEXT>  </BODY> <TRAILER> NYT-06-22-98 1759EDT &QL; </TRAILER> </DOC> 

The code I am using:

tun<-function(x) gsub("<TRAILER>,<HTML>,<BODY>,<P>,<TEXT>,</P>,</TEXT>,
</BODY>,</HTML>", "", x)
docs <- tm_map(docs, tun)

But it's not able to remove the tags from the corpus. Why is that?


Solution

  • If you want to remove all opening and closing HTML tags, you may try finding the pattern </?[^>]+> and replacing it with an empty string:

    x <- "tags : </P></TEXT>  </BODY> <TRAILER> NYT-06-22-98 1759EDT &QL; </TRAILER> </DOC>"
    gsub("</?[^>]+>", "", x)
    
    
    [1] "tags :     NYT-06-22-98 1759EDT &QL;  "
    


    As a major disclaimer, in general you should not use regex to parse HTML/XML content. In this particular case, if you just want to strip off all tags, gsub may be a viable option.
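    To apply this to the tm corpus from the question, note that the original call fails because the whole comma-separated string is passed to gsub as a single literal pattern (the commas do not act as alternation), so it never matches anything in the text; the single pattern </?[^>]+> covers all the tags instead. In addition, recent versions of tm generally expect a custom cleaning function to be wrapped in content_transformer() when passed to tm_map. A minimal sketch along those lines (the one-document corpus below just stands in for the docs object from the question):

    library(tm)

    ## A small corpus standing in for the `docs` object from the question
    docs <- VCorpus(VectorSource(
      "</P></TEXT>  </BODY> <TRAILER> NYT-06-22-98 1759EDT &QL; </TRAILER> </DOC>"
    ))

    ## Wrap the cleaning function in content_transformer() so tm_map
    ## rewrites the text content of each document rather than replacing
    ## the document objects themselves
    strip_tags <- content_transformer(function(x) gsub("</?[^>]+>", "", x))
    docs <- tm_map(docs, strip_tags)

    as.character(docs[[1]])
    ## the markup is gone; only "NYT-06-22-98 1759EDT &QL;" (plus whitespace) remains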