I am trying to remove the HTML tags from a corpus (docs) in R. The tags look like this:
tags : </P></TEXT> </BODY> <TRAILER> NYT-06-22-98 1759EDT &QL; </TRAILER> </DOC>
The code I am using:
tun<-function(x) gsub("<TRAILER>,<HTML>,<BODY>,<P>,<TEXT>,</P>,</TEXT>,
</BODY>,</HTML>", "", x)
docs <- tm_map(docs, tun)
But it is not able to remove the tags from the corpus. Why is that?
The pattern in your gsub call is treated as a single literal string (the commas inside the quotes do not separate alternatives), so it never matches anything in the text. If you want to remove all opening and closing HTML tags, you may instead try matching the pattern </?[^>]+>
and replacing it with an empty string:
x <- "tags : </P></TEXT> </BODY> <TRAILER> NYT-06-22-98 1759EDT &QL; </TRAILER> </DOC>"
gsub("</?[^>]+>", "", x)
[1] "tags : NYT-06-22-98 1759EDT &QL; "
As a major disclaimer, in general you should not use regex to parse HTML/XML content. In this particular case, if you just want to strip off all tags, gsub
may be a viable option.
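If you prefer the parser route, a hedged sketch with the xml2 package (the package choice is an assumption, not part of this answer) could look like this:

library(xml2)
# Parse the fragment as HTML and keep only the text nodes,
# letting the parser deal with the tags instead of a regex
x <- "tags : </P></TEXT> </BODY> <TRAILER> NYT-06-22-98 1759EDT &QL; </TRAILER> </DOC>"
xml_text(read_html(x))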