Tags: r, text-mining, gsub, information-retrieval

Remove html tags from a corpus in R


I am trying to remove the HTML tags from a corpus (docs) in R:

tags : </P></TEXT>  </BODY> <TRAILER> NYT-06-22-98 1759EDT &QL; </TRAILER> </DOC> 

The code I am using:

tun<-function(x) gsub("<TRAILER>,<HTML>,<BODY>,<P>,<TEXT>,</P>,</TEXT>,
</BODY>,</HTML>", "", x)
docs <- tm_map(docs, tun)

But it's not able to remove the tags from the corpus. Why is that?


Solution

  • If you want to remove all opening and closing HTML tags, you may try finding the pattern </?[^>]+> and replacing it with an empty string:

    x <- "tags : </P></TEXT>  </BODY> <TRAILER> NYT-06-22-98 1759EDT &QL; </TRAILER> </DOC>"
    gsub("</?[^>]+>", "", x)
    
    
    [1] "tags :     NYT-06-22-98 1759EDT &QL;  "
    


    As a major disclaimer, in general you should not use regex to parse HTML/XML content. In this particular case, if you just want to strip off all tags, gsub may be a viable option.
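    To apply this to the tm corpus from the question, note that the original call fails because the whole comma-separated string is passed to gsub as a single literal pattern (the commas do not act as alternation), so it never matches anything in the text; the single pattern </?[^>]+> covers all the tags instead. In addition, recent versions of tm generally expect a custom cleaning function to be wrapped in content_transformer() when passed to tm_map. A minimal sketch along those lines (the one-document corpus below just stands in for the docs object from the question):

    library(tm)

    ## A small corpus standing in for the `docs` object from the question
    docs <- VCorpus(VectorSource(
      "</P></TEXT>  </BODY> <TRAILER> NYT-06-22-98 1759EDT &QL; </TRAILER> </DOC>"
    ))

    ## Wrap the cleaning function in content_transformer() so tm_map
    ## rewrites the text content of each document rather than replacing
    ## the document objects themselves
    strip_tags <- content_transformer(function(x) gsub("</?[^>]+>", "", x))
    docs <- tm_map(docs, strip_tags)

    as.character(docs[[1]])
    ## the markup is gone; only "NYT-06-22-98 1759EDT &QL;" (plus whitespace) remains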